Scale many datasets: Difference between revisions
(copy scripts into wiki, and explain a bit) |
(No difference)
|
Revision as of 20:50, 29 October 2024
Graeme Winter measured 36 datasets of cubic insulin, 12 of which were from pig, cow and human, respectively. These are partial datasets, each consisting of 100 frames of 0.2° oscillation range. The goal of data processing is to process all of them with minimal effort and good results. In addition, the question is if the datasets can be correctly assigned to the organisms (or rather if they can be correctly assigned to 3 different groups), based only on the processed data. For the latter task, the result can be compared with the dataset names.
Step 1: process a single dataset to get an idea of spacegroup, cell and resolution. This was done for the (randomly chosen) X1 dataset. Turns out that it is cubic insulin, spacegroup I213 with cell about 78 78 78 90 90 90. The XDS_ASCII.HKL of that was saved as x1_as_reference.hkl to serve as REFERENCE_DATA_SET for all other datasets, to ensure consistent indexing, because otherwise the possibility of re-indexing would have to be considered. This can be done by xscale_isocluster but it is easier to use a REFERENCE_DATA_SET if one exists.
Step 2: create 36 directories, named according to the unique parts of the filenames. Needs a bit of bash scripting.
#!/bin/bash # find the names distinguishing the datasets, and create directory for each # we just list all frames numbered 100, and remove the constant parts of the filename for i in ../cows-pigs-people/*_00100.cbf.gz do # the next line echoes the filename, and removes the directory part and '_1_00100.cbf.gz' # the result of that is used for "mkdir" mkdir $(echo $i|sed -e 's$../cows-pigs-people/$$' -e 's/_1_00100.cbf.gz//') done
Step 3: process the data by having generate_XDS.INP create XDS.INP for each of them. Then modify that XDS.INP and run xds_par.
#!/bin/bash for i in C* X* P* do echo $i cd $i # start from a default XDS.INP: generate_XDS.INP "../../cows-pigs-people/${i}_1_00???.cbf.gz" >&generate_XDS.INP.log # modifications of XDS.INP # make it read the cbf.gz files a little faster: echo LIB=/usr/local/lib64/xds-zcbf.so >>XDS.INP # if commented in, runs only JOB=CORRECT: # sed -i 's/XYCORR INIT COLSPOT IDXREF DEFPIX INTEGRATE//' XDS.INP # use all frames for COLSPOT instead of only the first half: sed -i 's/SPOT_RANGE=1 50/SPOT_RANGE=1 100/' XDS.INP # use high-resol cutoff of 1.2A according to some preliminary processing: sed -i 's/RESOLUTION_RANGE=50 0/RESOLUTION_RANGE=50 1.2/' XDS.INP # use a reference data set to get consistent indexing: sed -i 's$! REFERENCE_DATA_SET=xxx/XDS_ASCII.HKL $ REFERENCE_DATA_SET= ../x1_as_reference.hkl $' XDS.INP # if using a reference data set, spacegroup and cell constants must be given sed -i 's/SPACE_GROUP_NUMBER=0/SPACE_GROUP_NUMBER= 197/' XDS.INP sed -i 's/UNIT_CELL_CONSTANTS= 70 80 90 90 90 90/UNIT_CELL_CONSTANTS= 78 78 78 90 90 90/' XDS.INP # run xds and write its terminal output to logfile xds_par >&xds.log cd .. done
Running this on a desktop Linux machine with 16 cores takes about 9 minutes.
Step 4: scale and merge with xscale
mkdir xscale cd xscale # create XSCALE.INP. The unit cell parameters were obtained using cellparm (XDS package) from the 36 XDS_ASCII.HKL files echo UNIT_CELL_CONSTANTS=77.864 77.864 77.864 90 90 90 >XSCALE.INP echo SPACE_GROUP_NUMBER=199 >>XSCALE.INP echo OUTPUT_FILE=XSCALE.HKL >XSCALE.INP echo PRINT_CORRELATIONS=FALSE >>XSCALE.INP echo SAVE_CORRECTION_IMAGES=FALSE >> XSCALE.INP ls -C1 ../*/XDS_ASCII.HKL | sed s/^/INPUT_FILE=/ >>XSCALE.INP # run xscale xscale_par
Step 5: analyze resulting XSCALE.HKL to find 3 groups of datasets, representing pig, cow and human insulin, respectively (but of course it is not clear which group is which organism; one could look at the 1.2A electron density maps and compare with sequences).
xscale_isocluster XSCALE.HKL more iso.pdb CRYST1 100.000 100.000 100.000 90.00 90.00 90.00 P 1 HETATM 1 O HOH A 1 99.105 0.039 10.644 1.00100.00 HETATM 2 O HOH A 2 99.176 -0.316 10.098 1.00100.00 HETATM 3 O HOH A 3 98.168 6.104 14.441 1.00100.00 HETATM 4 O HOH A 4 99.078 -0.596 10.852 1.00100.00 HETATM 5 O HOH A 5 99.091 3.549 11.038 1.00100.00 HETATM 6 O HOH A 6 99.247 -1.208 11.008 1.00100.00 HETATM 7 O HOH A 7 99.139 -0.245 11.679 1.00100.00 HETATM 8 O HOH A 8 98.955 3.779 11.552 1.00100.00 HETATM 9 O HOH A 9 99.238 0.969 11.094 1.00100.00 HETATM 10 O HOH A 10 99.080 3.796 11.786 1.00100.00 HETATM 11 O HOH A 11 98.992 5.446 10.487 1.00100.00 HETATM 12 O HOH A 12 99.337 -2.284 9.563 1.00100.00 HETATM 13 O HOH A 13 99.018 9.704 -7.697 1.00100.00 HETATM 14 O HOH A 14 98.667 13.018 -7.167 1.00100.00 HETATM 15 O HOH A 15 98.120 15.082 -8.651 1.00100.00 HETATM 16 O HOH A 16 99.008 9.701 -6.109 1.00100.00 HETATM 17 O HOH A 17 97.573 17.831 -8.148 1.00100.00 HETATM 18 O HOH A 18 97.031 21.266 -4.719 1.00100.00 HETATM 19 O HOH A 19 97.278 18.496 -5.573 1.00100.00 HETATM 20 O HOH A 20 94.039 6.994 -8.273 1.00100.00 HETATM 21 O HOH A 21 98.383 15.300 -6.199 1.00100.00 HETATM 22 O HOH A 22 98.969 8.456 -8.015 1.00100.00 HETATM 23 O HOH A 23 99.213 5.332 -7.997 1.00100.00 HETATM 24 O HOH A 24 99.226 5.837 -8.187 1.00100.00 HETATM 25 O HOH A 25 98.881 -13.077 -3.362 1.00100.00 HETATM 26 O HOH A 26 98.566 -15.863 -3.441 1.00100.00 HETATM 27 O HOH A 27 98.337 -13.694 -8.018 1.00100.00 HETATM 28 O HOH A 28 98.045 -15.069 -8.641 1.00100.00 HETATM 29 O HOH A 29 98.669 -15.425 -3.796 1.00100.00 HETATM 30 O HOH A 30 98.956 -12.814 -3.879 1.00100.00 HETATM 31 O HOH A 31 98.725 -14.816 -1.764 1.00100.00 HETATM 32 O HOH A 32 98.603 -14.161 -5.396 1.00100.00 HETATM 33 O HOH A 33 99.002 -12.392 -3.869 1.00100.00 HETATM 34 O HOH A 34 99.085 -11.848 -3.153 1.00100.00 HETATM 35 O HOH A 35 98.473 -14.979 -3.730 1.00100.00 HETATM 36 O HOH A 36 98.518 -13.849 -4.599 1.00100.00 END
This pseudo-PDB file can be visualized in coot or so and shows three groups, consisting of datasets 1-12, 13-24 and 25-36, respectively.