Scale many datasets

Graeme Winter measured 36 datasets of cubic insulin, 12 each from pig, cow and human insulin. These are partial datasets, each consisting of 100 frames of 0.2° oscillation range. The goal is to process all of them with minimal effort and good results. In addition, the question is whether the datasets can be correctly assigned to the organisms (or rather, to three different groups) based only on the processed data. For the latter task, the result can be compared with the dataset names.

=== Step 1: download the data from Zenodo ===

According to [https://github.com/graeme-winter/dials_tutorials/blob/main/ccp4-aps-2024/COWS_PIGS_PEOPLE.md Graeme's writeup], the [https://zenodo.org/records/13890874 data] (~6GB) were taken on i24 at Diamond Light Source as part of routine commissioning work, with a number of small rotation data sets recorded from different crystals. Crystals of insulin from cows, pigs and people were prepared (as described in the Zenodo deposition: bovine, porcine and human insulin, of course all grown in E. coli anyway).

<pre>
mkdir data
cd data
for set in CIX1_1 CIX2_1 CIX3_1 CIX5_1 CIX6_1 CIX8_1 CIX9_1 CIX10_1 CIX11_1 CIX12_1 CIX14_1 CIX15_1 PIX5_1 PIX6_1 PIX7_1 PIX8_1 PIX9_1 PIX10_1 PIX11_1 PIX12_1 PIX13_1 PIX14_1 PIX15_1 PIX16_1 X1_1 X2_1 X3_1 X4_1 X5_1 X6_1 X7_1 X8_1 X9_1 X11_1 X13_1 X14_1 ; do
wget https://zenodo.org/records/13890874/files/${set}.tar
tar xvf ${set}.tar
rm -v ${set}.tar
done
</pre>
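A quick sanity check (just a sketch) that everything arrived and unpacked: 36 datasets with 100 frames each should give 3600 images in total.
<pre>
# count the unpacked diffraction images, wherever the tar files put them
find . -name "*.cbf.gz" | wc -l    # should print 3600
</pre>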

=== Step 2: process a single dataset to get an idea of spacegroup, cell and resolution ===

This was done for the (randomly chosen) X1 dataset. It turns out to be cubic insulin, spacegroup I213 with a cell of about 78 78 78 90 90 90. Its XDS_ASCII.HKL was saved as x1_as_reference.hkl to serve as REFERENCE_DATA_SET for all other datasets, to ensure consistent indexing; otherwise the possibility of re-indexing would have to be considered. That could be handled by xscale_isocluster, but it is easier to use a REFERENCE_DATA_SET if one exists.
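This initial processing can be done along the following lines (a minimal sketch; the directory name x1_test is arbitrary, and the path to the frames assumes the same layout as the scripts below):
<pre>
mkdir x1_test && cd x1_test
# let generate_XDS.INP write a default XDS.INP for the X1 dataset
generate_XDS.INP "../cows-pigs-people/X1_1_00???.cbf.gz"
xds_par
# inspect IDXREF.LP and CORRECT.LP for spacegroup, cell and resolution, then
# keep the result as the common reference for all other datasets
cp XDS_ASCII.HKL ../x1_as_reference.hkl
cd ..
</pre>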

=== Step 3: create 36 directories, named according to the unique parts of the filenames ===

Needs a bit of bash scripting.

<pre>
#!/bin/bash
# find the names distinguishing the datasets, and create a directory for each
# we just list all frames numbered 100, and remove the constant parts of the filename
for i in ../cows-pigs-people/*_00100.cbf.gz
do
# the next line echoes the filename, and removes the directory part and '_1_00100.cbf.gz'
# the result of that is used for "mkdir"
  mkdir $(echo $i|sed -e 's$../cows-pigs-people/$$' -e 's/_1_00100.cbf.gz//')
# (note the use of the $ delimiter instead of / if the pattern has file paths)
done
</pre>
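A quick check (a sketch) that all 36 directories were created:
<pre>
ls -d CIX* PIX* X* | wc -l    # should print 36
</pre>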

=== Step 4: process the data by having generate_XDS.INP create XDS.INP for each dataset, then modify that XDS.INP and run xds_par ===

<pre>
#!/bin/bash
for i in C* X* P* 
do  
  echo $i
  cd $i 
# start from a default XDS.INP:
  generate_XDS.INP "../../cows-pigs-people/${i}_1_00???.cbf.gz"  >&generate_XDS.INP.log
# modifications of XDS.INP
# make it read the cbf.gz files a little faster: ATTENTION - fill in the correct path!!!
  echo LIB=/usr/local/lib64/xds-zcbf.so >>XDS.INP
# if commented in, runs only JOB=CORRECT:
#  sed -i -e 's/XYCORR INIT COLSPOT IDXREF DEFPIX INTEGRATE//' XDS.INP
# use all frames for COLSPOT instead of only the first half:
  sed -i -e 's/SPOT_RANGE=1 50/SPOT_RANGE=1 100/' XDS.INP
# use high-resol cutoff of 1.2A according to some preliminary processing:
  sed -i -e 's/RESOLUTION_RANGE=50 0/RESOLUTION_RANGE=50 1.2/' XDS.INP
# use a reference data set to get consistent indexing:
  sed -i -e 's$! REFERENCE_DATA_SET=xxx/XDS_ASCII.HKL $ REFERENCE_DATA_SET= ../x1_as_reference.hkl $' XDS.INP
# (note the use of the $ delimiter instead of / if the pattern has file paths)
# if using a reference data set, spacegroup and cell constants must be given
  sed -i -e 's/SPACE_GROUP_NUMBER=0/SPACE_GROUP_NUMBER= 197/' XDS.INP
  sed -i -e 's/UNIT_CELL_CONSTANTS= 70 80 90 90 90 90/UNIT_CELL_CONSTANTS= 78 78 78 90 90 90/' XDS.INP
# run xds and write its terminal output to logfile
  xds_par  >&xds.log
  cd ..
done
</pre>

Running this on my 2020 desktop Linux machine with 16 cores takes about 9 minutes. Surprisingly, it takes just as long on my 2020 MacBook Air.
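Before scaling, it is worth checking that all 36 jobs finished (again just a sketch):
<pre>
# each successful run leaves an XDS_ASCII.HKL; this should print 36
ls */XDS_ASCII.HKL | wc -l
# list any run whose terminal output mentions an error
grep -il error */xds.log
</pre>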

=== Step 5: scale and merge with xscale ===

<pre>
mkdir xscale
cd xscale
# create XSCALE.INP. Precise average cell parameters were obtained using cellparm (XDS package) from the 36 XDS_ASCII.HKL files
echo UNIT_CELL_CONSTANTS=77.864 77.864 77.864 90 90 90 >XSCALE.INP
echo SPACE_GROUP_NUMBER=199 >>XSCALE.INP
echo OUTPUT_FILE=XSCALE.HKL >>XSCALE.INP
echo PRINT_CORRELATIONS=FALSE >>XSCALE.INP
echo SAVE_CORRECTION_IMAGES=FALSE >>XSCALE.INP
ls -C1 ../*/XDS_ASCII.HKL | sed s/^/INPUT_FILE=/ >>XSCALE.INP
# run xscale
xscale_par
</pre>
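For the record, the average cell can be obtained roughly like this (a sketch; it assumes that CELLPARM.INP takes one UNIT_CELL_CONSTANTS= ... WEIGHT= ... line per dataset, and it uses a constant weight rather than the number of reflections of each dataset):
<pre>
# collect the refined cell constants from all XDS_ASCII.HKL headers
for f in ../*/XDS_ASCII.HKL ; do
  grep '^!UNIT_CELL_CONSTANTS=' $f | sed -e 's/^!//' -e 's/$/ WEIGHT=100/'
done > CELLPARM.INP
cellparm    # the weighted mean cell is written to CELLPARM.LP
</pre>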

=== Step 6: analyze resulting XSCALE.HKL to find 3 groups of datasets ===

These groups should represent pig, cow and human insulin, respectively (but of course it is not clear which group is which organism; one could look at the 1.2 Å electron density maps and compare with the sequences).

<pre>
xscale_isocluster XSCALE.HKL
more iso.pdb
# this is a pseudo-PDB file with coordinates x,y,z for each dataset:
CRYST1  100.000  100.000  100.000  90.00  90.00  90.00 P 1
HETATM    1  O   HOH A   1      99.105   0.039  10.644  1.00100.00
HETATM    2  O   HOH A   2      99.176  -0.316  10.098  1.00100.00
HETATM    3  O   HOH A   3      98.168   6.104  14.441  1.00100.00
HETATM    4  O   HOH A   4      99.078  -0.596  10.852  1.00100.00
HETATM    5  O   HOH A   5      99.091   3.549  11.038  1.00100.00
HETATM    6  O   HOH A   6      99.247  -1.208  11.008  1.00100.00
HETATM    7  O   HOH A   7      99.139  -0.245  11.679  1.00100.00
HETATM    8  O   HOH A   8      98.955   3.779  11.552  1.00100.00
HETATM    9  O   HOH A   9      99.238   0.969  11.094  1.00100.00
HETATM   10  O   HOH A  10      99.080   3.796  11.786  1.00100.00
HETATM   11  O   HOH A  11      98.992   5.446  10.487  1.00100.00
HETATM   12  O   HOH A  12      99.337  -2.284   9.563  1.00100.00
HETATM   13  O   HOH A  13      99.018   9.704  -7.697  1.00100.00
HETATM   14  O   HOH A  14      98.667  13.018  -7.167  1.00100.00
HETATM   15  O   HOH A  15      98.120  15.082  -8.651  1.00100.00
HETATM   16  O   HOH A  16      99.008   9.701  -6.109  1.00100.00
HETATM   17  O   HOH A  17      97.573  17.831  -8.148  1.00100.00
HETATM   18  O   HOH A  18      97.031  21.266  -4.719  1.00100.00
HETATM   19  O   HOH A  19      97.278  18.496  -5.573  1.00100.00
HETATM   20  O   HOH A  20      94.039   6.994  -8.273  1.00100.00
HETATM   21  O   HOH A  21      98.383  15.300  -6.199  1.00100.00
HETATM   22  O   HOH A  22      98.969   8.456  -8.015  1.00100.00
HETATM   23  O   HOH A  23      99.213   5.332  -7.997  1.00100.00
HETATM   24  O   HOH A  24      99.226   5.837  -8.187  1.00100.00
HETATM   25  O   HOH A  25      98.881 -13.077  -3.362  1.00100.00
HETATM   26  O   HOH A  26      98.566 -15.863  -3.441  1.00100.00
HETATM   27  O   HOH A  27      98.337 -13.694  -8.018  1.00100.00
HETATM   28  O   HOH A  28      98.045 -15.069  -8.641  1.00100.00
HETATM   29  O   HOH A  29      98.669 -15.425  -3.796  1.00100.00
HETATM   30  O   HOH A  30      98.956 -12.814  -3.879  1.00100.00
HETATM   31  O   HOH A  31      98.725 -14.816  -1.764  1.00100.00
HETATM   32  O   HOH A  32      98.603 -14.161  -5.396  1.00100.00
HETATM   33  O   HOH A  33      99.002 -12.392  -3.869  1.00100.00
HETATM   34  O   HOH A  34      99.085 -11.848  -3.153  1.00100.00
HETATM   35  O   HOH A  35      98.473 -14.979  -3.730  1.00100.00
HETATM   36  O   HOH A  36      98.518 -13.849  -4.599  1.00100.00
END
</pre>

This pseudo-PDB file can be visualized in coot (or a similar viewer) and shows three groups, consisting of datasets 1-12, 13-24 and 25-36, around coordinates (99,0,10), (98,10,-8) and (99,-13,-5), respectively. This sequential ordering agrees with the fact that the datasets were processed in the order of their names. In other words, the three groups found by [[xscale_isocluster]] correspond to the three different organisms, as expected.
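The dataset numbers in iso.pdb should follow the order of the INPUT_FILE= lines in XSCALE.INP (i.e. the ISET numbers assigned by XSCALE), so the number-to-name mapping can be listed like this (a sketch):
<pre>
grep INPUT_FILE= XSCALE.INP | cat -n
</pre>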

An even better way is to run
<pre>
xscale_isocluster -clu 3 XSCALE.HKL
</pre>
which gives three output files XSCALE.1.INP, XSCALE.2.INP and XSCALE.3.INP, each with the correct 12 datasets.

Thanks, Graeme! This is a nice example that shows it is possible to differentiate between crystals with different but similar content.

=== See also ===
[[SSX]]

[[SSX-PepT Se|SSX-PepT-Se]]