Scale many datasets: Difference between revisions

(5 intermediate revisions by the same user not shown)

@@ Line 14: / Line 14: @@
 === Step 2: process a single dataset to get an idea of spacegroup, cell and resolution. ===
-This was done for the (randomly chosen) X1 dataset. Turns out that it is cubic insulin, spacegroup I213 with cell about 78 78 78 90 90 90. The XDS_ASCII.HKL of that was saved as x1_as_reference.hkl to serve as REFERENCE_DATA_SET for all other datasets, to ensure consistent indexing, because otherwise the possibility of re-indexing would have to be considered. This can be done by xscale_isocluster but it is easier to use a REFERENCE_DATA_SET if one exists.
+This was done for the (randomly chosen) X1 dataset. Turns out that it is cubic insulin, spacegroup I213 with cell about 78 78 78 90 90 90. The XDS_ASCII.HKL of that was saved as x1_as_reference.hkl to serve as REFERENCE_DATA_SET for all other datasets, to ensure consistent indexing, because otherwise the possibility of re-indexing would have to be considered. This can be done by [[xscale_isocluster]] but it is easier to use a REFERENCE_DATA_SET if one exists.
 === Step 3: create 36 directories, named according to the unique parts of the filenames. ===
@@ Line 41: / Line 41: @@
    generate_XDS.INP "../../cows-pigs-people/${i}_1_00???.cbf.gz"  >&generate_XDS.INP.log
 # modifications of XDS.INP
-# make it read the cbf.gz files a little faster:
+# make it read the cbf.gz files a little faster: ATTENTION - fill in the correct path!!!
    echo LIB=/usr/local/lib64/xds-zcbf.so >>XDS.INP
 # if commented in, runs only JOB=CORRECT:
-#  sed -i 's/XYCORR INIT COLSPOT IDXREF DEFPIX INTEGRATE//' XDS.INP
+#  sed -i -e 's/XYCORR INIT COLSPOT IDXREF DEFPIX INTEGRATE//' XDS.INP
 # use all frames for COLSPOT instead of only the first half:
-   sed -i 's/SPOT_RANGE=1 50/SPOT_RANGE=1 100/' XDS.INP
+   sed -i -e 's/SPOT_RANGE=1 50/SPOT_RANGE=1 100/' XDS.INP
 # use high-resol cutoff of 1.2A according to some preliminary processing:
-   sed -i 's/RESOLUTION_RANGE=50 0/RESOLUTION_RANGE=50 1.2/' XDS.INP
+   sed -i -e 's/RESOLUTION_RANGE=50 0/RESOLUTION_RANGE=50 1.2/' XDS.INP
 # use a reference data set to get consistent indexing:
-   sed -i 's$! REFERENCE_DATA_SET=xxx/XDS_ASCII.HKL $ REFERENCE_DATA_SET= ../x1_as_reference.hkl $' XDS.INP
+   sed -i -e 's$! REFERENCE_DATA_SET=xxx/XDS_ASCII.HKL $ REFERENCE_DATA_SET= ../x1_as_reference.hkl $' XDS.INP
 # (note the use of the $ delimiter instead of / if the pattern has file paths)
 # if using a reference data set, spacegroup and cell constants must be given
-   sed -i 's/SPACE_GROUP_NUMBER=0/SPACE_GROUP_NUMBER= 197/' XDS.INP
+   sed -i -e 's/SPACE_GROUP_NUMBER=0/SPACE_GROUP_NUMBER= 197/' XDS.INP
-   sed -i 's/UNIT_CELL_CONSTANTS= 70 80 90 90 90 90/UNIT_CELL_CONSTANTS= 78 78 78 90 90 90/' XDS.INP
+   sed -i -e 's/UNIT_CELL_CONSTANTS= 70 80 90 90 90 90/UNIT_CELL_CONSTANTS= 78 78 78 90 90 90/' XDS.INP
 # run xds and write its terminal output to logfile
    xds_par  >&xds.log
@@ Line 60: / Line 60: @@
 done
 </pre>
-Running this on my 2020 desktop Linux machine with 16 cores takes about 9 minutes.
+Running this on my 2020 desktop Linux machine with 16 cores takes about 9 minutes. Surprisingly, it takes just as long on my 2020 MacBook Air.
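This revision changes every `sed -i` call in the script to `sed -i -e`. The reason is not stated; a plausible assumption is that it lets the script run with the BSD sed shipped on macOS, where `-i` requires an argument. With that sed, `-e` is then taken as the backup suffix, so a copy named `XDS.INP-e` is left behind after each edit. The sketch below shows one form that should behave the same with GNU and BSD sed. The function name `edit_in_place` is hypothetical and not part of the original script.

```shell
# Hypothetical helper, not part of the original script: run one sed
# substitution in place using a form accepted by both GNU and BSD sed.
edit_in_place() {
    # $1 = sed expression, $2 = file to edit
    # -i.bak (suffix attached, no space) is understood by both variants;
    # the backup copy is removed once the edit has succeeded.
    sed -i.bak -e "$1" "$2" && rm -f "$2.bak"
}

# Example calls, mirroring two of the edits in the script:
#   edit_in_place 's/SPOT_RANGE=1 50/SPOT_RANGE=1 100/' XDS.INP
#   edit_in_place 's/RESOLUTION_RANGE=50 0/RESOLUTION_RANGE=50 1.2/' XDS.INP
```

Keeping the suffix attached to `-i` avoids the ambiguity between the two sed variants, at the cost of creating and deleting a temporary backup for each edit.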
 === Step 5: scale and merge with xscale ===
@@ Line 66: / Line 66: @@
 mkdir xscale
 cd xscale
-# create XSCALE.INP. The unit cell parameters were obtained using cellparm (XDS package) from the 36 XDS_ASCII.HKL files
+# create XSCALE.INP. Precise average cell parameters were obtained using cellparm (XDS package) from the 36 XDS_ASCII.HKL files
 echo UNIT_CELL_CONSTANTS=77.864    77.864    77.864 90 90 90 >XSCALE.INP
 echo SPACE_GROUP_NUMBER=199 >>XSCALE.INP
@@ Line 77: / Line 77: @@
 </pre>
-=== Step 6: analyze resulting XSCALE.HKL to find 3 groups of datasets ===
+=== Step 6: analyze, using [[xscale_isocluster]], the resulting XSCALE.HKL to find 3 groups of datasets ===
 ... representing pig, cow and human insulin, respectively (but of course it is not clear which group is which organism; one could look at the 1.2A electron density maps and compare with sequences).
 <pre>
 xscale_isocluster XSCALE.HKL
 more iso.pdb
+# this is a pseudo-PDB file with coordinates x,y,z for each dataset:
 CRYST1  100.000  100.000  100.000  90.00  90.00  90.00 P 1
 HETATM    1  O   HOH A   1      99.105   0.039  10.644  1.00100.00
@@ Line 122: / Line 123: @@
 </pre>
 This pseudo-PDB file can be visualized in Coot or a similar program and shows three groups, consisting of datasets 1-12, 13-24 and 25-36, around coordinates (99,0,10), (98,10,-8) and (99,-13,-5), respectively. This sequential grouping is consistent with the datasets having been processed in the order of their names. In other words, the three groups found by [[xscale_isocluster]] correspond to the three different organisms, as expected.
+An even better way is to run
+ xscale_isocluster -clu 3 XSCALE.HKL
+and this writes three output files (XSCALE.1.INP, XSCALE.2.INP and XSCALE.3.INP), each with the correct 12 datasets.
 Thanks, Graeme! This is nice and shows the possibility to differentiate between crystals of different but similar content.
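To inspect the grouping in iso.pdb without a graphics program, the coordinates can also be listed as plain text. The following is a minimal sketch, assuming that iso.pdb uses the standard fixed-column PDB layout for HETATM records (residue number in columns 23-26, x, y, z in columns 31-54) and that the residue number corresponds to the dataset number. The function name `iso_coords` is hypothetical.

```shell
# Hypothetical helper: print "dataset x y z" for each HETATM record of a
# pseudo-PDB file such as iso.pdb.
# Assumptions: standard fixed-column PDB format; residue number = dataset number.
iso_coords() {
    # $1 = pseudo-PDB file; all other record types (e.g. CRYST1) are skipped
    awk '/^HETATM/ {
        printf "%d %.3f %.3f %.3f\n",
            substr($0, 23, 4),   # residue number, columns 23-26
            substr($0, 31, 8),   # x, columns 31-38
            substr($0, 39, 8),   # y, columns 39-46
            substr($0, 47, 8)    # z, columns 47-54
    }' "$1"
}

# Example call:
#   iso_coords iso.pdb
```

Reading fixed columns rather than whitespace-separated fields keeps the extraction correct even when neighbouring columns run together, as the occupancy and B-factor columns do in the HETATM line shown above.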