Xds nonisomorphism
Revision as of 14:56, 17 May 2018
xds_nonisomorphism (Linux binary: ftp://turn5.biologie.uni-konstanz.de/pub/linux_bin/xds_nonisomorphism, source: ftp://turn5.biologie.uni-konstanz.de/pub/sources/xds_nonisomorphism.f90) is a program that analyzes data sets stored in unmerged reflection files (typically called XDS_ASCII.HKL) as written by XDS. It implements the method of Brehm and Diederichs (2014) (https://doi.org/10.1107/S1399004713025431) and the theory of Diederichs (2017) (https://doi.org/10.1107/S2059798317000699). Its purpose is the identification of non-isomorphous (i.e. dissimilar or less well related) data sets among other, more similar data sets. After running xds_nonisomorphism, one may choose to merge only the most isomorphous (similar) data sets and to discard the non-isomorphous ones, or to analyze these separately.
It should be noted that the result of the analysis does not depend on the amount of random error, which means that it does not depend on the strength of the data sets: it works equally well for weakly and strongly exposed crystals.
xds_nonisomorphism prints a short help text if the -h option is used.
Data
The assumption is that several data sets exist, and that these should be merged with XSCALE. The program therefore reads the names of the XDS_ASCII.HKL files from XSCALE.INP. Both XSCALE.INP and the XDS_ASCII.HKL file named on each INPUT_FILE= line must exist. The program reads the files in the order given, and produces tables with pairwise statistics. The method requires data sets with internal multiplicity, and mutual overlap (common reflections) between data sets.
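As a minimal sketch of the file check described above, the following Python helper lists the INPUT_FILE= entries of an XSCALE.INP and reports those that do not exist. The function names are hypothetical, and treating everything after "!" as a comment is an assumption about the XDS input conventions; this is not how xds_nonisomorphism itself parses the file.

```python
import os

def input_files(xscale_inp_text):
    """Return the paths given on INPUT_FILE= lines, in the order they appear."""
    paths = []
    for line in xscale_inp_text.splitlines():
        line = line.split("!", 1)[0].strip()  # assumption: "!" starts a comment
        if line.upper().startswith("INPUT_FILE="):
            value = line.split("=", 1)[1].strip()
            if value:
                paths.append(value.split()[0])
    return paths

def missing_files(paths):
    """Return those paths that do not point to an existing file."""
    return [p for p in paths if not os.path.isfile(p)]

# Example text, identical to the XSCALE.INP used further down this page.
example = """OUTPUT_FILE=temp.ahkl
INPUT_FILE=../../xds_317_7.rg4/XDS_ASCII.HKL
INPUT_FILE=../../xds_317_8.rg4/XDS_ASCII.HKL
INPUT_FILE=../../xds_319_7.rg4/XDS_ASCII.HKL
"""

files = input_files(example)
for path in files:
    print(path)
```

Calling missing_files(files) before running the program gives an early warning about mistyped paths.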
Calculation
In particular, for each pair it determines
- the CC* values (Karplus & Diederichs (2012). Science 336, 1030–1033) from the CC1/2 of the data sets (using the σ-τ method of Assmann et al., J. Appl. Cryst. (2016). 49, 1021–1028) in columns 3 and 4 of the output, and
- the pairwise (Pearson's) correlation coefficients (column 5).
As given by equation 2 of Diederichs (2017), the ratio between the latter quantity and the product of the CC* values of a pair is a measure of the non-isomorphism - for isomorphous data, that ratio is 1; for non-isomorphous data, the ratio is lower. This ratio is given in column 6 under the heading "cos(phi)".
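The ratio can be illustrated with a few lines of Python. This is a sketch, not the program's code: the function names are hypothetical, the CC* formula is the one from Karplus & Diederichs (2012), and the numbers are taken from the first resolution shell of the 1/2 comparison in the example output further down this page.

```python
import math

def cc_star(cc_half):
    """CC* from CC1/2 (Karplus & Diederichs, 2012)."""
    return math.sqrt(2.0 * cc_half / (1.0 + cc_half))

def cos_phi(cc_12, cc_star_1, cc_star_2):
    """Ratio of the pairwise CC to the product of the two CC* values."""
    return cc_12 / (cc_star_1 * cc_star_2)

# Columns 3, 4 and 5 of shell 1, data sets 1 and 2, in the example below.
ratio = cos_phi(cc_12=0.9966, cc_star_1=0.9998, cc_star_2=0.9999)
print(round(ratio, 4))
```

Because the table values are rounded to four decimals, a recomputed ratio can differ from the printed column 6 in the last digit.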
Analysis and interpretation
Angles (calculated as the inverse cosine of the ratio) are expressed in degrees. Less than 10° may be considered good isomorphism; 90° means highly non-isomorphous (i.e. completely unrelated) data sets. However, the interpretation of the magnitude of an angle depends on the resolution. To account for that, the program uses a formula (McCoy et al. (2017) PNAS 114, 3637-3641, equation 1) that relates coordinate difference to correlation (column 8 of the output). This coordinate RMSD value should be independent of resolution. If it is not (which is sometimes seen in pairwise comparisons of data sets), this indicates that some other systematic difference, which cannot be interpreted as a coordinate difference, exists between the data sets. Candidates are the many possible sources of systematic error, e.g. errors in data processing, twinning, overloads, vibrations ...
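The two conversions can be sketched in Python as follows. The angle is simply the inverse cosine. For the RMSD, the sketch assumes the relation cos(phi) = exp(-(2π²/3)·RMSD²/d²) with d a representative resolution of the shell; both this form and the choice of d are assumptions made for illustration and have not been checked against the program source. The function names are hypothetical.

```python
import math

def angle_deg(cos_phi):
    """Angle in degrees from the cos(phi) ratio, clamped to the valid range."""
    c = max(-1.0, min(1.0, cos_phi))
    return math.degrees(math.acos(c))

def rmsd_coord(cos_phi, d):
    """Coordinate RMSD (in Angstrom) under the ASSUMED relation
    cos_phi = exp(-(2*pi**2/3) * rmsd**2 / d**2)."""
    if not 0.0 < cos_phi <= 1.0:
        return float("nan")  # no meaningful RMSD for such a ratio
    return d * math.sqrt(-3.0 * math.log(cos_phi) / (2.0 * math.pi ** 2))

# Highest resolution shell of the 1/2 comparison in the example below:
# cos(phi) = 0.8495; d = 1.94 is the midpoint of that shell (an assumption).
print(round(angle_deg(0.8495), 2))
print(round(rmsd_coord(0.8495, 1.94), 3))
```

With these inputs the sketch gives values close to columns 7 and 8 of that table row (31.8427 and 0.3057), which suggests, but does not prove, that the assumed relation is similar to what the program uses.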
After the analysis, the program produces a 3D representation of the arrangement of data sets such that their distances in 3D try to reproduce the angles (please note that this is a completely different representation from that of xscale_isocluster!).
Example and explanation of output
Create XSCALE.INP (XSCALE does not have to be run at this point!):

    OUTPUT_FILE=temp.ahkl
    INPUT_FILE=../../xds_317_7.rg4/XDS_ASCII.HKL
    INPUT_FILE=../../xds_317_8.rg4/XDS_ASCII.HKL
    INPUT_FILE=../../xds_319_7.rg4/XDS_ASCII.HKL

Now run xds_nonisomorphism:

    dikay@turn29:-xscale_rg4/tst% xds_nonisomorphism
    xds_nonisomorphism KD 2017-12-03. Academic use only; binary expires 2018-12-31.
    Pls cite Diederichs, K. (2017) Acta Cryst D73, 286-293. -h option shows options
    reading XSCALE.INP to find XDS_ASCII.HKL-type files
    !SPACE_GROUP_NUMBER= 4
    !UNIT_CELL_CONSTANTS= 34.534 57.199 72.346 90.000 90.155 90.000
    iset, dmax, dmin, name: 1 25.012 1.600 ../../xds_317_7.rg4/XDS_ASCII.HKL
    iset, dmax, dmin, name: 2 25.012 1.889 ../../xds_317_8.rg4/XDS_ASCII.HKL
    iset, dmax, dmin, name: 3 22.917 1.664 ../../xds_319_7.rg4/XDS_ASCII.HKL
    iset,nobs,nunique,nunique w/ >1 observations= 1 125760 22536 22350
    iset,nobs,nunique,nunique w/ >1 observations= 2 125086 22303 22114
    iset,nobs,nunique,nunique w/ >1 observations= 3 150588 22496 22357
    Lowest and highest resolution used: 22.917 1.889
    10 resolution shells:
    5.800 4.168 3.422 2.972 2.663 2.433 2.255 2.110 1.991 1.889

The data sets have been read, and some basic statistics are produced. Also, the resolution limits of the (default) 10 resolution shells are listed.
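The listed shell limits are numerically consistent with shells of equal width in 1/d² between the lowest and highest resolution used. This is an observation from the numbers above, not documented behaviour of the program. A short Python sketch (hypothetical function name) that reproduces the limits under that assumption:

```python
import math

def shell_limits(d_max, d_min, n_shells=10):
    """High-resolution limit of each shell, ASSUMING equal steps in 1/d**2."""
    s_lo = 1.0 / d_max ** 2
    s_hi = 1.0 / d_min ** 2
    step = (s_hi - s_lo) / n_shells
    return [1.0 / math.sqrt(s_lo + k * step) for k in range(1, n_shells + 1)]

# Lowest and highest resolution used in the run shown above.
limits = shell_limits(22.917, 1.889)
print(" ".join(f"{d:.3f}" for d in limits))
```

The computed values agree with the printed ones to within a few thousandths of an Å; small differences in the last digit are expected because the input limits are themselves rounded.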

    iset1, iset2= 1 2
    resol_shell nmatch  CC*_1  CC*_2 CC(1,2) cos(phi) angle(deg) RMSD_coord
              1    768 0.9998 0.9999  0.9966   0.9969     4.4984     0.1761
              2   1297 0.9999 0.9997  0.9670   0.9674    14.6752     0.3422
              3   1672 0.9998 0.9996  0.9787   0.9793    11.6919     0.2113
              4   1903 0.9998 0.9991  0.9751   0.9762    12.5148     0.1920
              5   2234 0.9995 0.9947  0.9702   0.9759    12.5988     0.1709
              6   2508 0.9991 0.9844  0.9260   0.9415    19.6870     0.2432
              7   2575 0.9979 0.9527  0.9042   0.9512    17.9739     0.2041
              8   2785 0.9957 0.8867  0.8418   0.9534    17.5545     0.1854
              9   3023 0.9900 0.7563  0.6880   0.9189    23.2369     0.2322
             10   2966 0.9681 0.5344  0.4395   0.8495    31.8427     0.3057

CC*_1 is very high out to the highest resolution, so the first data set (iset1=1) is quite good. CC*_2 is lower at high resolution. cos(phi) = CC(1,2)/(CC*_1 * CC*_2) in column 6 should ideally be 1, but here it indicates non-isomorphism. After conversion to an angle (column 7), only the lowest resolution shell appears "good". However, we can estimate the RMS deviation of coordinates (column 8) that would give rise to this amount of non-isomorphism; these values scatter around 0.2 Å (0.17 to 0.34 Å).

    iset1, iset2= 1 3
    resol_shell nmatch  CC*_1  CC*_2 CC(1,2) cos(phi) angle(deg) RMSD_coord
              1    781 0.9998 0.9997  0.9931   0.9936     6.5032     0.2544
              2   1329 0.9999 0.9997  0.9691   0.9695    14.1917     0.3312
              3   1613 0.9998 0.9991  0.9476   0.9486    18.4460     0.3347
              4   1947 0.9998 0.9983  0.9880   0.9898     8.1748     0.1251
              5   2246 0.9995 0.9924  0.9747   0.9826    10.6984     0.1449
              6   2518 0.9991 0.9813  0.9536   0.9727    13.4283     0.1650
              7   2578 0.9978 0.9442  0.8971   0.9522    17.7954     0.2020
              8   2819 0.9957 0.8555  0.7925   0.9304    21.5076     0.2282
              9   2936 0.9894 0.6446  0.5191   0.8140    35.5116     0.3620
             10   3216 0.9670 0.3030  0.2574   0.8784    28.5551     0.2721

This is similar to the comparison of data sets 1 and 2, except that at low resolution another source of non-isomorphism appears to dominate.

    iset1, iset2= 2 3
    resol_shell nmatch  CC*_1  CC*_2 CC(1,2) cos(phi) angle(deg) RMSD_coord
              1    773 0.9999 0.9997  0.9919   0.9924     7.0822     0.2777
              2   1338 0.9997 0.9997  0.9836   0.9842    10.2144     0.2369
              3   1629 0.9996 0.9991  0.9322   0.9334    21.0351     0.3828
              4   1913 0.9991 0.9984  0.9686   0.9710    13.8272     0.2123
              5   2234 0.9946 0.9923  0.9489   0.9614    15.9703     0.2172
              6   2503 0.9844 0.9812  0.8915   0.9230    22.6311     0.2805
              7   2675 0.9518 0.9425  0.8560   0.9542    17.4034     0.1974
              8   2783 0.8858 0.8568  0.6870   0.9053    25.1411     0.2679
              9   2938 0.7508 0.6453  0.3902   0.8054    36.3539     0.3712
             10   2953 0.5340 0.2847  0.1352   0.8894    27.2020     0.2591

Again the picture is similar, except that the coordinates seem to differ a bit more between data sets 2 and 3.

    using average RMSD values (excluding unreasonable table entries):
    dataset #, mean RMSD to all other datasets: 1 0.2341354
    dataset #, mean RMSD to all other datasets: 2 0.2483061
    dataset #, mean RMSD to all other datasets: 3 0.2561350
    central dataset (most isomorphous) is number 1
    most distant dataset (least isom.) is number 3
    RMSD= lines in XSCALE.INP.rename_me will be specified w.r.t. to central dataset
    Jacobi it_num,num_rot: 8 10
    Eigenvalues: -1.1685427E-09 2.4050672E-02 3.6891516E-02
    coordinates in 3D that best reproduce the angles as distances:
    -2.6219051E-02 -0.1248424 0.0000000E+00
    -0.1207941 8.0754802E-02 0.0000000E+00
    0.1470131 4.4087593E-02 0.0000000E+00
    wrote noniso.pdb

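The "mean RMSD to all other datasets" lines can be approximately reproduced from the RMSD_coord columns of the three pairwise tables. The sketch below averages each column and then averages over the pairs that contain a given data set; this order of averaging is an assumption, and the program's exclusion of "unreasonable table entries" is not modelled (no entry is excluded here). Variable names are hypothetical.

```python
# RMSD_coord columns (column 8) of the three pairwise tables above.
rmsd_columns = {
    (1, 2): [0.1761, 0.3422, 0.2113, 0.1920, 0.1709,
             0.2432, 0.2041, 0.1854, 0.2322, 0.3057],
    (1, 3): [0.2544, 0.3312, 0.3347, 0.1251, 0.1449,
             0.1650, 0.2020, 0.2282, 0.3620, 0.2721],
    (2, 3): [0.2777, 0.2369, 0.3828, 0.2123, 0.2172,
             0.2805, 0.1974, 0.2679, 0.3712, 0.2591],
}

def mean(values):
    return sum(values) / len(values)

# Mean RMSD per pair, then per data set over all pairs it takes part in.
pair_mean = {pair: mean(column) for pair, column in rmsd_columns.items()}
datasets = sorted({i for pair in pair_mean for i in pair})
mean_to_others = {
    i: mean([m for pair, m in pair_mean.items() if i in pair])
    for i in datasets
}

central = min(mean_to_others, key=mean_to_others.get)
most_distant = max(mean_to_others, key=mean_to_others.get)

for i in datasets:
    print(i, round(mean_to_others[i], 4))
print("central:", central, "most distant:", most_distant)
```

The data set with the smallest mean is reported as central, the one with the largest mean as most distant, in agreement with the program output for this example.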
noniso.pdb is a pseudo-PDB file in which each data set is represented as an atom position; it can be loaded into Coot. The three data sets form a roughly equilateral triangle: there is no hint that two of them are close to each other but far from the third, which would suggest discarding one of them.

    noniso.pdb=representation of data set arrangement in 3D (coords*100)
    wrote XSCALE.INP.rename_me with additional RMSD= lines

(Currently, the XSCALE.INP.rename_me file that xds_nonisomorphism writes is useless, because XSCALE does not understand the RMSD lines.)
For completeness, this is noniso.pdb:

    CRYST1 100.000 100.000 100.000 90.00 90.00 90.00 P 1
    HETATM 1 O HOH A 1 -2.622 -12.484 0.000 1.0000.00
    HETATM 2 O HOH A 2 -12.079 8.075 0.000 1.0000.00
    HETATM 3 O HOH A 3 14.701 4.409 0.000 1.0000.00
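Instead of inspecting the arrangement visually, the pairwise distances between the pseudo-atoms can be computed directly. The following Python sketch parses the HETATM lines by splitting on whitespace, which works for the text shown here but is an assumption: in a real fixed-column PDB file, large coordinates may run together. The function name is hypothetical.

```python
import math
from itertools import combinations

pdb_text = """CRYST1 100.000 100.000 100.000 90.00 90.00 90.00 P 1
HETATM 1 O HOH A 1 -2.622 -12.484 0.000 1.0000.00
HETATM 2 O HOH A 2 -12.079 8.075 0.000 1.0000.00
HETATM 3 O HOH A 3 14.701 4.409 0.000 1.0000.00
"""

def atom_positions(text):
    """Map data set number -> (x, y, z), using whitespace-separated fields."""
    positions = {}
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0] == "HETATM":
            positions[int(fields[1])] = tuple(float(v) for v in fields[6:9])
    return positions

positions = atom_positions(pdb_text)
distances = {
    (i, j): math.dist(positions[i], positions[j])
    for i, j in combinations(sorted(positions), 2)
}

for pair, dist in distances.items():
    print(pair, round(dist, 2))
```

For this example the three distances come out at roughly 22.6, 24.2 and 27.0 (in the file's coords*100 units), i.e. similar side lengths with no data set standing apart from the other two.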