XSCALE ISOCLUSTER

Copy-edit pass by Frances C. Bernstein, entered by HJB

For dataset analysis, the program uses the method of [https://dx.doi.org/10.1107/S1399004713025431 Brehm and Diederichs (2014) ''Acta Cryst'' '''D70''', 101-109] ([https://kops.uni-konstanz.de/bitstream/handle/123456789/26319/Brehm_263191.pdf?sequence=2&isAllowed=y PDF]), whose theoretical background is given in [https://doi.org/10.1107/S2059798317000699 Diederichs (2017) ''Acta Cryst'' '''D73''', 286-293] (open access). This results in an arrangement of the N datasets, represented by N vectors, in a low-dimensional space. Typically the dimension n of that space is chosen as 2 to 4, but it may have to be higher if N is large.

n=1 would be suitable if the datasets differ only in their random error. One more dimension is required for each additional systematic property that may vary between the datasets: e.g. n=2 is suitable if they differ only in their indexing mode (which then should have just two alternatives!), or in some other systematic property, like the length of the a axis. Higher values of n (e.g. n=4) are appropriate if there are four indexing possibilities (as is the case in P3<sub>x</sub>), or more systematic ways in which the datasets may differ (like significant variations in the a, b and c axes). In cases where datasets differ e.g. with respect to the composition or conformation of the crystallized molecules, it is ''a priori'' unknown which value of n should be chosen, so several values need to be tried and the results inspected.

An attempt is made to automatically identify clusters of datasets. The program writes a file called XSCALE.1.INP with the lines required for scaling the datasets of cluster 1, and similarly XSCALE.2.INP for cluster 2, and so on. Typically, one may want to create directories cluster1, cluster2, ..., and then establish symlinks (called XSCALE.INP) in these pointing to the XSCALE.#.INP files. This enables separate scaling of each cluster.
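The directory-and-symlink setup described above can be sketched as follows (the number of clusters and the directory names are illustrative; adjust them to the XSCALE.#.INP files actually written):

```shell
# Create one directory per cluster and link the cluster-specific input
# file as XSCALE.INP, so that xscale can then be run separately in each.
for i in 1 2; do
  mkdir -p "cluster$i"
  ln -sf "../XSCALE.$i.INP" "cluster$i/XSCALE.INP"
done
# Then: cd cluster1 && xscale   (and likewise for the other clusters)
```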

Furthermore, a file iso.pdb is produced that should be loaded into Coot. Then use Show/Cell and Symmetry/Show unit cell, and visualize the relations between datasets. Optionally, individual iso.x.pdb files can be written for each cluster. For an example, see [[SSX]].

== Notes ==
* For meaningful results, the number of known values (N*(N-1)/2 pairwise correlation coefficients) should be (preferably much) higher than the number of unknowns (1+n*(N-1)). This means that one needs at least 5 datasets if dim=2, and at least 7 if dim=3.
* The clustering of datasets in a low-dimensional space uses the method of Rodriguez and Laio (2014) ''Science'' '''344''', 1492-1496.
* Limitation: the program does not work if the XSCALE.INP that produced the XSCALE_FILE has more than one OUTPUT_FILE. This is because the dataset numbers in XSCALE_FILE then do not start from 1. Workaround: do several XSCALE runs, one for each OUTPUT_FILE.
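The workaround for the one-OUTPUT_FILE limitation can be sketched as two separate XSCALE runs, each driven by its own XSCALE.INP containing a single OUTPUT_FILE (all file and directory names here are illustrative):

```
! run1/XSCALE.INP -- first run, with only one OUTPUT_FILE
OUTPUT_FILE=group1.ahkl
INPUT_FILE=../xtal1/XDS_ASCII.HKL
INPUT_FILE=../xtal2/XDS_ASCII.HKL

! run2/XSCALE.INP -- second run, for the other output file
OUTPUT_FILE=group2.ahkl
INPUT_FILE=../xtal3/XDS_ASCII.HKL
```

Each resulting XSCALE_FILE then has dataset numbers starting from 1 and can be fed to the program separately.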
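The dataset-count requirement in the notes above (N*(N-1)/2 known values versus 1+n*(N-1) unknowns) can be verified with a quick shell calculation (a throwaway sketch, not part of the program):

```shell
# For dimension n, find the smallest N with N*(N-1)/2 > 1 + n*(N-1)
for n in 2 3; do
  N=2
  while [ $((N*(N-1)/2)) -le $((1+n*(N-1))) ]; do
    N=$((N+1))
  done
  echo "dim=$n: at least $N datasets"
done
# prints "dim=2: at least 5 datasets" and "dim=3: at least 7 datasets"
```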