XSCALE ISOCLUSTER: Difference between revisions
Revision as of 20:20, 24 April 2017
xscale_isocluster is a program that clusters datasets stored in a single unmerged reflection file as written by XSCALE.
The help output (obtained by using the -h option) is
 xscale_isocluster KD 2016-12-20. -h option shows options
 Academic use only; no redistribution. Expires 2017-12-31
 usage: xscale_isocluster -dmin <lowres> -dmax <highres> -nbin <nbin> -mode <1 or 2> -dim <dim> -clu <#> -cen <#,#,#,...> -<aiw> XSCALE_FILE_NAME
 dmax, dmin (default from file) and nbin (default 10) have the usual meanings.
 mode can be 1 (equal volumes of resolution shells) or 2 (increasing volumes; default).
 dim is number of dimensions (default 3).
 clu (by default automatically) is number of clusters.
 cen (by default automatically) is a set of cluster centers (up to 9) specified by their ISET.
 cen must be specified after clu, and the number of values given must match clu.
 -a: base calculations on anomalous (instead of isomorphous) signal
 -i: write pseudo-PDB files to visualize clusters
 -w: no weighting of intensities with their sigmas
For dataset analysis, the program uses the method of Brehm and Diederichs (2014) Acta Cryst D70, 101-109 (doi:10.1107/S1399004713025431), whose theoretical background is given in Diederichs (2017) Acta Cryst D73, 286-293 (doi:10.1107/S2059798317000699, open access). This results in an arrangement of N datasets represented by N vectors in a low-dimensional space. Typically, the dimension n of that space (set with the -dim option) may be chosen as 2 to 4, but it may be higher if N is large (see below). n=1 would be suitable if the datasets differ only in their random error. One more dimension is required for each additional systematic property that may vary between the datasets. For example, n=2 is suitable if they differ only in their indexing mode (which should then have only two alternatives), or in some other single systematic property, like the length of the a axis. Higher values of n (e.g. n=4) are appropriate if there are 4 indexing possibilities (which is the case in P3x), or if there are more systematic ways in which the datasets may differ (like significant variations in the a, b and c axes). In cases where datasets differ e.g. with respect to the composition or conformation of the crystallized molecules, it is a priori unknown which value of n should be chosen; several values need to be tried and the results inspected.
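Trying several values of n can be scripted. The following is only a sketch: it prints one command line per trial dimension, using the -dim and -i options from the help output above. The file name XSCALE.HKL and the function name isocluster_trials are placeholders chosen for illustration, not something the program prescribes.

```shell
#!/bin/sh
# Print one xscale_isocluster command per trial dimension (dry run).
# $1 = unmerged reflection file written by XSCALE; remaining arguments = dimensions to try.
isocluster_trials() {
    file=$1
    shift
    for dim in "$@"; do
        # -dim and -i are taken from the program's help output
        echo "xscale_isocluster -dim $dim -i $file"
    done
}

# XSCALE.HKL is a placeholder file name
isocluster_trials XSCALE.HKL 2 3 4
```

After reviewing the printed commands, they can be run one at a time. This page does not say whether output files of successive runs overwrite each other, so it may be safer to run each trial in its own directory.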
The program writes files called XSCALE.1.INP with lines required for scaling the datasets of cluster 1, and similarly XSCALE.2.INP for cluster 2, and so on. Typically, one should create subdirectories 1 2 ..., and then create symlinks in these called XSCALE.INP to the XSCALE.#.INP files. This enables separate scaling of each cluster.
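The subdirectory and symlink setup described above could be scripted roughly as follows. This is a minimal sketch, assuming the XSCALE.#.INP files are in the current directory; the function name link_clusters is made up for illustration.

```shell
#!/bin/sh
# For each XSCALE.<k>.INP in the current directory, create subdirectory <k>
# and a symlink <k>/XSCALE.INP pointing back to that file.
link_clusters() {
    for f in XSCALE.[0-9]*.INP; do
        [ -e "$f" ] || continue        # pattern matched nothing: no cluster files here
        k=${f#XSCALE.}                 # strip leading "XSCALE."
        k=${k%.INP}                    # strip trailing ".INP", leaving the cluster number
        mkdir -p "$k"
        ln -sf "../$f" "$k/XSCALE.INP"
    done
}

link_clusters
```

XSCALE can then be run inside each subdirectory, provided that the file paths inside the XSCALE.#.INP files still resolve from there (an assumption that should be checked before scaling).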
For meaningful results, the number of known values (the N*(N-1)/2 pairwise correlation coefficients) should be (preferably much) higher than the number of unknowns (1+n*(N-1)), where n is the value of dim. This means that one needs at least 5 datasets if dim=2, and at least 7 if dim=3.
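The minimum number of datasets for a given dimension follows directly from this inequality. The small sketch below (the function name min_datasets is made up for illustration) searches for the smallest N with N*(N-1)/2 > 1+n*(N-1). It gives only the bare minimum; as stated above, N should preferably be much larger.

```shell
#!/bin/sh
# Smallest N for which the number of pairwise correlation coefficients,
# N*(N-1)/2, exceeds the number of unknowns, 1+n*(N-1).
min_datasets() {
    dim=$1
    N=2
    while [ $(( N * (N - 1) / 2 )) -le $(( 1 + dim * (N - 1) )) ]; do
        N=$(( N + 1 ))
    done
    echo "$N"
}

for d in 2 3 4; do
    echo "dim=$d: at least $(min_datasets "$d") datasets"
done
# prints:
# dim=2: at least 5 datasets
# dim=3: at least 7 datasets
# dim=4: at least 9 datasets
```

For example, with dim=2 and N=4 there are 6 correlation coefficients but 7 unknowns, whereas N=5 gives 10 coefficients against 9 unknowns.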
The clustering of datasets in the low-dimensional space uses the method of Rodriguez and Laio (2014) Science 344, 1492-1496.