XSCALE ISOCLUSTER

Revision as of 15:57, 6 April 2017 by Kay

xscale_isocluster is a program that clusters datasets stored in a single unmerged reflection file as written by XSCALE.

The help output is

xscale_isocluster KD 2016-12-20. -h option shows options
Academic use only; no redistribution. Expires 2017-12-31
usage: xscale_isocluster -dmin <lowres> -dmax <highres> -nbin <nbin> -mode <1 or 2> -dim <dim> -clu <#> -cen <#,#,#,...> -<aiw> XSCALE_FILE_NAME
dmax, dmin (default from file) and nbin (default 10) have the usual meanings.
mode can be 1 (equal volumes of resolution shells) or 2 (increasing volumes; default).
dim is number of dimensions (default 3).
clu (by default automatically) is number of clusters.
cen (by default automatically) is a set of cluster centers (up to 9) specified by their ISET.
cen must be specified after clu, and the number of values given must match clu.
   -a: base calculations on anomalous (instead of isomorphous) signal
   -i: write pseudo-PDB files to visualize clusters
   -w: no weighting of intensities with their sigmas

For dataset analysis, the program uses the method of Brehm and Diederichs (2014) Acta Cryst D70, 101-109, whose theoretical background is given in Diederichs (2017) Acta Cryst D73, 286-293 (open access). The method arranges the N datasets as N vectors in a low-dimensional space. The dimension n of that space is typically chosen as 2 to 4, but may be higher if N is large (see below).

n=1 is suitable if the datasets differ only in their random error. One more dimension is required for each additional systematic property that may vary between the datasets. For example, n=2 is suitable if they differ only in their indexing mode (provided there are exactly two alternatives), or in one other systematic property, such as the length of the a axis. Higher values (e.g. n=4) are appropriate if there are 4 indexing possibilities (which is the case in P3x), or if the datasets may differ in several systematic ways (such as significant variations in the a, b and c axes). If the datasets differ, for example, in the composition or conformation of the crystallized molecules, the appropriate value of n is not known a priori; several values need to be tried and the results inspected.

For meaningful results, the number of known values (the N*(N-1)/2 pairwise correlation coefficients) should be higher, preferably much higher, than the number of unknowns (1+n*(N-1)).
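The two counts are easy to tabulate before choosing a dimension. The following Python sketch only evaluates the two formulas quoted above; the function names and the `ratio` safety factor are hypothetical and are not part of xscale_isocluster.

```python
def known_and_unknown_counts(n_datasets, n_dim):
    """Return (knowns, unknowns) for N datasets in n dimensions."""
    # Knowns: one correlation coefficient per pair of datasets, N*(N-1)/2.
    knowns = n_datasets * (n_datasets - 1) // 2
    # Unknowns: 1 + n*(N-1), as stated in the text.
    unknowns = 1 + n_dim * (n_datasets - 1)
    return knowns, unknowns


def max_dim(n_datasets, ratio=1.0):
    """Largest n for which knowns > ratio * unknowns (0 if there is none).

    `ratio` is a hypothetical safety factor; values above 1 enforce the
    "preferably much higher" recommendation more strictly.
    """
    n = 0
    while True:
        knowns, unknowns = known_and_unknown_counts(n_datasets, n + 1)
        if knowns <= ratio * unknowns:
            return n
        n += 1


if __name__ == "__main__":
    for n_datasets in (5, 10, 20, 50):
        knowns, unknowns = known_and_unknown_counts(n_datasets, 3)
        print(n_datasets, knowns, unknowns, max_dim(n_datasets))
```

For N=10 and n=3 this gives 45 knowns against 28 unknowns, whereas for N=5 and n=3 there are only 10 knowns against 13 unknowns, so a lower dimension would be needed.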

The clustering of datasets in the low-dimensional space uses the method of Rodriguez and Laio (2014) Science 344, 1492-1496.
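To illustrate the idea behind that clustering method, here is a minimal Python sketch of density-peak clustering as described in the cited paper. It is not the code used by xscale_isocluster. The function name, the Gaussian density kernel, the cutoff `d_c` and the selection of centers by the largest product of density and distance are choices made for this example; `n_clusters` merely plays the role that the clu option plays in the program.

```python
import math


def density_peak_clusters(points, d_c, n_clusters):
    """Cluster points by density peaks; returns (labels, centers).

    points     : list of coordinate tuples (one vector per dataset)
    d_c        : cutoff distance that sets the scale of the local density
    n_clusters : number of cluster centers to pick
    """
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    # Local density rho_i; a Gaussian kernel avoids the many ties that a
    # hard cutoff produces for small numbers of points.
    rho = [sum(math.exp(-(dist[i][j] / d_c) ** 2) for j in range(n) if j != i)
           for i in range(n)]

    # delta_i: distance to the nearest point of higher density.
    order = sorted(range(n), key=lambda i: -rho[i])  # decreasing density
    delta = [0.0] * n
    nearest_higher = [-1] * n
    for rank, i in enumerate(order):
        if rank == 0:
            delta[i] = max(dist[i])  # convention for the densest point
            continue
        j = min(order[:rank], key=lambda k: dist[i][k])
        delta[i] = dist[i][j]
        nearest_higher[i] = j

    # Centers combine high density with a large distance to denser points.
    centers = sorted(range(n), key=lambda i: -rho[i] * delta[i])[:n_clusters]
    labels = [-1] * n
    for cluster_id, i in enumerate(centers):
        labels[i] = cluster_id

    # Every other point joins the cluster of its nearest denser neighbour;
    # walking in order of decreasing density guarantees that neighbour is
    # already labelled.
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[nearest_higher[i]]
    return labels, centers


if __name__ == "__main__":
    pts = [(0, 0), (0.1, 0), (0, 0.1), (0.1, 0.1), (5, 5), (5.1, 5), (5, 5.1)]
    print(density_peak_clusters(pts, d_c=0.5, n_clusters=2))
```

In this toy input the first four points and the last three points end up in two separate clusters. With real data, the result depends on the choice of `d_c` and on how many centers are requested.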