XSCALE ISOCLUSTER

Revision as of 19:17, 1 May 2017 by Kay

xscale_isocluster is a program that clusters the datasets stored in a single unmerged reflection file as written by XSCALE. It implements the method of Brehm and Diederichs (2014) and the theory of Diederichs (2017).

The help output (obtained by using the -h option) is

xscale_isocluster KD 2016-12-20. -h option shows options
Academic use only; no redistribution. Expires 2017-12-31
usage: xscale_isocluster -dmin <lowres> -dmax <highres> -nbin <nbin> -mode <1 or 2> -dim <dim> -clu <#> -cen <#,#,#,...> -<aiw> XSCALE_FILE
dmax, dmin (default from file) and nbin (default 10) have the usual meanings.
mode can be 1 (equal volumes of resolution shells) or 2 (increasing volumes; default).
dim is number of dimensions (default 3).
clu (by default automatically) is number of clusters.
cen (by default automatically) is a set of cluster centers (up to 9) specified by their ISET.
cen must be specified after clu, and the number of values given must match clu.
   -a: base calculations on anomalous (instead of isomorphous) signal
   -i: write pseudo-PDB files to visualize clusters
   -w: no weighting of intensities with their sigmas
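
The rules that the help output states for -clu and -cen can be checked before the program is launched. The following is a minimal sketch, not part of xscale_isocluster: the helper name build_command and the file name XSCALE.HKL are assumptions made for illustration, and only options listed in the help output above are used.

```python
def build_command(xscale_file, dim=None, clu=None, cen=None, anomalous=False):
    """Assemble an argument list for xscale_isocluster (hypothetical helper)."""
    args = ["xscale_isocluster"]
    if dim is not None:
        args += ["-dim", str(dim)]
    if cen is not None:
        # The help output requires -clu before -cen, with matching counts.
        if clu is None:
            raise ValueError("-cen requires -clu")
        if len(cen) != clu:
            raise ValueError("number of -cen values must match -clu")
        if len(cen) > 9:
            raise ValueError("at most 9 cluster centers can be given")
    if clu is not None:
        args += ["-clu", str(clu)]
    if cen is not None:
        # Centers are given by their ISET, as a comma-separated list.
        args += ["-cen", ",".join(str(iset) for iset in cen)]
    if anomalous:
        args.append("-a")
    args.append(xscale_file)  # the XSCALE_FILE comes last
    return args


if __name__ == "__main__":
    # XSCALE.HKL is an assumed file name; use the actual XSCALE output file.
    print(" ".join(build_command("XSCALE.HKL", dim=3, clu=2, cen=[1, 5])))
```

The returned list can be passed to subprocess.run if the program is to be started from Python.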

Usage

For dataset analysis, the program uses the method of Brehm and Diederichs (2014) Acta Cryst D70, 101-109, whose theoretical background is given in Diederichs (2017) Acta Cryst D73, 286-293 (open access). The method arranges the N datasets as N vectors in a low-dimensional space. The dimension n of that space is typically chosen between 2 and 4, but may be higher if N is large.

n=1 would be suitable if the datasets differ only in their random error. Each additional systematic property that may vary between the datasets requires one more dimension. For example, n=2 is suitable if the datasets differ only in their indexing mode (provided there are exactly two alternatives), or in one other systematic property such as the length of the a axis. Higher values (e.g. n=4) are appropriate if there are 4 indexing possibilities (which is the case in P3x), or if there are several systematic ways in which the datasets may differ (such as significant variations in the a, b and c axes). When datasets differ, for instance, in the composition or conformation of the crystallized molecules, the appropriate value of n is not known a priori; several values need to be tried and the results inspected.

The program attempts to identify clusters of datasets automatically. It writes a file called XSCALE.1.INP with the lines required for scaling the datasets of cluster 1, XSCALE.2.INP for cluster 2, and so on. Typically, one may want to create directories cluster1, cluster2, ..., and then establish in each of them a symlink (called XSCALE.INP) to the corresponding XSCALE.#.INP file. This enables separate scaling of each cluster.
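
The directory and symlink setup described above can be scripted. The following is a minimal sketch, assuming a POSIX file system and that the XSCALE.#.INP files are in the working directory; the function name link_cluster_inputs is hypothetical.

```python
import os
from pathlib import Path


def link_cluster_inputs(workdir="."):
    """Create cluster<k>/XSCALE.INP symlinks pointing to ../XSCALE.<k>.INP."""
    workdir = Path(workdir)
    links = []
    for inp in sorted(workdir.glob("XSCALE.*.INP")):
        number = inp.name.split(".")[1]
        if not number.isdigit():
            continue  # skip anything that is not XSCALE.<number>.INP
        cluster_dir = workdir / f"cluster{number}"
        cluster_dir.mkdir(exist_ok=True)
        link = cluster_dir / "XSCALE.INP"
        if not link.exists() and not link.is_symlink():
            # A relative target keeps the link valid if the tree is moved.
            os.symlink(os.path.join("..", inp.name), link)
        links.append(link)
    return links
```

Calling link_cluster_inputs() in the directory where xscale_isocluster was run leaves one ready-to-use XSCALE.INP per cluster directory. Relative INPUT_FILE paths inside the files are presumably resolved from the directory in which XSCALE is started, so they may need to be adjusted before scaling.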

Furthermore, a file iso.pdb is produced that should be loaded into coot. Then use Show/Cell and Symmetry/Show unit cell, and visualize the relations between datasets. Optionally, individual iso.x.pdb files can be written for each cluster. For an example, see SSX.

Output

The console output gives informational and error messages. Each file XSCALE.x.INP lists the contributing INPUT_FILEs in order of increasing angular distance (dist) from the center of the cluster. Example:

UNIT_CELL_CONSTANTS=  91.490  91.490   68.790   90.000   90.000  120.000
SPACE_GROUP_NUMBER= 145
OUTPUT_FILE=XSCALE.1.HKL
FRIEDEL'S_LAW=FALSE
SAVE_CORRECTION_IMAGES=FALSE
WFAC1=1
INPUT_FILE=../x4/XDS_ASCII.HKL
!new, old ISET=      1      3 strength,dist,cluster=     0.855     0.035      1
!INCLUDE_RESOLUTION_RANGE=00 00
INPUT_FILE=../x3/XDS_ASCII.HKL
!new, old ISET=      2      2 strength,dist,cluster=     0.861     0.045      1
!INCLUDE_RESOLUTION_RANGE=00 00
INPUT_FILE=../x9/XDS_ASCII.HKL
!new, old ISET=      3      7 strength,dist,cluster=     0.852     0.112      1
!INCLUDE_RESOLUTION_RANGE=00 00
INPUT_FILE=../x7/XDS_ASCII.HKL
!new, old ISET=      4      6 strength,dist,cluster=     0.902     0.155      1
!INCLUDE_RESOLUTION_RANGE=00 00
INPUT_FILE=../x1/XDS_ASCII.HKL
!new, old ISET=      5      1 strength,dist,cluster=     0.749     0.173      1
!INCLUDE_RESOLUTION_RANGE=00 00
INPUT_FILE=../x5/XDS_ASCII.HKL
!new, old ISET=      6      4 strength,dist,cluster=     0.678     0.223      1
!INCLUDE_RESOLUTION_RANGE=00 00
INPUT_FILE=../x6//XDS_ASCII.HKL
!new, old ISET=      7      5 strength,dist,cluster=     0.788     0.406      1
!INCLUDE_RESOLUTION_RANGE=00 00

Each INPUT_FILE line is followed by a comment line. In this, the first two numbers (new and old) give the number of the dataset in the resulting XSCALE.x.INP and in the original XSCALE.INP (which produced XSCALE_FILE), respectively. strength is the length of the dataset's vector, which is inversely proportional to the random noise in the dataset. dist is the angle, in radians, between the dataset's vector and the center of the cluster (e.g. a value of 1.57 would mean 90 degrees); the lower the value, the closer the dataset is to the center. cluster, if negative, identifies a dataset that is outside the core of the cluster. To select good datasets and reject bad ones, the user may comment out the INPUT_FILE lines of datasets that are far away in angle or outside the core of the cluster.
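
The selection described above can be prepared by reading the comment lines programmatically. The following is a minimal sketch that assumes the comment lines have exactly the layout of the example; the names Dataset, parse_cluster_inp and select_core, and the 15 degree cutoff, are illustrative assumptions rather than recommendations of the program.

```python
import math
import re
from typing import NamedTuple


class Dataset(NamedTuple):
    input_file: str
    new_iset: int
    old_iset: int
    strength: float
    dist: float  # angle to the cluster center, in radians
    cluster: int


COMMENT = re.compile(
    r"!new, old ISET=\s*(\d+)\s+(\d+)\s+"
    r"strength,dist,cluster=\s*([\d.]+)\s+([\d.]+)\s+(-?\d+)"
)


def parse_cluster_inp(text):
    """Pair each INPUT_FILE line with the comment line that follows it."""
    datasets, current_file = [], None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("INPUT_FILE="):
            current_file = line.split("=", 1)[1].strip()
            continue
        m = COMMENT.match(line)
        if m and current_file is not None:
            datasets.append(Dataset(current_file, int(m[1]), int(m[2]),
                                    float(m[3]), float(m[4]), int(m[5])))
            current_file = None
    return datasets


def select_core(datasets, max_angle_deg=15.0):
    """Keep datasets in the cluster core and within an (assumed) angle cutoff."""
    return [d for d in datasets
            if d.cluster >= 0 and math.degrees(d.dist) <= max_angle_deg]
```

Applied to the example file above, a 15 degree cutoff would retain all datasets except ../x6//XDS_ASCII.HKL, whose dist of 0.406 corresponds to about 23 degrees. A suitable cutoff depends on the data and has to be judged by the user.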

Notes

  • For meaningful results, the number of known values (the N*(N-1)/2 pairwise correlation coefficients) should be (preferably much) higher than the number of unknowns (1+n*(N-1)). This means that one needs at least 5 datasets if dim=2, and at least 7 if dim=3.
  • The clustering of datasets in the low-dimensional space uses the method of Rodriguez and Laio (2014) Science 344, 1492-1496.
  • Limitation: the program does not work if the XSCALE.INP that produced the XSCALE_FILE has more than one OUTPUT_FILE. This is because the dataset numbers in XSCALE_FILE then do not start from 1. Workaround: do several XSCALE runs, one for each OUTPUT_FILE.
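
The minimum numbers of datasets quoted in the first note follow directly from the two counts given there. A small sketch of that arithmetic (the function names are illustrative only):

```python
def n_knowns(N):
    """Number of pairwise correlation coefficients among N datasets."""
    return N * (N - 1) // 2


def n_unknowns(N, n):
    """Number of unknowns for N datasets in n dimensions: 1 + n*(N-1)."""
    return 1 + n * (N - 1)


def min_datasets(n):
    """Smallest N for which the knowns exceed the unknowns."""
    N = 2
    while n_knowns(N) <= n_unknowns(N, n):
        N += 1
    return N


if __name__ == "__main__":
    for n in (2, 3):
        N = min_datasets(n)
        print(f"dim={n}: N>={N} ({n_knowns(N)} knowns, {n_unknowns(N, n)} unknowns)")
    # dim=2: N>=5 (10 knowns, 9 unknowns)
    # dim=3: N>=7 (21 knowns, 19 unknowns)
```

These are strict minima; as the note says, considerably more datasets are preferable.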