XSCALE ISOCLUSTER: Difference between revisions

From XDSwiki
No edit summary
 
(18 intermediate revisions by 2 users not shown)
Line 1: Line 1:
[ftp://turn5.biologie.uni-konstanz.de/pub/xscale_isocluster_linux.bz2 xscale_isocluster] is a program that clusters datasets stored in a single unmerged reflection file as written by [[XSCALE]]. It implements the method of [https://doi.org/10.1107/S1399004713025431 Brehm and Diederichs (2014)] and theory of [https://doi.org/10.1107/S2059798317000699 Diederichs (2017)].
xscale_isocluster [https://{{SERVERNAME}}/pub/linux_bin/xscale_isocluster (Linux binary)][https://{{SERVERNAME}}/pub/mac_bin/xscale_isocluster (Mac binary)] is a program that clusters datasets stored in a single unmerged reflection file as written by [[XSCALE]]. It implements the method of [https://doi.org/10.1107/S1399004713025431 Brehm and Diederichs (2014)] and theory of [https://doi.org/10.1107/S2059798317000699 Diederichs (2017)].


The help output (obtained by using the <code>-h</code> option) is
The help output (obtained by using the <code>-h</code> option) is
<pre>
<pre>
xscale_isocluster KD 2016-12-20. -h option shows options
xscale_isocluster KD 2020-07-27. -h option shows options
Academic use only; no redistribution. Expires 2017-12-31
Licensed for academic use only; no redistribution.
usage: xscale_isocluster -dmin <lowres> -dmax <highres> -nbin <nbin> -mode <1 or 2> -dim <dim> -clu <#> -cen <#,#,#,...> -<aiw> XSCALE_FILE
Please cite Assmann,Wang and Diederichs (2020) Acta Cryst D76, 636.
dmax, dmin (default from file) and nbin (default 10) have the usual meanings.
usage: xscale_isocluster -dmin <lowres> -dmax <highres> -nbin <nbin> -mode <1 or 2> -dim <dim> -clu <#> -cen <#,#,#,...> -<aAisw> XSCALE_FILE_NAME
highres, lowres and nbin (default=1 for low, and 10 for high multiplicity) have the usual meanings.
The default of highres and lowres is the COMMON resolution range of all datasets in XSCALE_FILE.
mode can be 1 (equal volumes of resolution shells) or 2 (increasing volumes; default).
mode can be 1 (equal volumes of resolution shells) or 2 (increasing volumes; default).
dim is number of dimensions (default 3).
dim is number of dimensions (default 3).
clu (by default automatically) is number of clusters.
clu (by default 1) is number of clusters.
cen (by default automatically) is a set of cluster centers (up to 9) specified by their ISET.
cen (by default automatically) is a set of cluster centers (up to 9) specified by their ISET.
cen must be specified after clu, and the number of values given must match clu.
cen must be specified after clu, and the number of values given must match clu.
  -a: base calculations on anomalous (instead of isomorphous) signal
-a: use anomalous (instead of isomorphous) signal only (unsuitable for partial datasets)
  -i: write pseudo-PDB files to visualize clusters
-A: account for anomalous signal by separating I+ and I- (suitable for partial datasets)
  -w: no weighting of intensities with their sigmas
-i: write individual pseudo-PDB files to visualize clusters
-s: scale (default 1) for WEIGHT=1/cos(scale*angle) values written to XSCALE.*.INP file(s)
-w: no weighting of intensities with their sigmas
The XSCALE.?.INP output files have angles [deg] w.r.t. cluster centers
 
</pre>
</pre>


== Usage ==
== Usage ==


For dataset analysis, the program uses the method of [https://dx.doi.org/10.1107/S1399004713025431 Brehm and Diederichs (2014) ''Acta Cryst'' '''D70''', 101-109] ([https://kops.uni-konstanz.de/bitstream/handle/123456789/26319/Brehm_263191.pdf?sequence=2&isAllowed=y PDF]) whose theoretical background is in [https://doi.org/10.1107/S2059798317000699 Diederichs (2017) ''Acta Cryst'' '''D73''', 286-293] (open access). This results in an arrangement of N datasets represented by N vectors in a low-dimensional space. Typically, the dimension of that space may be chosen as n=2 to 4, but may be higher if N is large.  
For data set analysis, the program uses the method of [https://dx.doi.org/10.1107/S1399004713025431 Brehm and Diederichs (2014) ''Acta Cryst'' '''D70''', 101-109] ([https://kops.uni-konstanz.de/bitstream/handle/123456789/26319/Brehm_263191.pdf?sequence=2&isAllowed=y PDF]) whose theoretical background is in [https://doi.org/10.1107/S2059798317000699 Diederichs (2017) ''Acta Cryst'' '''D73''', 286-293] (open access). This results in segmentation, i.e. an arrangement of the N datasets represented by N vectors in a low-dimensional space. Typically, the dimension of that space may be chosen as n=2 to 4, but may be higher if N is large.
 
n=1 would only be suitable if the data sets only differ in their random error (i.e. they are highly isomorphous).  One more dimension is required for each additional systematic property which may vary between the data sets, e.g. n=2 is suitable if they only differ in their indexing mode (which then only should have two alternatives!), or in some other systematic property, like the length of a cell axis. Higher values of n (e.g. n=4) are appropriate if e.g. there are 4 indexing possibilities (which is the case in P3<sub>x</sub>), or more systematic ways in which the data sets may differ (like significant variations in the a, b and c axes), or conformational or compositional differences. In cases where data sets differ e.g. with respect to the composition or conformation of crystallized molecules, it is ''a priori'' unknown which value of n should be chosen, so several values need to be tried, and the results inspected (see [[Xscale_isocluster#Notes]]).
 
After segmentation of data sets in n-dimensional space, the program may be used (by specifying the -clu <m> option; default m=1) to try and identify <m> clusters of data sets. The program writes files called XSCALE.1.INP with lines required for scaling the datasets of cluster 1, and similarly XSCALE.2.INP for cluster 2, and so on. Typically, one may want to create directories cluster1 cluster2 ..., and then establish symlinks (called XSCALE.INP) in these to the XSCALE.#.INP files. This enables separate scaling of each cluster.


n=1 would be suitable if the datasets only differ in their random error (i.e. they are highly isomorphous).  One more dimension is required for each additional systematic property which may vary between the datasets, e.g. n=2 is suitable if they only differ in their indexing mode (which then only should have two alternatives!), or in some other systematic property, like the length of a cell axis. Higher values of n (e.g. n=4) are appropriate if e.g. there are 4 indexing possibilities (which is the case in P3<sub>x</sub>), or more systematic ways in which the datasets may differ (like significant variations in the a, b and c axes), or conformational or compositional differences. In cases where datasets differ e.g. with respect to the composition or conformation of crystallized molecules, it is ''a priori'' unknown which value of n should be chosen, so several values need to be tried, and the results inspected (see [[Xscale_isocluster#Notes]]).
Furthermore, a file iso.pdb is produced that may be loaded into coot. Then use Show/Cell and Symmetry/Show unit cell (to see the origin, which coot marks with "0"), and visualize the relations between data sets. Systematic differences are related to the angle (with the tip of the angle at the origin) between the vectors that represent the data sets; ideally, in the case of isomorphous data sets all vectors point into the same direction. Random differences are related to the lengths of the vectors (starting at the origin; short vectors correspond to weak/noisy data sets). With the -i option, individual iso.x.pdb files can be written for each cluster. For an example, see [[SSX]].


An attempt is made to automatically identify clusters of datasets. The program writes files called XSCALE.1.INP with lines required for scaling the datasets of cluster 1, and similarly XSCALE.2.INP for cluster 2, and so on. Typically, one may want to create directories cluster1 cluster2 ..., and then establish symlinks (called XSCALE.INP) in these to the XSCALE.#.INP files. This enables separate scaling of each cluster.
=== Resolving indexing ambiguity ===
A useful set of options for resolving an indexing ambiguity is shown in the following example:
xscale_isocluster -i -dim 2 -clu 2 -dmin 20 -dmax 2.5 XSCALE.HKL
coot iso.1.pdb iso.2.pdb
Specifying dmin and dmax is often needed because a) the high resolution is often weak and noisy, and should be excluded, and b) the default of the program is to choose the resolution range that is covered by all datasets, which may be quite narrow.


Furthermore, a file iso.pdb is produced that should be loaded into coot. Then use Show/Cell and Symmetry/Show unit cell, and visualize the relations between datasets. Optionally, individual iso.x.pdb files can be written for each cluster. For an example, see [[SSX]].
In coot, use Draw->Cell-&-Symmetry->Show Unit Cells->Yes and use the Display Manager to switch the pseudo-PDB files on/off. In this 2-dimensional analysis, all pseudo-atoms are in the AB plane. The two clusters may be situated at a right angle; one of these corresponds to one (internally consistent) way of indexing, the other to the other way of indexing. This means that after each INPUT_FILE= line of one of the two XSCALE.?.INP files, a re-indexing line such as
REIDX_ISET= -1 0 0 0  0 -1 0 0  0 0 1 0
should be added which would be valid for spacegroups 150, 152 and 154 (and 143-145 for which there are also other possibilities), or
REIDX_ISET= 0 1 0 0  1 0 0 0  0 0 -1 0
which would be valid for spacegroups 75-80, 149, 151, 153, 168-173, and 195-199 (and 143-146 for which there are also other possibilities) (see [[Space_group_determination#Table_of_space_groups_by_Laue_class_and_Bravais_type]] and https://www.ccp4.ac.uk/html/reindexing.html).


== Output ==
== Output ==
Line 31: Line 49:
The console output gives informational and error messages. Each file XSCALE.x.INP enumerates the contributing INPUT_FILEs in the order of increasing angular distance. Example:  
The console output gives informational and error messages. Each file XSCALE.x.INP enumerates the contributing INPUT_FILEs in the order of increasing angular distance. Example:  
<pre>
<pre>
UNIT_CELL_CONSTANTS= 91.490 91.490  68.790   90.000  90.000  120.000
UNIT_CELL_CONSTANTS=   88.740  88.740 104.930   90.000  90.000  120.000
SPACE_GROUP_NUMBER= 145
SPACE_GROUP_NUMBER= 152
OUTPUT_FILE=XSCALE.1.HKL
OUTPUT_FILE=XSCALE.1.HKL
FRIEDEL'S_LAW=FALSE
SAVE_CORRECTION_IMAGES=FALSE
SAVE_CORRECTION_IMAGES=FALSE
WFAC1=1
PRINT_CORRELATIONS=FALSE
INPUT_FILE=../x4/XDS_ASCII.HKL
WFAC1=1.25 ! XDS/XSCALE defaults are 1.0/1.5
!new, old ISET=      1     3 strength,dist,cluster=    0.855    0.035      1
INPUT_FILE=../xds_ss091d3chip/1501_1506/XDS_ASCII.HKL
!INCLUDE_RESOLUTION_RANGE=00 00
!new, old ISET=      1   134 length=CC*,angle,cluster=    0.120     0.4      1
INPUT_FILE=../x3/XDS_ASCII.HKL
!new, old ISET=     2      2 strength,dist,cluster=    0.861    0.045     1
!INCLUDE_RESOLUTION_RANGE=00 00
INPUT_FILE=../x9/XDS_ASCII.HKL
!new, old ISET=     3      7 strength,dist,cluster=    0.852     0.112      1
!INCLUDE_RESOLUTION_RANGE=00 00
INPUT_FILE=../x7/XDS_ASCII.HKL
!new, old ISET=      4     6 strength,dist,cluster=    0.902    0.155     1
!INCLUDE_RESOLUTION_RANGE=00 00
!INCLUDE_RESOLUTION_RANGE=00 00
INPUT_FILE=../x1/XDS_ASCII.HKL
WEIGHT=    1.000
!new, old ISET=      5      1 strength,dist,cluster=    0.749     0.173     1
INPUT_FILE=../xds_ss091c10chip/2281_2286/XDS_ASCII.HKL
!new, old ISET=      2    96 length=CC*,angle,cluster=    0.922     1.9     1
!INCLUDE_RESOLUTION_RANGE=00 00
!INCLUDE_RESOLUTION_RANGE=00 00
INPUT_FILE=../x5/XDS_ASCII.HKL
WEIGHT=    1.001
!new, old ISET=      6      4 strength,dist,cluster=    0.678     0.223     1
INPUT_FILE=../xds_ss091b11chip/751_756/XDS_ASCII.HKL
!new, old ISET=      3    46 length=CC*,angle,cluster=    0.556     2.1     1
!INCLUDE_RESOLUTION_RANGE=00 00
!INCLUDE_RESOLUTION_RANGE=00 00
INPUT_FILE=../x6/XDS_ASCII.HKL
WEIGHT=    1.001
!new, old ISET=      7      5 strength,dist,cluster=    0.788    0.406     1
INPUT_FILE=../xds_ss091a11chip/121_126/XDS_ASCII.HKL
!new, old ISET=      4    14 length=CC*,angle,cluster=    0.602    22.8     1
!INCLUDE_RESOLUTION_RANGE=00 00
!INCLUDE_RESOLUTION_RANGE=00 00
...
</pre>
</pre>


Each INPUT_FILE line is followed by a comment line. In this, the first two numbers (''new'' and ''old'') refer to the numbering of datasets in the resulting XSCALE.#.INP,  ''versus'' that in the original XSCALE.INP (which produced XSCALE_FILE). Then, ''dist'' refers to arccosine of the angle (e.g. a value of 1.57 would mean 90 degrees) to the center of the cluster (the lower the better/closer), ''strength'' refers to vector length which is inversely proportional to the random noise in a data set, and ''cluster'', if negative, identifies a dataset that is outside the core of the cluster. To select good datasets and reject bad ones, the user may comment out INPUT_FILE lines which refer to datasets that are far away in angle or outside the core of the cluster. Furthermore, resolution ranges may be specified, possibly based on the output of [[XDSCC12]].
Each INPUT_FILE line is followed by a comment line. In this, the first two numbers (''new'' and ''old'') refer to the numbering of datasets in the resulting XSCALE.#.INP,  ''versus'' that in the original XSCALE.INP (which produced XSCALE_FILE). Then, ''length=CC*,angle,cluster'' refers to vector length which is inversely proportional to the random noise in a data set, to the angle (in degrees) to the center of the cluster (the lower the better/closer), and to ''cluster'', which if negative, identifies a dataset that is outside the core of the cluster. To select good datasets and reject bad ones, the user may comment out INPUT_FILE lines which refer to datasets that are far away in angle or outside the core of the cluster. Furthermore, resolution ranges may be specified, possibly based on the output of [[XDSCC12]].


== Notes ==
== Notes ==
* For meaningful results, the number of known values (N*(N-1)/2 is the number of pairwise correlation coefficients) should be (preferrably much) higher than the number of unknowns (1+n*(N-1)). This means that one needs at least 5 datasets if dim=2, and at least 7 if dim=3.
* For meaningful results, the number of known values [N*(N-1)/2 is the number of pairwise correlation coefficients] should be (preferably much) higher than the number of unknowns (1+n*(N-1)). This means that one needs at least 5 data sets if dim=2, and at least 7 if dim=3.
* The clustering of datasets in a low-dimensional space uses the method of Rodriguez and Laio (2014) ''Science'' '''344''', 1492-1496.
* The clustering of data sets in a low-dimensional space uses the method of Rodriguez and Laio (2014) ''Science'' '''344''', 1492-1496. The clustering result should be checked by the user; one should not rely on this to give sensible results! The main criterion for a cluster should be that all data sets in it are in the same or similar direction, when seen from the origin ("0" in coot) - the length of each vector is not important since it is ''not'' related to the amount of non-isomorphism, but to the strength of the data set.
* The eigenvalues are printed out by the program, and can be used to deduce the proper value of the required dimension n. To make use of this, one should run with a high value of dim (e.g. 5), and inspect the list of eigenvalues with the goal of finding a significant drop in magnitude (e.g. a factor of 3 drop between the second and third eigenvalue would point to the third eigenvector being of low importance).
* A different but related program is [[xds_nonisomorphism]].
* A different but related program is [[xds_nonisomorphism]].

Latest revision as of 13:12, 28 June 2022

xscale_isocluster (Linux binary) (Mac binary) is a program that clusters datasets stored in a single unmerged reflection file as written by XSCALE. It implements the method of Brehm and Diederichs (2014) and the theory of Diederichs (2017).

The help output (obtained by using the -h option) is

xscale_isocluster KD 2020-07-27. -h option shows options
Licensed for academic use only; no redistribution.
Please cite Assmann,Wang and Diederichs (2020) Acta Cryst D76, 636.
usage: xscale_isocluster -dmin <lowres> -dmax <highres> -nbin <nbin> -mode <1 or 2> -dim <dim> -clu <#> -cen <#,#,#,...> -<aAisw> XSCALE_FILE_NAME
highres, lowres and nbin (default=1 for low, and 10 for high multiplicity) have the usual meanings.
The default of highres and lowres is the COMMON resolution range of all datasets in XSCALE_FILE.
mode can be 1 (equal volumes of resolution shells) or 2 (increasing volumes; default).
dim is number of dimensions (default 3).
clu (by default 1) is number of clusters.
cen (by default automatically) is a set of cluster centers (up to 9) specified by their ISET.
cen must be specified after clu, and the number of values given must match clu.
 -a: use anomalous (instead of isomorphous) signal only (unsuitable for partial datasets)
 -A: account for anomalous signal by separating I+ and I- (suitable for partial datasets)
 -i: write individual pseudo-PDB files to visualize clusters
 -s: scale (default 1) for WEIGHT=1/cos(scale*angle) values written to XSCALE.*.INP file(s)
 -w: no weighting of intensities with their sigmas
The XSCALE.?.INP output files have angles [deg] w.r.t. cluster centers
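
The WEIGHT formula quoted for the -s option can be checked by hand. The following is a minimal Python sketch, assuming that the angle is the one reported in degrees in the XSCALE.?.INP files and that it is converted to radians before the cosine is taken; the function name xscale_weight is hypothetical and not part of the program.

```python
import math

def xscale_weight(angle_deg, scale=1.0):
    """WEIGHT = 1/cos(scale*angle); the angle is assumed to be in degrees."""
    return 1.0 / math.cos(math.radians(scale * angle_deg))

# Angles taken from the first three datasets of the sample XSCALE.1.INP
# shown in the Output section of this page.
for angle in (0.4, 1.9, 2.1):
    print(f"angle={angle:5.1f} deg  WEIGHT={xscale_weight(angle):10.3f}")
```

With scale=1 this gives 1.000, 1.001 and 1.001, which agrees with the WEIGHT= lines in that sample file. A larger scale increases the WEIGHT value of datasets that are far from the cluster center.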

Usage

For data set analysis, the program uses the method of Brehm and Diederichs (2014) Acta Cryst D70, 101-109 (PDF) whose theoretical background is in Diederichs (2017) Acta Cryst D73, 286-293 (open access). This results in segmentation, i.e. an arrangement of the N datasets represented by N vectors in a low-dimensional space. Typically, the dimension of that space may be chosen as n=2 to 4, but may be higher if N is large.

n=1 would be suitable only if the data sets differ only in their random error (i.e. they are highly isomorphous). One more dimension is required for each additional systematic property which may vary between the data sets, e.g. n=2 is suitable if they differ only in their indexing mode (which then should have only two alternatives!), or in some other systematic property, like the length of a cell axis. Higher values of n (e.g. n=4) are appropriate if e.g. there are 4 indexing possibilities (which is the case in P3x), or more systematic ways in which the data sets may differ (like significant variations in the a, b and c axes), or conformational or compositional differences. In cases where data sets differ e.g. with respect to the composition or conformation of crystallized molecules, it is a priori unknown which value of n should be chosen, so several values need to be tried, and the results inspected (see Xscale_isocluster#Notes).

After segmentation of data sets in n-dimensional space, the program may be used (by specifying the -clu <m> option; default m=1) to try and identify <m> clusters of data sets. The program writes files called XSCALE.1.INP with lines required for scaling the datasets of cluster 1, and similarly XSCALE.2.INP for cluster 2, and so on. Typically, one may want to create directories cluster1 cluster2 ..., and then establish symlinks (called XSCALE.INP) in these to the XSCALE.#.INP files. This enables separate scaling of each cluster.
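
The directory and symlink setup described above can be scripted. This is a minimal Python sketch, not part of xscale_isocluster; the function name link_cluster_inputs and the choice to place the cluster directories inside the working directory are assumptions made for illustration.

```python
import re
from pathlib import Path

def link_cluster_inputs(workdir=Path(".")):
    """For every XSCALE.<k>.INP in workdir, create cluster<k>/XSCALE.INP
    as a symlink pointing back to that file. Returns the created links."""
    links = []
    for inp in sorted(workdir.glob("XSCALE.*.INP")):
        match = re.fullmatch(r"XSCALE\.(\d+)\.INP", inp.name)
        if match is None:
            continue  # skip anything that is not a numbered cluster file
        cluster_dir = workdir / f"cluster{match.group(1)}"
        cluster_dir.mkdir(exist_ok=True)
        link = cluster_dir / "XSCALE.INP"
        if link.is_symlink() or link.exists():
            link.unlink()  # replace a link left over from an earlier run
        # Relative target, resolved from inside the cluster directory.
        link.symlink_to(Path("..") / inp.name)
        links.append(link)
    return links

# Typical call, run in the directory that holds the XSCALE.#.INP files:
# link_cluster_inputs()
```

Relative INPUT_FILE paths inside the XSCALE.#.INP files are presumably resolved from the directory in which XSCALE is started, so it is worth checking that they still point to the XDS_ASCII.HKL files when XSCALE is run inside a cluster directory.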

Furthermore, a file iso.pdb is produced that may be loaded into coot. Then use Show/Cell and Symmetry/Show unit cell (to see the origin, which coot marks with "0"), and visualize the relations between data sets. Systematic differences are related to the angle (with its vertex at the origin) between the vectors that represent the data sets; ideally, in the case of isomorphous data sets, all vectors point in the same direction. Random differences are related to the lengths of the vectors (starting at the origin; short vectors correspond to weak/noisy data sets). With the -i option, individual iso.x.pdb files can be written for each cluster. For an example, see SSX.
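
The two quantities read off the plot, angle and length, can also be computed directly from coordinates. The sketch below uses made-up 2-dimensional vectors purely for illustration; it does not read iso.pdb, and the function names are hypothetical.

```python
import math

def vector_length(v):
    """Euclidean length of a vector that starts at the origin."""
    return math.sqrt(sum(x * x for x in v))

def angle_deg(a, b):
    """Angle in degrees between two vectors, with its vertex at the origin."""
    cos_ab = sum(x * y for x, y in zip(a, b)) / (vector_length(a) * vector_length(b))
    cos_ab = max(-1.0, min(1.0, cos_ab))  # guard against rounding errors
    return math.degrees(math.acos(cos_ab))

# Made-up coordinates, not taken from a real iso.pdb file.
strong = (0.90, 0.10)   # long vector: little random error
weak = (0.27, 0.03)     # same direction, but shorter: a weaker data set
other = (0.10, 0.90)    # different direction: a systematic difference

print("strong vs weak :", angle_deg(strong, weak), vector_length(weak))
print("strong vs other:", angle_deg(strong, other), vector_length(other))
```

In this toy example the first pair differs only in length (random error), whereas the second pair differs in direction (systematic difference).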

Resolving indexing ambiguity

A useful set of options for resolving an indexing ambiguity is shown in the following example:

xscale_isocluster -i -dim 2 -clu 2 -dmin 20 -dmax 2.5 XSCALE.HKL
coot iso.1.pdb iso.2.pdb

Specifying dmin and dmax is often needed because a) the high-resolution data are often weak and noisy and should be excluded, and b) by default the program chooses the resolution range that is covered by all datasets, which may be quite narrow.

In coot, use Draw->Cell-&-Symmetry->Show Unit Cells->Yes and use the Display Manager to switch the pseudo-PDB files on/off. In this 2-dimensional analysis, all pseudo-atoms are in the AB plane. The two clusters may be situated at a right angle; one of these corresponds to one (internally consistent) way of indexing, the other to the other way of indexing. This means that after each INPUT_FILE= line of one of the two XSCALE.?.INP files, a re-indexing line such as

REIDX_ISET= -1 0 0 0  0 -1 0 0  0 0 1 0

should be added which would be valid for spacegroups 150, 152 and 154 (and 143-145 for which there are also other possibilities), or

REIDX_ISET= 0 1 0 0  1 0 0 0  0 0 -1 0

which would be valid for spacegroups 75-80, 149, 151, 153, 168-173, and 195-199 (and 143-146 for which there are also other possibilities) (see Space_group_determination#Table_of_space_groups_by_Laue_class_and_Bravais_type and https://www.ccp4.ac.uk/html/reindexing.html).
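
Adding the re-indexing line after every INPUT_FILE= line can be automated. The following Python sketch works on the text of one XSCALE.?.INP file; the function name add_reidx is hypothetical, and which of the two matrices applies (if any) has to be decided by the user from the space group as described above.

```python
# The two re-indexing matrices quoted in the text above.
REIDX_A = "-1 0 0 0  0 -1 0 0  0 0 1 0"
REIDX_B = "0 1 0 0  1 0 0 0  0 0 -1 0"

def add_reidx(inp_text, reidx):
    """Return inp_text with a REIDX_ISET= line after each active INPUT_FILE= line.

    Lines that are commented out (starting with "!") are left untouched.
    """
    out = []
    for line in inp_text.splitlines():
        out.append(line)
        if line.lstrip().startswith("INPUT_FILE="):
            out.append(f"REIDX_ISET= {reidx}")
    return "\n".join(out) + "\n"

# Typical use (the file name is only an example):
# from pathlib import Path
# path = Path("XSCALE.2.INP")
# path.write_text(add_reidx(path.read_text(), REIDX_A))
```

Only one of the two XSCALE.?.INP files should be modified in this way, since the other cluster defines the reference indexing.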

Output

The console output gives informational and error messages. Each file XSCALE.x.INP enumerates the contributing INPUT_FILEs in the order of increasing angular distance. Example:

UNIT_CELL_CONSTANTS=   88.740   88.740  104.930   90.000   90.000  120.000
SPACE_GROUP_NUMBER= 152
OUTPUT_FILE=XSCALE.1.HKL
SAVE_CORRECTION_IMAGES=FALSE
PRINT_CORRELATIONS=FALSE
WFAC1=1.25 ! XDS/XSCALE defaults are 1.0/1.5
INPUT_FILE=../xds_ss091d3chip/1501_1506/XDS_ASCII.HKL
!new, old ISET=      1    134 length=CC*,angle,cluster=     0.120     0.4      1
!INCLUDE_RESOLUTION_RANGE=00 00
WEIGHT=     1.000
INPUT_FILE=../xds_ss091c10chip/2281_2286/XDS_ASCII.HKL
!new, old ISET=      2     96 length=CC*,angle,cluster=     0.922     1.9      1
!INCLUDE_RESOLUTION_RANGE=00 00
WEIGHT=     1.001
INPUT_FILE=../xds_ss091b11chip/751_756/XDS_ASCII.HKL
!new, old ISET=      3     46 length=CC*,angle,cluster=     0.556     2.1      1
!INCLUDE_RESOLUTION_RANGE=00 00
WEIGHT=     1.001
INPUT_FILE=../xds_ss091a11chip/121_126/XDS_ASCII.HKL
!new, old ISET=      4     14 length=CC*,angle,cluster=     0.602    22.8      1
!INCLUDE_RESOLUTION_RANGE=00 00
...

Each INPUT_FILE line is followed by a comment line. In this, the first two numbers (new and old) refer to the numbering of datasets in the resulting XSCALE.#.INP, versus that in the original XSCALE.INP (which produced XSCALE_FILE). Then, length=CC*,angle,cluster introduces three values: the vector length, which is inversely proportional to the random noise in a data set; the angle (in degrees) to the center of the cluster (the lower the better/closer); and the cluster number, which, if negative, identifies a dataset that is outside the core of the cluster. To select good datasets and reject bad ones, the user may comment out INPUT_FILE lines which refer to datasets that are far away in angle or outside the core of the cluster. Furthermore, resolution ranges may be specified, possibly based on the output of XDSCC12.
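
The selection described above can be sketched in Python. The function below is a hypothetical helper, not part of the program: it assumes the comment-line layout shown in the example, and it comments out the whole block of a rejected dataset (the INPUT_FILE line and the lines that follow it up to the next INPUT_FILE), which is a choice made here so that no WEIGHT= line is left behind without its INPUT_FILE.

```python
import re

# Comment line that follows each INPUT_FILE= line, e.g.
# "!new, old ISET=   4   14 length=CC*,angle,cluster=   0.602   22.8   1"
ISET_RE = re.compile(
    r"!new, old ISET=\s*(\d+)\s+(\d+)\s+length=CC\*,angle,cluster="
    r"\s*(-?[\d.]+)\s+(-?[\d.]+)\s+(-?\d+)"
)

def comment_out_distant(inp_text, max_angle_deg):
    """Comment out dataset blocks with angle > max_angle_deg or cluster < 0."""
    lines = inp_text.splitlines()
    starts = [i for i, line in enumerate(lines) if line.startswith("INPUT_FILE=")]
    if not starts:
        return inp_text
    out = lines[: starts[0]]           # header lines are kept unchanged
    bounds = starts + [len(lines)]
    for begin, end in zip(bounds, bounds[1:]):
        block = lines[begin:end]
        reject = False
        for line in block:
            match = ISET_RE.match(line)
            if match:
                angle = float(match.group(4))
                cluster = int(match.group(5))
                reject = angle > max_angle_deg or cluster < 0
        if reject:
            block = [line if line.startswith("!") else "!" + line for line in block]
        out.extend(block)
    return "\n".join(out) + "\n"
```

Applied to the example above with a limit of, say, 5 degrees, the first three datasets would be kept and the fourth (angle 22.8) would be commented out. The limit itself is arbitrary and should be chosen after inspecting the angles and the plot in coot.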

Notes

  • For meaningful results, the number of known values [N*(N-1)/2 is the number of pairwise correlation coefficients] should be (preferably much) higher than the number of unknowns (1+n*(N-1)). This means that one needs at least 5 data sets if dim=2, and at least 7 if dim=3.
  • The clustering of data sets in a low-dimensional space uses the method of Rodriguez and Laio (2014) Science 344, 1492-1496. The clustering result should be checked by the user; one should not rely on this to give sensible results! The main criterion for a cluster should be that all data sets in it are in the same or similar direction, when seen from the origin ("0" in coot) - the length of each vector is not important since it is not related to the amount of non-isomorphism, but to the strength of the data set.
  • The eigenvalues are printed out by the program, and can be used to deduce the proper value of the required dimension n. To make use of this, one should run with a high value of dim (e.g. 5), and inspect the list of eigenvalues with the goal of finding a significant drop in magnitude (e.g. a factor of 3 drop between the second and third eigenvalue would point to the third eigenvector being of low importance).
  • A different but related program is xds_nonisomorphism.
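
Two of the notes above lend themselves to a small calculation. The Python sketch below first finds the smallest N for which the number of pairwise correlation coefficients exceeds the number of unknowns, and then looks for the first large drop in a list of eigenvalues. Both function names are hypothetical, the eigenvalues are made up, and the list is assumed to be positive and sorted in decreasing order; the program's own output format is not parsed here.

```python
def min_datasets(dim):
    """Smallest N with N*(N-1)/2 > 1 + dim*(N-1)."""
    n_sets = 2
    while n_sets * (n_sets - 1) // 2 <= 1 + dim * (n_sets - 1):
        n_sets += 1
    return n_sets

def suggest_dim(eigenvalues, drop_factor=3.0):
    """Number of leading eigenvalues before the first drop by drop_factor.

    Returns None if no such drop is found.
    """
    for k in range(1, len(eigenvalues)):
        if eigenvalues[k - 1] >= drop_factor * eigenvalues[k]:
            return k
    return None

for dim in (2, 3):
    print(f"dim={dim}: at least {min_datasets(dim)} data sets")

# Made-up eigenvalues with a factor-of-4 drop after the second value.
print("suggested dim:", suggest_dim([10.0, 6.0, 1.5, 1.2, 1.0]))
```

The first loop reproduces the minimum numbers given in the note (5 data sets for dim=2 and 7 for dim=3). The suggested dimension is only a starting point; as stated above, the result should still be inspected.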