Cc analysis

From XDSwiki
Latest revision as of 10:24, 11 August 2023

The purpose of cc_analysis is to find a representation in low-dimensional space (e.g. 2D, 3D, 4D) that displays the relations of data sets to each other. For this, the procedure needs the values of correlation coefficients (CCs) calculated between the data sets (e.g. crystallographic intensities, or pixels of images). (Not all theoretically possible pairwise CCs are needed, but of course, the more the better.)

Since the data sets are composed of many measurements, they could be thought of as residing in a high-dimensional space: in case of crystallography, that dimension is the number of unique reflections; in case of images, the number of pixels.

How is the low-dimensional representation found? If N is the number of data sets and <math>cc_{ij}</math> denotes the correlation coefficient between data sets i and j, cc_analysis minimizes

<math>\Phi(\mathbf{x})=\sum_{i=1}^{N-1}\sum_{j=i+1}^{N}\left(cc_{ij}-x_{i}\cdot x_{j}\right)^{2}</math>

as a function of the vector <math>\mathbf{x}</math>, the column vector of the N low-dimensional vectors <math>\{x_{k}\}</math>. This can be performed by minimizing from random starting positions, or more elegantly and efficiently by obtaining starting positions through eigendecomposition after estimating the missing values of the matrix of correlation coefficients.
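For illustration, the brute-force variant can be sketched in a few lines of Python (a minimal demonstration, not the actual program): minimize <math>\Phi</math> by plain gradient descent from a random start, given a list of known pairwise CCs. The pair list, learning rate and iteration count below are arbitrary choices for the demo.

```python
import numpy as np

def fit_vectors(pairs, n, dim, iters=5000, lr=0.05, seed=0):
    """Minimize phi(x) = sum over known pairs of (cc_ij - x_i . x_j)^2
    by plain gradient descent from a random start.
    pairs: list of (i, j, cc) with 0-based data-set indices."""
    rng = np.random.default_rng(seed)
    x = 0.1 * rng.standard_normal((n, dim))
    for _ in range(iters):
        grad = np.zeros_like(x)
        for i, j, cc in pairs:
            r = cc - x[i] @ x[j]        # residual of this pair
            grad[i] -= 2.0 * r * x[j]   # d phi / d x_i
            grad[j] -= 2.0 * r * x[i]   # d phi / d x_j
        x -= lr * grad
    return x

# three data sets with made-up pairwise CCs
pairs = [(0, 1, 0.72), (0, 2, 0.63), (1, 2, 0.56)]
x = fit_vectors(pairs, n=3, dim=2)
# the fitted dot products reproduce the input CCs
print([round(float(x[i] @ x[j]), 2) for i, j, _ in pairs])
```

The eigendecomposition-based start used by the real program reaches the same minimum much faster; the gradient-descent version only shows what is being optimized.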

As the resulting vectors <math>\{x_{k}\}</math> are in low-dimensional space, and the data sets reside in high-dimensional space, the procedure may be considered as ''multidimensional scaling''. This procedure was first described in [http://journals.iucr.org/d/issues/2017/04/00/rr5141/index.html Diederichs, Acta D (2017)]. We can also think of the procedure as ''unsupervised learning'', because it "learns" from the given CCs, and predicts the unknown CCs - or rather, the relations of even those data sets that have nothing (crystallography: no reflections; imaging: no pixels) in common.

== Properties of cc_analysis ==

It turns out that the dot product (also called scalar product, or inner product) of low-dimensional vectors (representing the high-dimensional data sets) is very appropriate for approximating the CCs. This is because a CC is itself a dot product (eqn. 3 of [http://journals.iucr.org/d/issues/2017/04/00/rr5141/index.html Diederichs, Acta D (2017)]).
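This identity is easy to check numerically (a small sketch with synthetic data, not code from the paper): the Pearson CC of two data sets equals the dot product of the mean-centred, unit-length versions of the two measurement vectors.

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.standard_normal(1000)                    # data set 1
b = 0.8 * a + 0.6 * rng.standard_normal(1000)    # correlated data set 2

cc = np.corrcoef(a, b)[0, 1]                     # Pearson CC

# the same number as a dot product of centred, normalized vectors
ah = a - a.mean(); ah /= np.linalg.norm(ah)
bh = b - b.mean(); bh /= np.linalg.norm(bh)
print(abs(cc - ah @ bh) < 1e-12)                 # prints: True
```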

It further turns out that the resulting arrangement of vectors (that minimizes the function <math>\Phi(\mathbf{x})</math> given above) has the following properties:

# 0 <= length <= 1 for each vector; short vectors have a low signal-to-noise ratio, long vectors a good signal-to-noise ratio, and vectors of length 1 represent ideal (prototypical) data sets. In fact, the length of each vector is CC* (as defined in [http://dx.doi.org/10.1126/science.1218231 Karplus & Diederichs (2012)]), and its relation to the signal-to-noise ratio is given in eqn. 4.
# vectors point in the same direction (i.e. they lie on a radial line) if their data sets only differ by random noise
# vectors representing systematically different (heterogeneous) data sets enclose an angle whose cosine is the CC of their respective ideal data sets (eqn. 5).
# if all CCs are known, the solution is unique in terms of lengths of vectors and angles between them. However, a rotated (around the origin) or inverted (through the origin) arrangement of the vectors leaves the functional unchanged, because these transformations do not change lengths and angles.
# as long as the problem is well-determined, the vectors can be calculated. Unknown CCs between data sets (e.g. in the case of crystallographic data sets that have no common reflections) can be estimated from the dot product of their vectors. Well-determination means: each data set has to be related (directly or indirectly, i.e. through others) to every other by at least as many CCs as the desired number of dimensions. A necessary condition for this is that each data set has at least as many relations to others (input lines to cc_analysis involving this data set) as the number of dimensions. It is of course better if more relations are specified!
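Several of these properties can be checked directly on the solution vectors of the example below (vectors 4 and 5, taken from the Fortran program's solution.dat):

```python
import numpy as np

# vectors 4 and 5 from solution.dat in the example below
x4 = np.array([0.9760, 0.0087])
x5 = np.array([0.8626, 0.1361])

len4 = np.linalg.norm(x4)            # vector length = CC* of data set 4
len5 = np.linalg.norm(x5)
assert 0.0 <= len4 <= 1.0 and 0.0 <= len5 <= 1.0   # property 1

# property 3: cosine of the enclosed angle = CC of the ideal data sets
cos_angle = (x4 @ x5) / (len4 * len5)

# property 5: a CC is estimated by the dot product of the vectors;
# here it reproduces the input CC between sets 4 and 5 (0.8432) closely
cc_est = x4 @ x5
print(round(float(cc_est), 4))       # prints: 0.8431
```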

== The Fortran program ==

<code>cc_analysis</code> calculates the vectors from the pairwise correlation coefficients. The (low) dimension must be specified, and a file with lines specifying the correlation coefficients must be provided.

<pre>
CC_ANALYSIS version 30.12.2018 (K. Diederichs). No redistribution please!
cc_analysis -dim <dim> [-b] [-w] [-z] <input.dat> <output.dat>
<input.dat> has lines with items: i j corr [ncorr]
-b option: <input.dat> is a binary file (4 bytes for each item)
-w option: calculate weights from number of correlated items (4th item on input line)
-z option: use Fisher z-transformation
-f option: skip some calculations (fast)
-m <iters> option: use <iters> (default 20) least-squares iterations
-t <threads> option: use <threads> (default 8) threads
</pre>

Notes:

* the number of vectors must be > 2*(low dimension). Typical number of dimensions is 2 or 3, but depending on the problem it could of course be much more.

Python code is available as [https://wiki.uni-konstanz.de/pub/cc_analysis.py] under GPL.

== Example ==

Suppose we have 5 objects (e.g. images or crystallographic data sets), and measure their correlation coefficients against each other. These must be written to a file (named cc.dat in this example), and given as input to cc_analysis:

<pre>
bash-4.2$ cat cc.dat
     1      2  0.0170
     1      3  0.0222
     1      4  0.0233
     1      5  0.0226
     2      3  0.7026
     2      4  0.7287
     2      5  0.6241
     3      4  0.9131
     3      5  0.8049
     4      5  0.8432
</pre>

Please note that the CCs are rounded to 4 significant digits. This introduces a bit of noise. In addition, the solution in this example is not overdetermined: 5 is just the minimum number of objects that can be represented in 2-dimensional space when given 10 unique correlation coefficients (the 10 CCs determine the 5×2 = 10 vector components).
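An input of this shape can be mimicked with synthetic data (purely illustrative; the noise levels below are made up and this is not how the example's numbers were produced): data set 1 is pure noise, while data sets 2-5 share a common signal with differing amounts of noise.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 20000                           # the "high" dimension (reflections/pixels)
signal = rng.standard_normal(m)

# data set 1: pure noise; sets 2-5: common signal plus noise
noise = [None, 1.0, 0.5, 0.4, 0.7]  # made-up noise levels
sets = [rng.standard_normal(m) if s is None
        else signal + s * rng.standard_normal(m) for s in noise]

# print pairwise CCs in the i j corr format expected by cc_analysis
for i in range(5):
    for j in range(i + 1, 5):
        cc = np.corrcoef(sets[i], sets[j])[0, 1]
        print(f"{i + 1:6d} {j + 1:6d}  {cc:7.4f}")
```

As in the example, the CCs of data set 1 against all others come out near zero, and the CCs among sets 2-5 reflect their signal-to-noise ratios.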

<pre>
bash-4.2$ cc_analysis -dim 2 cc.dat solution.dat
 CC_ANALYSIS version 29.10.2018 (K. Diederichs). No redistribution please!

Linux turn31.biologie.uni-konstanz.de 4.18.14-1.el7.elrepo.x86_64 #1 SMP Sat Oct 13 10:29:59 EDT 2018 x86_64 x86_64 x86_64 GNU/Linux
 OpenMP max and actual threads:  64  4
 reading i,j,CC[,ncc] from cc.dat
 reading the data:  1.9999999E-04 seconds
 memory to store data   :  1.2000000E-07 GB
 CCs read, # vectors, dim, avg:          10      5  2  0.47
 allocating   1.0000000E-07  GB
 checking the data:  3.9999999E-04 seconds
 connectivity:  0.0000000E+00 seconds
 starting values:  0.0000000E+00 seconds
 smallest eigenvalue (its absolute value is a measure of noise):  1.7127972E-02
 dim initial eigenvalue estimates, + up to 3 more:      3.1275     0.0171     0.0092    -0.0002
 dim improved eigenvalue estimates:   3.127476      1.7128032E-02
 Eigen-analysis:  1.7999999E-02 seconds
 iter   RMS     max_chg   rms_chg
  1   0.00334   0.00000   0.00000
  2   0.00300   0.02624   0.01034
  3   0.00239  -0.02112   0.00890
 19   0.00025  -0.00096   0.00032
 20   0.00024   0.00055   0.00029
 least-squares solution:  5.0000002E-04 seconds
 most negative CCobs-CCcalc:  -0.00040 at      1      4
 most positive CCobs-CCcalc:   0.00044 at      1      5
 avg, rms CCobs-CCcalc     :   0.00004   0.00025
 n, dim, trace=      5  2      3.1754
 eigenvalues after refining the first 2 :
 dim refined eigenvalues, + up to 3 more:     3.1324     0.0432     0.0007    -0.0001
 a posteriori Eigen-analysis:  9.9999997E-05 seconds
 writing #, cartesian coords, spherical coords =(length,angles) to solution.dat
 if first cartesian coord <0 then length is written as -length
 diagonal elements    : CCfin = 1.001*CCest -0.006 ; rc= 1.000
 off-diagonal elements: CCobs = 1.000*CCest + 0.000 ; rc= 1.000
 total time:  2.9899999E-02 seconds
bash-4.2$ cat solution.dat
     1  0.0242  0.0095  0.0260  0.3739
     2  0.7480 -0.1556  0.7641 -0.2051
     3  0.9357 -0.0171  0.9358 -0.0183
     4  0.9760  0.0087  0.9760  0.0089
     5  0.8626  0.1361  0.8733  0.1564
</pre>

The output is: <vector #> <x> <y> <length> <angle> for each vector in this 2-dimensional case (the angle is in radians); higher dimensions are analogous.
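The length and angle columns can be reproduced from the cartesian coordinates (a simple check; the radians assumption matches the numbers above):

```python
import math

# vector 5 from solution.dat: x = 0.8626, y = 0.1361
x, y = 0.8626, 0.1361
length = math.hypot(x, y)    # ~0.8733, matches column 4
angle = math.atan2(y, x)     # ~0.1565 rad, matches column 5 (0.1564);
                             # the tiny difference comes from the rounded x, y
print(round(length, 4), round(angle, 4))
```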

The Python code produces the following output:

<pre>
bash-4.2$ python /usr/local/bin/cc_analysis.py 2 cc.dat
===
Correlation matrix parsed from infile:
[[   nan 0.017  0.0222 0.0233 0.0226]
 [0.017     nan 0.7026 0.7287 0.6241]
 [0.0222 0.7026    nan 0.9131 0.8049]
 [0.0233 0.7287 0.9131    nan 0.8432]
 [0.0226 0.6241 0.8049 0.8432    nan]]
===
Correction factor for 2nd and higher eigenvalue(s):
0.8000
===
Interpretation of correlation matrix as dot product matrix:
---
all h_i by iterative approach:
initial values:
[0.1459 0.7198 0.7815 0.7919 0.7574]
refinement by iteration:
#13: [0.0242 0.7414 0.9389 0.9798 0.8551]
===
Uncorrected eigenvalue(s):
2 used:
[3.1228 0.0126]
3 unused:
[ 0.0045 -0.0004 -0.0167]
---
Corrected eigenvalue(s):
2 used:
[3.1228 0.0158]
iter      RMS  max_chg  rms_chg
   0  0.00345        -        -
   1  0.00241 -0.04680  0.00403
   2  0.00127  0.01783  0.00275
   3  0.00057  0.01297  0.00162
   4  0.00029  0.00570  0.00073
   5  0.00023  0.00182  0.00026
   6  0.00023 -0.00047  0.00008
   7  0.00022 -0.00042  0.00004
   8  0.00022 -0.00042  0.00004
   9  0.00022 -0.00045  0.00005
  10  0.00022 -0.00045  0.00005
  11  0.00022 -0.00045  0.00005
  12  0.00021 -0.00044  0.00005
  13  0.00021 -0.00043  0.00005
  14  0.00021 -0.00043  0.00005
  15  0.00021 -0.00042  0.00004
  16  0.00021 -0.00042  0.00004
  17  0.00020 -0.00042  0.00004
  18  0.00020 -0.00042  0.00004
  19  0.00020 -0.00042  0.00004
  20  0.00020 -0.00042  0.00004
     1  0.0241 -0.0098
     2  0.7484  0.1425
     3  0.9359  0.0150
     4  0.9758 -0.0112
     5  0.8624 -0.1501
===
Finished outputting 2-dimensional representative vectors! =)
</pre>

The 5 lines at the bottom give the solution. The coordinates agree with those of the Fortran program to within an rms deviation of 0.0055 after mirroring across the x axis (inverted solution) and rotating by half a degree.
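This agreement can be verified with an orthogonal Procrustes alignment (a sketch using only the vectors printed above; the rms is computed per vector here, and the exact value depends on that convention):

```python
import numpy as np

# solution vectors: Fortran (F) and Python (P), copied from the outputs above
F = np.array([[0.0242,  0.0095], [0.7480, -0.1556], [0.9357, -0.0171],
              [0.9760,  0.0087], [0.8626,  0.1361]])
P = np.array([[0.0241, -0.0098], [0.7484,  0.1425], [0.9359,  0.0150],
              [0.9758, -0.0112], [0.8624, -0.1501]])

# best orthogonal map (rotation and/or reflection) taking P onto F:
# the classic orthogonal Procrustes solution via SVD of the cross-covariance
U, _, Vt = np.linalg.svd(P.T @ F)
R = U @ Vt                       # R may include a reflection (det = -1)
rms = np.sqrt(np.sum((P @ R - F) ** 2) / len(F))
print(round(float(rms), 4))      # small, on the order of 0.005
```

Because lengths and angles (not absolute orientations) are determined, any comparison of two solutions must allow for such a rotation/inversion, as stated in property 4 above.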