R-factors: Difference between revisions

3,187 bytes added ,  26 January 2018
 
(21 intermediate revisions by 2 users not shown)
In the following, all sums over hkl extend only over unique reflections with more than one observation!
* R<sub>sym</sub> and R<sub>merge</sub> - the formula for both is:
: <math>
R = \frac{\sum_{hkl} \sum_{j} \vert I_{hkl,j}-\langle I_{hkl}\rangle\vert}{\sum_{hkl} \sum_{j}I_{hkl,j}}
</math>

where <math>\langle I_{hkl}\rangle</math> is the average of symmetry- (or Friedel-) related observations of a unique reflection. The formula is due to Arndt, U.W., Crowther, R.A. & Mallet, J.F.W. A computer-linked cathode ray tube microdensitometer for X-ray crystallography. J. Phys. E: Sci. Instr. 1, 510-516 (1968). Any unique reflection with n=2 or more observations enters the sums.


It can be shown that this formula results in higher R-factors when the redundancy is higher (Diederichs and Karplus <ref name="DiKa97">K. Diederichs and P.A. Karplus (1997). Improved R-factors for diffraction data analysis in macromolecular crystallography. Nature Struct. Biol. 4, 269-275 [http://strucbio.biologie.uni-konstanz.de/strucbio/files/nsb-1997.pdf]</ref>). In other words, low-redundancy datasets appear better than high-redundancy ones, which defeats the purpose of an indicator of data quality!
* Redundancy-independent version of the above:
: <math>
R_{meas} = \frac{\sum_{hkl} \sqrt{\frac{n}{n-1}} \sum_{j=1}^{n} \vert I_{hkl,j}-\langle I_{hkl}\rangle\vert}{\sum_{hkl} \sum_{j}I_{hkl,j}}
</math>
 
 
which unfortunately results in higher (but more realistic) numerical values than R<sub>sym</sub> / R<sub>merge</sub>
(Diederichs and Karplus <ref name="DiKa97"/>,
Weiss and Hilgenfeld <ref name="WeHi97">M.S. Weiss and R. Hilgenfeld (1997) On the use of the merging R-factor as a quality indicator for X-ray data. J. Appl. Crystallogr. 30, 203-205 [http://dx.doi.org/10.1107/S0021889897003907]</ref>).


==== measuring precision of averaged intensities/amplitudes ====


for intensities use
(Weiss <ref name="We01">M.S. Weiss. Global indicators of X-ray data quality. J. Appl. Cryst. (2001). 34, 130-135 [http://dx.doi.org/10.1107/S0021889800018227]</ref>)
: <math>
R_{p.i.m.} = \frac{\sum_{hkl} \sqrt{\frac{1}{n-1}} \sum_{j=1}^{n} \vert I_{hkl,j}-\langle I_{hkl}\rangle\vert}{\sum_{hkl} \sum_{j}I_{hkl,j}}
</math>

R<sub>mrgd-I</sub> (defined in Diederichs and Karplus <ref name="DiKa97"/>) differs from this only by a factor (FIXME: what is the factor? 0.5 or 1.4142 or ?), since it likewise takes the improvement in precision from multiplicity into account. R<sub>split</sub>, which is what the X-FEL community uses, is the same as R<sub>mrgd-I</sub>, but that community seems not to be aware of this.
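
The three merging statistics defined so far share the same sums and differ only in the per-reflection weight (1, sqrt(n/(n-1)) or sqrt(1/(n-1))). The following is a minimal Python sketch of that, not code taken from any data processing program. The function name <code>merging_r_factors</code> and the toy data are made up for illustration, and the hkl indices are assumed to be already reduced to the asymmetric unit so that symmetry mates share one key.

```python
from collections import defaultdict
from math import sqrt


def merging_r_factors(observations):
    """Return (R_merge, R_meas, R_pim) for (hkl, intensity) observations.

    Unique reflections with fewer than two observations are skipped,
    as required for all sums in this article.
    """
    groups = defaultdict(list)
    for hkl, intensity in observations:
        groups[hkl].append(intensity)

    num_merge = num_meas = num_pim = denom = 0.0
    for intensities in groups.values():
        n = len(intensities)
        if n < 2:
            continue
        mean = sum(intensities) / n
        abs_dev = sum(abs(i - mean) for i in intensities)
        num_merge += abs_dev                      # weight 1
        num_meas += sqrt(n / (n - 1)) * abs_dev   # weight sqrt(n/(n-1))
        num_pim += sqrt(1 / (n - 1)) * abs_dev    # weight sqrt(1/(n-1))
        denom += sum(intensities)
    return num_merge / denom, num_meas / denom, num_pim / denom


# Hypothetical toy data (assumption: hkl already mapped to the asymmetric unit).
obs = [
    ((1, 0, 0), 90.0), ((1, 0, 0), 110.0),
    ((1, 1, 0), 40.0), ((1, 1, 0), 50.0), ((1, 1, 0), 60.0),
    ((2, 0, 0), 75.0),  # single observation, does not enter the sums
]
r_merge, r_meas, r_pim = merging_r_factors(obs)
print(f"R_merge={r_merge:.4f}  R_meas={r_meas:.4f}  R_pim={r_pim:.4f}")
```

For any data set with at least one multiply-measured reflection, the weights guarantee R<sub>p.i.m.</sub> ≤ R<sub>merge</sub> < R<sub>meas</sub>, which is the ordering described in the text.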
      
      
Similarly, one should use R<sub>mrgd-F</sub> as a quality indicator for amplitudes <ref name="DiKa97"/>, which may be calculated as:

: <math>
R_{mrgd-F} = \frac{\sum_{hkl} \sqrt{\frac{1}{n-1}} \sum_{j=1}^{n} \vert F_{hkl,j}-\langle F_{hkl}\rangle\vert}{\sum_{hkl} \sum_{j}F_{hkl,j}}
</math>

with <math>\langle F_{hkl}\rangle</math> defined analogously to <math>\langle I_{hkl}\rangle</math>.


We can plot (Diederichs <ref name="Di06">K. Diederichs (2006). Some aspects of quantitative analysis and correction of radiation damage. Acta Cryst D62, 96-101 [http://strucbio.biologie.uni-konstanz.de/strucbio/files/Diederichs_ActaD62_96.pdf]</ref>)


: <math>
R_{d} = \frac{\sum_{hkl} \sum_{|i-j|=d} \vert I_{hkl,i} - I_{hkl,j}\vert}{\sum_{hkl} \sum_{|i-j|=d} (I_{hkl,i} + I_{hkl,j})/2}
</math>


which gives us the average R-factor between two observations of a reflection measured d frames apart. As long as the plot runs parallel to the x axis there is no radiation damage. As soon as the plot starts to rise, we see a systematic error contribution due to radiation damage.
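
To make the pairing in the R<sub>d</sub> formula concrete, here is a small Python sketch. It is only an illustration of the formula as written above and makes no claim about how any existing program computes it. The function name <code>r_d</code>, the tuple layout (hkl, frame number, intensity) and the toy numbers are assumptions.

```python
from collections import defaultdict
from itertools import combinations


def r_d(observations):
    """Map frame difference d -> R_d for (hkl, frame, intensity) tuples."""
    by_hkl = defaultdict(list)
    for hkl, frame, intensity in observations:
        by_hkl[hkl].append((frame, intensity))

    numerator = defaultdict(float)
    denominator = defaultdict(float)
    for obs in by_hkl.values():
        # Every pair of observations of the same unique reflection
        # contributes to the bin of its frame difference d = |i - j|.
        for (f1, i1), (f2, i2) in combinations(obs, 2):
            d = abs(f1 - f2)
            numerator[d] += abs(i1 - i2)
            denominator[d] += (i1 + i2) / 2
    return {d: numerator[d] / denominator[d] for d in sorted(numerator)}


# Hypothetical toy data in which intensities decay with frame number.
toy = [
    ((1, 0, 0), 1, 100.0), ((1, 0, 0), 11, 96.0), ((1, 0, 0), 91, 80.0),
    ((1, 1, 0), 5, 50.0), ((1, 1, 0), 15, 49.0),
]
result = r_d(toy)
for d, value in result.items():
    print(d, round(value, 4))
```

In this toy set the value of R<sub>d</sub> grows with d, which is the rising curve the text associates with radiation damage. With real data one would usually bin d to get enough pairs per point.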


To my knowledge, the only program that implements this currently (December 2008) is [[xds:XDSSTAT|XDSSTAT]].
=== Comparing two sets of structure factor amplitudes or intensities ===
The following is symmetric, and suitable for comparing two data sets, or two sets of model amplitudes:
: <math>
R_{scale}=\frac{\sum_{hkl}\vert F_{hkl,i}-F_{hkl,j}\vert}{0.5\sum_{hkl} (F_{hkl,i}+F_{hkl,j})}
</math>
for amplitudes, and analogously for intensities.
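
A short Python sketch of R<sub>scale</sub>, reading the denominator as half the sum of both amplitudes over all common reflections. The function name <code>r_scale</code> and the numbers are illustrative assumptions, and the two lists are assumed to be already on a common scale and matched reflection by reflection.

```python
def r_scale(f_i, f_j):
    """Symmetric R-factor between two matched lists of amplitudes."""
    if len(f_i) != len(f_j):
        raise ValueError("both data sets must contain the same reflections")
    numerator = sum(abs(a - b) for a, b in zip(f_i, f_j))
    denominator = 0.5 * sum(a + b for a, b in zip(f_i, f_j))
    return numerator / denominator


# Hypothetical amplitudes for three common reflections.
set_a = [100.0, 50.0, 20.0]
set_b = [90.0, 55.0, 25.0]
value = r_scale(set_a, set_b)
print(round(value, 4))
```

Swapping the two arguments gives the same value, which is what "symmetric" means here, in contrast to the model R-factor, where only the observed amplitudes appear in the denominator.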


=== Model quality indicators ===
* R and [[iucr:Free_R_factor|R<sub>free</sub>]] : the formula for both is
: <math>
R=\frac{\sum_{hkl}\vert F_{hkl}^{obs}-F_{hkl}^{calc}\vert}{\sum_{hkl} F_{hkl}^{obs}}
</math>

where <math>F_{hkl}^{obs}</math> and <math>F_{hkl}^{calc}</math> have to be scaled with respect to each other. R and R<sub>free</sub> differ in the set of reflections they are calculated from: R is calculated for the [[working set]], whereas R<sub>free</sub> is calculated for the [[test set]].
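
As an illustration, the following Python sketch evaluates the same formula on the working set and on the test set. The scaling is a deliberate simplification: a single overall factor k, taken from the working set as sum(F<sup>obs</sup>)/sum(F<sup>calc</sup>), is applied to F<sup>calc</sup>. Refinement programs use more elaborate scaling, so this is an assumption made only to keep the example short. The function names and the data are hypothetical.

```python
def r_value(pairs, k):
    """R for (f_obs, f_calc) pairs, with f_calc multiplied by scale k."""
    numerator = sum(abs(f_obs - k * f_calc) for f_obs, f_calc in pairs)
    denominator = sum(f_obs for f_obs, _ in pairs)
    return numerator / denominator


def r_work_and_free(reflections):
    """Return (R, R_free) for (f_obs, f_calc, is_test) tuples."""
    work = [(o, c) for o, c, is_test in reflections if not is_test]
    test = [(o, c) for o, c, is_test in reflections if is_test]
    # Assumed simple overall scale, derived from the working set only.
    k = sum(o for o, _ in work) / sum(c for _, c in work)
    return r_value(work, k), r_value(test, k)


# Hypothetical reflections: (F_obs, F_calc, belongs to test set?)
refl = [
    (100.0, 48.0, False), (60.0, 32.0, False), (40.0, 20.0, False),
    (80.0, 45.0, True), (20.0, 9.0, True),
]
r_work, r_free = r_work_and_free(refl)
print(f"R={r_work:.3f}  R_free={r_free:.3f}")
```

The only difference between the two numbers is which reflections enter the sums. The scale factor is taken from the working set so that the test set stays untouched by anything fitted to the data.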
== what do R-factors try to measure, and how to interpret their values? ==
* relative deviation of
=== Data quality ===
* typical values: ...
=== Model quality ===


==== Relation between R and R<sub>free</sub> as a function of resolution ====


* The PDBe provides a service for plotting statistical properties of PDB entries against one another: http://www.ebi.ac.uk/pdbe-as/pdbestatistics/PDBeStatistics.jsp . One of the options is RDiff, the difference between R and R<sub>free</sub> for all structures that report both; take a look at this first. Setting the second parameter to resolution produces a 3D isometric plot, which you can scale and in which you can pick data points to view particular entries.
* formula from Kleywegt and Jones (2002): R<sub>free</sub> = 1.065*R + 0.036
* plot with empirical data: http://xray.bmc.uu.se/gerard/supmat/rfree2000/rfminusr_vs_resolution.gif
* many more plots: http://xray.bmc.uu.se/gerard/supmat/rfree2000
* harry plotter (java): http://xray.bmc.uu.se/gerard/supmat/rfree2000/plotter.html
* When the resolution is plotted on a logarithmic scale, the most frequent values (modes) are practically linear functions of it, which allows easy interpolation / extrapolation (Urzhumtsev et al., 2009):
        mode(R) = 0.091*ln(resolution) + 0.134
        mode(Rfree-R)  = 0.024*ln(resolution) + 0.020
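
The empirical relations listed above are easy to evaluate. The sketch below simply codes the formulas as quoted; the function names are made up, and the resolution is assumed to be given in Å with ln being the natural logarithm.

```python
from math import log


def rfree_from_r(r):
    """Kleywegt and Jones (2002): R_free = 1.065*R + 0.036."""
    return 1.065 * r + 0.036


def mode_r(resolution):
    """Urzhumtsev et al. (2009): most frequent R at a given resolution."""
    return 0.091 * log(resolution) + 0.134


def mode_rfree_minus_r(resolution):
    """Urzhumtsev et al. (2009): most frequent R_free - R at a given resolution."""
    return 0.024 * log(resolution) + 0.020


# Assumption: resolution in Angstrom.
for res in (1.0, 2.0, 3.0):
    print(res, round(mode_r(res), 3), round(mode_rfree_minus_r(res), 3))
print(round(rfree_from_r(0.20), 3))
```

These are descriptions of what is typically found in the PDB, so a value far from them is a reason to look more closely at a structure, not a verdict on it.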


References:

* GJ Kleywegt and TA Jones (2002). Homo Crystallographicus - Quo vadis? Structure 10, 465-472. (reprint from http://xray.bmc.uu.se/cgi-bin/gerard/reprint_mailer.pl?pref=65)

* Urzhumtsev, Afonine & Adams (2009) Acta Cryst., D65, 1283-1291.


== what kinds of problems exist with these indicators? ==


* Sets of reflections used for calculating R<sub>free</sub> should be maintained throughout a project. This is nicely discussed at http://www.bmsc.washington.edu/people/merritt/xplor/rfree_example.html . Note that none of the programs mentioned for selecting thin shells will allow you to extend the set of shells to higher resolution if you want to preserve your existing R-free set.
* R-values and twinning: [http://www.ysbl.york.ac.uk/refmac/papers/Rfactor.pdf Garib N. Murshudov (2011) "Some properties of crystallographic reliability index - Rfactor: effect of twinning" Appl. Comput. Math., V.10, N.2, 2011, pp.250-261]. From the paper, the R-value table for random models is:
         twinning   twinning not
         modelled   modelled
  twin   0.41       0.49
  normal 0.52       0.58
Another paper which investigates the properties of R-values in the presence of twinning is [http://journals.iucr.org/d/issues/2013/07/00/ba5190/index.html P. R. Evans and G. N. Murshudov (2013) "How good are my data and what is the resolution?" Acta Cryst. (2013). D69, 1204-1214]. As the title indicates, this paper discusses at what resolution the data should be cut. One important finding is that a perfect model gives an R value of 42.0% (for a perfect twin, 29.1%) against pure noise. Consequently, if R<sub>free</sub> in the (current) highest resolution shell is significantly below this value, the model may benefit from including higher resolution data.
* R-values and [[pseudo-translation]]: if you have pseudotranslation you should be aware that if you solve the structure by molecular replacement, starting R factors could be 70-80%.
* data R-values are not meaningful at high resolution. This is discussed by [http://strucbio.biologie.uni-konstanz.de//strucbio/files/karplus2012_science.pdf Karplus and Diederichs (2012) "Linking crystallographic data and model quality". ''Science'' '''336''', 1030]


==Notes==
<references/>