R-factors: Difference between revisions

Revision as of 09:31, 29 May 2016

Definitions

Data quality indicators

In the following, all sums over hkl extend only over unique reflections with more than one observation!

R_sym and R_merge - the formula for both is:

[math]\displaystyle{ 
 R = \frac{\sum_{hkl} \sum_{j} \vert I_{hkl,j}-\langle I_{hkl}\rangle\vert}{\sum_{hkl} \sum_{j}I_{hkl,j}}
  }[/math]

where [math]\displaystyle{ \langle I_{hkl}\rangle }[/math] is the average of symmetry- (or Friedel-) related observations of a unique reflection.

It can be shown that this formula results in higher R-factors when the redundancy is higher (Diederichs and Karplus ^[1]). In other words, low-redundancy datasets appear better than high-redundancy ones, which obviously violates the intention of having an indicator of data quality!

Redundancy-independent version of the above:

[math]\displaystyle{ 
 R_{meas} = \frac{\sum_{hkl} \sqrt \frac{n}{n-1} \sum_{j=1}^{n} \vert I_{hkl,j}-\langle I_{hkl}\rangle\vert}{\sum_{hkl} \sum_{j}I_{hkl,j}}
  }[/math]

which unfortunately results in higher (but more realistic) numerical values than R_sym / R_merge (Diederichs and Karplus ^[1] , Weiss and Hilgenfeld ^[2]).
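As an illustration of the two formulas above, here is a minimal Python sketch. The function name `merging_r_factors` and the input layout (a dict mapping each unique hkl to the list of its observed intensities) are assumptions made for this example and do not correspond to any particular program.

```python
from math import sqrt


def merging_r_factors(observations):
    """Return (R_merge, R_meas) for a dict mapping hkl -> list of intensities."""
    num_merge = 0.0
    num_meas = 0.0
    denom = 0.0
    for intensities in observations.values():
        n = len(intensities)
        if n < 2:
            # Unique reflections with a single observation do not contribute.
            continue
        mean_i = sum(intensities) / n
        abs_dev = sum(abs(i - mean_i) for i in intensities)
        num_merge += abs_dev
        # R_meas weights each unique reflection by sqrt(n / (n - 1)).
        num_meas += sqrt(n / (n - 1)) * abs_dev
        denom += sum(intensities)
    return num_merge / denom, num_meas / denom


# Toy data (hypothetical values, for illustration only).
obs = {
    (1, 0, 0): [100.0, 110.0],
    (0, 1, 0): [50.0, 55.0, 45.0],
    (0, 0, 1): [20.0],  # single observation, ignored
}
r_merge, r_meas = merging_r_factors(obs)
print(f"R_merge = {r_merge:.3f}, R_meas = {r_meas:.3f}")
```

For this toy input R_merge is about 0.056 and R_meas about 0.073, in line with the statement that R_meas gives the higher numerical value.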

measuring precision of averaged intensities/amplitudes

for intensities use (Weiss ^[3])

[math]\displaystyle{ 
 R_{p.i.m.} = \frac{\sum_{hkl} \sqrt \frac{1}{n-1} \sum_{j=1}^{n} \vert I_{hkl,j}-\langle I_{hkl}\rangle\vert}{\sum_{hkl} \sum_{j}I_{hkl,j}}
  }[/math]

R_mrgd-I (defined in Diederichs and Karplus ^[1]) only differs by a factor (FIXME: what is the factor? 0.5 or 1.4142 or ?) since it likewise takes the improvement in precision from multiplicity into account. R_split , which is what the X-FEL community uses, is the same as R_mrgd-I but that community seems not to be aware of this.

Similarly, one should use R_mrgd-F as a quality indicator for amplitudes ^[1], which may be calculated as:

[math]\displaystyle{ 
 R_{mrgd-F} = \frac{\sum_{hkl} \sqrt \frac{1}{n-1} \sum_{j=1}^{n} \vert F_{hkl,j}-\langle F_{hkl}\rangle\vert}{\sum_{hkl} \sum_{j}F_{hkl,j}}
  }[/math]

with [math]\displaystyle{ \langle F_{hkl}\rangle }[/math] defined analogously to [math]\displaystyle{ \langle I_{hkl}\rangle }[/math].

In the sums above, the summation omits those reflections with just one observation.
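The R_p.i.m. formula can be sketched in Python as follows. The function name `r_pim` and the dict-of-lists input are assumptions for this example. Passing amplitudes instead of intensities evaluates the R_mrgd-F expression as written above.

```python
from math import sqrt


def r_pim(observations):
    """Precision-indicating merging R-factor for a dict hkl -> list of values."""
    num = 0.0
    denom = 0.0
    for values in observations.values():
        n = len(values)
        if n < 2:
            continue  # reflections with just one observation are omitted
        mean_v = sum(values) / n
        # The factor sqrt(1 / (n - 1)) rewards higher multiplicity.
        num += sqrt(1.0 / (n - 1)) * sum(abs(v - mean_v) for v in values)
        denom += sum(values)
    return num / denom


# Same scatter around the mean, measured twice and eight times
# (hypothetical numbers, for illustration only).
two_obs = {(1, 0, 0): [90.0, 110.0]}
eight_obs = {(1, 0, 0): [90.0, 110.0] * 4}
print(f"n=2: {r_pim(two_obs):.3f}   n=8: {r_pim(eight_obs):.3f}")
```

With the same scatter of individual observations, the value drops from 0.100 at n=2 to about 0.038 at n=8, reflecting the improved precision of the averaged intensity.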

measuring radiation damage

We can plot (Diederichs ^[4])

[math]\displaystyle{ 
 R_{d} = \frac{\sum_{hkl} \sum_{|i-j|=d} \vert I_{hkl,i} - I_{hkl,j}\vert}{\sum_{hkl} \sum_{|i-j|=d} (I_{hkl,i} + I_{hkl,j})/2}
  }[/math]

which gives us the average R-factor of two reflections measured d frames apart. As long as the plot is parallel to the x axis there is no radiation damage. As soon as the plot starts to rise, we see that there is a systematic error contribution due to radiation damage.

Strong wiggles at very high d are irrelevant as only few reflections contribute.
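A minimal Python sketch of the R_d calculation is given below. It is not the XDSSTAT implementation. The function name `r_d_curve` and the input layout (each unique hkl mapped to a list of (frame number, intensity) pairs) are assumptions for this example.

```python
from collections import defaultdict
from itertools import combinations


def r_d_curve(observations):
    """Return {d: R_d} for a dict hkl -> list of (frame, intensity) pairs."""
    num = defaultdict(float)
    denom = defaultdict(float)
    for obs in observations.values():
        # Every pair of observations of the same unique reflection contributes
        # to the bin given by its frame difference d.
        for (frame_a, int_a), (frame_b, int_b) in combinations(obs, 2):
            d = abs(frame_a - frame_b)
            num[d] += abs(int_a - int_b)
            denom[d] += (int_a + int_b) / 2.0
    return {d: num[d] / denom[d] for d in sorted(num) if denom[d] != 0.0}


# Toy data (hypothetical values): intensities decay with frame number.
obs = {
    (1, 0, 0): [(1, 100.0), (11, 98.0), (21, 90.0)],
    (0, 1, 0): [(5, 50.0), (15, 49.0)],
}
curve = r_d_curve(obs)
for d, value in curve.items():
    print(d, round(value, 4))
```

In this toy example R_d rises from about 0.045 at d=10 to about 0.105 at d=20, which is the kind of rise that would indicate radiation damage in a real plot.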

To my knowledge, the only program that implements this currently (December 2008) is XDSSTAT.

Comparing two sets of structure factor amplitudes or intensities

The following is symmetric, and suitable for comparing two data sets, or two model amplitudes:

[math]\displaystyle{ 
 R_{scale}=\frac{\sum_{hkl}\vert F_{hkl,i}-F_{hkl,j}\vert}{0.5\sum_{hkl} \left(F_{hkl,i}+F_{hkl,j}\right)}
  }[/math]

for amplitudes, and analogously for intensities.
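A short Python sketch of R_scale follows. The function name `r_scale` and the dict-based input are assumptions for this example, and the two data sets are assumed to be on a common scale already. Only reflections present in both sets are compared.

```python
def r_scale(f_a, f_b):
    """Symmetric R-factor between two dicts mapping hkl -> amplitude."""
    common = f_a.keys() & f_b.keys()  # reflections present in both sets
    num = sum(abs(f_a[hkl] - f_b[hkl]) for hkl in common)
    denom = 0.5 * sum(f_a[hkl] + f_b[hkl] for hkl in common)
    return num / denom


# Toy data (hypothetical values, for illustration only).
set_a = {(1, 0, 0): 10.0, (0, 1, 0): 20.0, (0, 0, 1): 5.0}
set_b = {(1, 0, 0): 12.0, (0, 1, 0): 18.0}
print(round(r_scale(set_a, set_b), 4), round(r_scale(set_b, set_a), 4))
```

Swapping the two arguments gives the same value, which is what makes this indicator symmetric.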

Model quality indicators

R and R_free : the formula for both is

[math]\displaystyle{ 
 R=\frac{\sum_{hkl}\vert F_{hkl}^{obs}-F_{hkl}^{calc}\vert}{\sum_{hkl} F_{hkl}^{obs}}
  }[/math]

where [math]\displaystyle{ F_{hkl}^{obs} }[/math] and [math]\displaystyle{ F_{hkl}^{calc} }[/math] have to be scaled w.r.t. each other. R and R_free differ in the set of reflections they are calculated from: R is calculated for the working set, whereas R_free is calculated for the test set.
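The following Python sketch shows how R and R_free come from the same formula applied to different reflection sets. The function name `r_work_and_free` is hypothetical. Scaling F_calc with a single overall factor (sum of F_obs divided by sum of F_calc over all reflections) is a simplifying assumption for this example; refinement programs use more elaborate scaling.

```python
def r_work_and_free(f_obs, f_calc, test_flags):
    """Return (R, R_free); test_flags[i] is True for test-set reflections."""
    # Assumption: one overall scale factor brings F_calc onto the F_obs scale.
    scale = sum(f_obs) / sum(f_calc)

    def residual(want_test):
        rows = [
            (fo, fc)
            for fo, fc, is_test in zip(f_obs, f_calc, test_flags)
            if is_test == want_test
        ]
        num = sum(abs(fo - scale * fc) for fo, fc in rows)
        return num / sum(fo for fo, _ in rows)

    return residual(False), residual(True)


# Toy data (hypothetical values, for illustration only).
f_obs = [100.0, 80.0, 60.0, 40.0]
f_calc = [45.0, 42.0, 30.0, 23.0]
test_flags = [False, False, False, True]
r_work, r_free = r_work_and_free(f_obs, f_calc, test_flags)
print(f"R = {r_work:.3f}, R_free = {r_free:.3f}")
```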

Relation between R and R_free as a function of resolution

The PDBe provides a service to plot many different statistical properties in the PDB against other properties. The link is http://www.ebi.ac.uk/pdbe-as/pdbestatistics/PDBeStatistics.jsp . One of the options is RDiff, the difference between R and R_free for all structures that report both values. Take a look at this first. A second parameter can be set to resolution, which produces the plot of interest: a 3D isometric plot which you can scale, and in which you can pick data points to view particular entries.
formula from Kleywegt and Jones (2002): R_free = 1.065*R + 0.036
plot with empirical data: http://xray.bmc.uu.se/gerard/supmat/rfree2000/rfminusr_vs_resolution.gif
many more plots: http://xray.bmc.uu.se/gerard/supmat/rfree2000
harry plotter (java): http://xray.bmc.uu.se/gerard/supmat/rfree2000/plotter.html
When the resolution is plotted on a logarithmic scale, the most frequent values (modes) are practically linear functions allowing their easy interpolation / extrapolation as (Urzhumtsev et al, 2009)

       mode(R) = 0.091*ln(resolution) + 0.134
       mode(Rfree-R)   = 0.024*ln(resolution) + 0.020
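The empirical relations quoted above are easy to evaluate. The Python sketch below simply encodes them. The function names are made up for this example, and the resolution is assumed to be given in Ångström.

```python
from math import log


def r_free_from_r(r):
    """Kleywegt and Jones (2002): expected R_free for a given R."""
    return 1.065 * r + 0.036


def mode_r(resolution):
    """Urzhumtsev et al. (2009): most frequent R at a given resolution."""
    return 0.091 * log(resolution) + 0.134


def mode_r_free_minus_r(resolution):
    """Urzhumtsev et al. (2009): most frequent R_free - R at a given resolution."""
    return 0.024 * log(resolution) + 0.020


print(f"R = 0.20 -> expected R_free = {r_free_from_r(0.20):.3f}")
for res in (1.0, 2.0, 3.0):
    print(res, round(mode_r(res), 3), round(mode_r_free_minus_r(res), 3))
```

For R = 0.20 the Kleywegt and Jones formula gives an expected R_free of 0.249. At a resolution of 1.0 the logarithm vanishes, so the two modes reduce to their constant terms, 0.134 and 0.020.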

References:

Tickle IJ, Laskowski RA and Moss DS. Rfree and the Rfree Ratio. I. Derivation of Expected Values of Cross-Validation Residuals Used in Macromolecular Least-Squares Refinement. Acta Cryst. (1998). D54, 547-557

Tickle IJ, Laskowski RA and Moss DS. Rfree and the Rfree ratio. II. Calculation of the expected values and variances of cross-validation statistics in macromolecular least-squares refinement. Acta Cryst. (2000). D56, 442-450

GJ Kleywegt and TA Jones (2002). Homo Crystallographicus - Quo vadis? Structure 10, 465-472. (reprint from http://xray.bmc.uu.se/cgi-bin/gerard/reprint_mailer.pl?pref=65)

Urzhumtsev, Afonine & Adams (2009) Acta Cryst., D65, 1283-1291.

what kinds of problems exist with these indicators?

R_sym / R_merge should not be used to judge data quality; R_meas should be used instead. The reason is that the former depend on multiplicity, whereas the latter does not.

R/R_free and NCS: reflections in work and test set are not independent if chosen randomly. It is better to choose the test set reflections in thin resolution shells. Since the twin related reflections have the same sin(theta)/lambda values they will not be split over the working and reference sets. DATAMAN from the Uppsala Software Factory and XPREP (a program which may be obtained from Bruker) offer this option. The "RFREE SHELL" command in sftools is another way to select thin shells. A disadvantage is that the maps may not be quite as good as when the free R reflections are selected randomly. (FIXME: which Phenix program does this?). A paper investigating this thoroughly is Fabiola, F., A. Korostelev, et al. (2006). "Bias in cross-validated free R factors: mitigation of the effects of non-crystallographic symmetry." Acta Cryst. D 62: 227-38.
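The idea of selecting the test set in thin resolution shells can be sketched in Python as below. This is a hypothetical scheme written for illustration only; it is not the algorithm used by DATAMAN, XPREP or sftools. The function name, the parameters `n_shells` and `test_every`, and the choice of shells of equal reciprocal-space volume are all assumptions.

```python
def thin_shell_test_flags(inv_d, n_shells=100, test_every=20):
    """Flag whole thin shells as test set; inv_d holds 1/d for each reflection."""
    # Shells of equal width in (1/d)^3 have roughly equal reciprocal-space volume.
    s_cubed = [s ** 3 for s in inv_d]
    lowest, highest = min(s_cubed), max(s_cubed)
    width = (highest - lowest) / n_shells or 1.0  # guard against zero width
    flags = []
    for value in s_cubed:
        shell = min(int((value - lowest) / width), n_shells - 1)
        # Every test_every-th shell goes to the test set as a whole, so
        # reflections with identical 1/d always receive the same flag.
        flags.append(shell % test_every == test_every // 2)
    return flags


# Toy data (hypothetical): every 1/d value occurs twice, mimicking related
# reflections at the same resolution.
base = [0.05 + 0.001 * k for k in range(400)]
inv_d = base + base
flags = thin_shell_test_flags(inv_d)
print(sum(flags), "of", len(flags), "reflections flagged as test set")
```

Because the shell is derived from the value of 1/d rather than from the position in a sorted list, two reflections at exactly the same resolution cannot end up on different sides of the work/test split.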

Sets of reflections used for calculating R_free should be maintained throughout a project. This is nicely discussed at http://www.bmsc.washington.edu/people/merritt/xplor/rfree_example.html . Note that none of the programs mentioned for selecting thin shells will allow you to extend the set of shells to higher resolution if you want to preserve your existing R-free set.

R-values and twinning: Garib N. Murshudov (2011) "Some properties of crystallographic reliability index - Rfactor: effect of twinning" Appl. Comput. Math., V.10, N.2, 2011, pp.250-261. From the paper, the R-value table for random models is:

     twinning  twinning not
     modelled  modelled
twin   0.41      0.49
normal 0.52      0.58

Another paper which investigates the properties of R-values in the presence of twinning is P. R. Evans and G. N. Murshudov (2013) "How good are my data and what is the resolution?" Acta Cryst. (2013). D69, 1204-1214. As the title indicates, this paper discusses at what resolution the data should be cut. One important finding is that a perfect model gives an R value of 42.0% (for a perfect twin, 29.1%) against pure noise. This tells us that a model that gives significantly lower R_free in the (current) high resolution shell may benefit from including higher resolution data.

R-values and pseudo-translation: if you have pseudotranslation you should be aware that if you solve the structure by molecular replacement, starting R factors could be 70-80%.

data R-values are not meaningful at high resolution. This is discussed by Karplus and Diederichs (2012) "Linking crystallographic data and model quality". Science 336, 1030

Notes

[1] K. Diederichs and P.A. Karplus (1997). Improved R-factors for diffraction data analysis in macromolecular crystallography. Nature Struct. Biol. 4, 269-275.
[2] M.S. Weiss and R. Hilgenfeld (1997). On the use of the merging R-factor as a quality indicator for X-ray data. J. Appl. Cryst. 30, 203-205.
[3] M.S. Weiss (2001). Global indicators of X-ray data quality. J. Appl. Cryst. 34, 130-135.
[4] K. Diederichs (2006). Some aspects of quantitative analysis and correction of radiation damage. Acta Cryst. D62, 96-101.

