Test set: Difference between revisions

@@ Line 4: / Line 4: @@
 The need to find a good compromise for the size of the test set has been discussed by Axel Brunger in a "Methods in Enzymology" (1997) paper. He writes:
-  In all test calculations to date, the free R value has
+  In all test calculations to date, the free R value has been highly correlated with the phase accuracy of the atomic model. In
- been highly correlated with the phase accuracy of the atomic model. In
+  practice, about 5-10% of the observed diffraction data (chosen at random from the unique reflections) become sequestered in the
-  practice, about 5-10% of the observed diffraction data (chosen at random
+ test set. The size of the test set is a compromise between the desire to minimize statistical fluctuations of the free R value
- from the unique reflections) become sequestered in the test set. The size
+ and the need to avoid a deleterious effect on the atomic model by omission of too much experimental data.
- of the test set is a compromise between the desire to minimize statistical
- fluctuations of the free R value and the need to avoid a deleterious effect
- on the atomic model by omission of too much experimental data.
 ==How precise is the estimate of Rfree for a certain number of test set reflections?==
@@ Line 19: / Line 16: @@
 ==How many reflections do you need to get a good estimate of the sigmaA values (as a function of resolution) needed to calibrate the likelihood target?==
-Randy Read's rule of thumb is this:  "My impression is that you gain relatively little by adding more reflections, once you have a total of about 1000 or at most 2000 in the cross-validation set.  However, giving up more than 10% of the data is probably a bad idea, even if the sigmaA estimates are somewhat less accurate.  I've had reasonable results refining against data sets of 3000-5000 reflections, setting aside only 10% (i.e. 300-500 reflections) for cross-validation.
+Randy Read's rule of thumb is this:
+  My impression is that you gain relatively little by adding more reflections, once you have a total of about 1000 or at most 2000
+ in the cross-validation set.  However, giving up more than 10% of the data is probably a bad idea, even if the sigmaA estimates
+ are somewhat less accurate.  I've had reasonable results refining against data sets of 3000-5000 reflections, setting aside
+ only 10% (i.e. 300-500 reflections) for cross-validation.
-So here's the recipe I would use, for what it's worth:
+ So here's the recipe I would use, for what it's worth:
     <10000 reflections:        set aside 10%
     10000-20000 reflections:  set aside 1000 reflections
@@ Line 27: / Line 28: @@
     >40000 reflections:        set aside 2000 reflections
-I'm sure that with a bit of thought someone could come up with a smooth function that achieves something similar, but it seems adequate."
+ I'm sure that with a bit of thought someone could come up with a smooth function that achieves something similar, but it seems
+ adequate.
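The stepwise recipe quoted above can be written as a small function. This is only a sketch: the name `free_set_size` is hypothetical, and the band between 20000 and 40000 reflections is not shown in this excerpt of the diff. The 5% used for that band is an assumption, chosen because it joins the two neighbouring rules without a jump (1000 at 20000 reflections, 2000 at 40000).

```python
def free_set_size(n_reflections: int) -> int:
    """Suggested number of cross-validation (free) reflections.

    Follows the stepwise recipe quoted above. The 20000-40000 band is not
    shown in this excerpt, so the 5% used there is an ASSUMPTION.
    """
    if n_reflections < 0:
        raise ValueError("n_reflections must be non-negative")
    if n_reflections < 10_000:
        return n_reflections // 10   # set aside 10%
    if n_reflections <= 20_000:
        return 1000                  # fixed 1000 reflections
    if n_reflections <= 40_000:
        return n_reflections // 20   # ASSUMED 5% (band elided in this excerpt)
    return 2000                      # fixed 2000 reflections


if __name__ == "__main__":
    for n in (3000, 5000, 15_000, 30_000, 80_000):
        print(n, free_set_size(n))
```

For the data set sizes mentioned in the quote, 3000 and 5000 reflections, this gives 300 and 500 free reflections, matching the "300-500" figure.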
@@ Line 33: / Line 35: @@
 K.Cowtan (2005) J. Appl. Cryst. 38, 193-198. Likelihood weighting of partial structure factors using spline coefficients
 http://journals.iucr.org/j/issues/2005/01/00/zm5022/zm5022.pdf
-Kevin Cowtan summarizes: the result is that the number of reflections required varies with the level of error in the model (i.e. with sigmaA). For refinement close to convergence, one could use about 250 free reflections per sigmaA parameter (so 1500 would probably do). However when dealing with a very poor initial model, or, for example, when using sigmaA in a density modification calculation, then it may be necessary to use all the reflections.
+Kevin Cowtan summarizes:
+ the result is that the number of reflections required varies with the level of error in the model (i.e. with sigmaA). For
+ refinement close to convergence, one could use about 250 free reflections per sigmaA parameter (so 1500 would probably do).
+ However when dealing with a very poor initial model, or, for example, when using sigmaA in a density modification calculation,
+ then it may be necessary to use all the reflections.
 This agrees with Randy's explanation:
-"In case anyone is interested, the reason this is a bit simplistic is that the number of reflections you need depends on how good your model is.  If you look at the contribution to the likelihood function from one reflection, it is very broad for low sigmaA values and becomes sharper as the sigmaA values increase.  This means that, if the true value of sigmaA is low, you need more reflections to get a precise estimate than if the true sigmaA value is high.  This happens because, if sigmaA is low, any value of Fo could be expected for a particular Fc because the model predicts the data poorly, but if sigmaA is high, then there is a very restricted range of possible values for Fo given Fc.  So to get stable refinement from a very poor model, you might need to set aside a larger number of reflections for cross-validated sigmaA estimation.  Later on, when the model is better, you could afford to absorb some of those reflections into the working set."
+ In case anyone is interested, the reason this is a bit simplistic is that the number of reflections you need depends on how
+ good your model is.  If you look at the contribution to the likelihood function from one reflection, it is very broad for
+ low sigmaA values and becomes sharper as the sigmaA values increase.  This means that, if the true value of sigmaA is low,
+ you need more reflections to get a precise estimate than if the true sigmaA value is high.  This happens because, if sigmaA
+ is low, any value of Fo could be expected for a particular Fc because the model predicts the data poorly, but if sigmaA is
+ high, then there is a very restricted range of possible values for Fo given Fc.  So to get stable refinement from a very poor
+ model, you might need to set aside a larger number of reflections for cross-validated sigmaA estimation.  Later on, when the
+ model is better, you could afford to absorb some of those reflections into the working set.