Performance: Difference between revisions

Line 8:

#*in COLSPOT, the phi values at the borders between JOBs are less accurate (in particular if the mosaicity is high), and the same reflection may be listed twice in SPOT.XDS if it extends over the border between JOBs. The latter effect may be mitigated by having as many SPOT_RANGEs as JOBs, and leaving gaps between the SPOT_RANGEs; see [[Problems#IDXREF_produces_too_long_axes]].

# combining these two keywords gives the highest performance (see [[2VB1#XDS_processing]] for an example). As a rough guide, I'd choose them to be approximately equal; an even number for MAXIMUM_NUMBER_OF_PROCESSORS should be chosen because that fits better with usual hardware. If in doubt, use a lower number for MAXIMUM_NUMBER_OF_JOBS than for MAXIMUM_NUMBER_OF_PROCESSORS. Since 2017, XDS has an automatic feature that divides up the available cores into JOBS (operating system processes) running with multiple cores each. You use this automatic feature when you run <code>xds_par</code>, and don't specify MAXIMUM_NUMBER_OF_JOBS.

# NUMBER_OF_IMAGES_IN_CACHE avoids repeated (3-fold) reading of data frames in the INTEGRATE task during processing of a batch of frames. This comes at the expense of memory (RAM) and is discussed in [[Eiger]]. The default is DELPHI/OSCILLATION_RANGE+1 and is usually adequate. Only on low-memory systems (e.g a 8GB RAM machine for processing Eiger 16M data collected with 0.1° oscillation range, at DELPHI=5 and MAXIMUM_NUMBER_OF_JOBS=1) should this be set to 0, to conserve memory and avoid slow processing due to thrashing, or even killed XDS processes. If the cache size of a process exceeds 8GB, XDS will print a warning, and in that case the user has to explicitly include a NUMBER_OF_IMAGES_IN_CACHE=<desired number> line in XDS.INP, to confirm that actually so much memory should be used. '''Typically, you don't specify NUMBER_OF_IMAGES_IN_CACHE'''.

# XDS with the MAXIMUM_NUMBER_OF_JOBS and CLUSTER_NODES keywords can use [[Performance#Cluster|several machines]]. This requires some setup as explained at the bottom of [http://www.mpimf-heidelberg.mpg.de/~kabsch/xds/html_doc/downloading.html].

# Hyperthreading (SMT), if available, is often beneficial. A "virtual" core has only about 20% performance of a "physical" core but it comes at no cost - you just have to switch it on in the BIOS of the machine.

# some overcommitting of resources (i.e. MAXIMUM_NUMBER_OF_PROCESSORS * MAXIMUM_NUMBER_OF_JOBS > number of cores) may be beneficial; you'll have to play with these two parameters since this depends on the actual hardware. If MAXIMUM_NUMBER_OF_PROCESSORS * MAXIMUM_NUMBER_OF_JOBS is >4096 (the default in RHEL), you may have to adjust the maxproc limit of your shell; in bash: <code>ulimit -u unlimited</code>.

# the next thing to consider is [http://www.mpimf-heidelberg.mpg.de/~kabsch/xds/html_doc/xds_parameters.html#DELPHI= DELPHI] together with [http://www.mpimf-heidelberg.mpg.de/~kabsch/xds/html_doc/xds_parameters.html#OSCILLATION_RANGE= OSCILLATION_RANGE]: if DELPHI (the rotation range of a ''batch'' of frames) is an integer multiple of MAXIMUM_NUMBER_OF_PROCESSORS * OSCILLATION_RANGE that would be good because it nicely balances the usage of the threads. For this purpose, you may want to change (if possible, raise) the value of DELPHI (default is 5 degrees). If you are doing fine-slicing then mis-balancing of threads is not an issue - but for those users who want to collect 1° frames (which I think is not the best way nowadays ...) it should be a consideration. Additional consideration: the total number of frames should be an integer multiple of the intended number of frames in a batch. Example: 360 frames of 0.5° can be processed on a 8-core machine optimally by specifying DELPHI=4, since then there are 8 frames in a batch and the complete dataset has 45 batches. For weak data one should consider raising DELPHI to 12; that would give 15 batches. A trick: if you want to use DELPHI=8 in this situation then just specify DATA_RANGE=1 368 (pretending 23 batches of 8°) instead of DATA_RANGE=1 360 . XDS will complain about the missing 8 frames, but that has no adverse effects except that no FRAME.cbf will be produced. All of this doesn't matter for a single dataset, but for mass processing of datasets it does make a difference.

# performance-wise, I/O also plays a role because as soon as you run 24 or so processes, a single Gigabit ethernet connection may be limiting. OTOH shell-level parallelization smoothes the load.

# REFINE(INTEGRATE)= ! (empty list) makes INTEGRATE go much faster through the frames, since frames are processed less often when analyzing a batch of frames, and no geometry refinement takes place.

# REFINE(INTEGRATE)= ! (empty list) makes INTEGRATE go much faster through the frames, since frames are processed less often when analyzing a batch of frames, and no geometry refinement takes place. TRUSTED_REGION=0 X results in fast processing if X is less than 1.4142 because that reduces the number of pixels of a rectangular detector that will be evaluated; X should of course be chosen such that it does not result in omission of useful data. For fast processing, the defaults for NUMBER_OF_PROFILE_GRID_POINTS_ALONG_ALPHA/BETA and NUMBER_OF_PROFILE_GRID_POINTS_ALONG_GAMMA should be used.

~~# XDS with the MAXIMUM_NUMBER_OF_JOBS and CLUSTER_NODES keywords can use [[Performance#Cluster|several machines]]~~. ~~This requires some setup as explained at~~ the ~~bottom~~ of ~~[http://www.mpimf-heidelberg.mpg.de/~kabsch/xds/html_doc/downloading.html].~~

~~# Hyperthreading (SMT), if available, is often beneficial. A "virtual" core has only about 20% performance~~ of a ~~"physical" core but it comes at no cost - you just have to switch~~ it on in ~~the BIOS~~ of the ~~machine~~.

== Cluster ==

@@ Line 8: / Line 8: @@
 #*in COLSPOT, the phi values at the borders between JOBs are less accurate (in particular if the mosaicity is high), and the same reflection may be listed twice in SPOT.XDS if it extends over the border between JOBs. The latter effect may be mitigated by having as many SPOT_RANGEs as JOBs, and leaving gaps between the SPOT_RANGEs; see [[Problems#IDXREF_produces_too_long_axes]].
 # combining these two keywords gives the highest performance (see [[2VB1#XDS_processing]] for an example). As a rough guide, I'd choose them to be approximately equal; an even number for MAXIMUM_NUMBER_OF_PROCESSORS should be chosen because that fits better with usual hardware. If in doubt, use a lower number for MAXIMUM_NUMBER_OF_JOBS than for MAXIMUM_NUMBER_OF_PROCESSORS. Since 2017, XDS has an automatic feature that divides up the available cores into JOBS (operating system processes) running with multiple cores each. You use this automatic feature when you run <code>xds_par</code>, and don't specify MAXIMUM_NUMBER_OF_JOBS.
 # NUMBER_OF_IMAGES_IN_CACHE avoids repeated (3-fold) reading of data frames in the INTEGRATE task during processing of a batch of frames. This comes at the expense of memory (RAM) and is discussed in [[Eiger]]. The default is DELPHI/OSCILLATION_RANGE+1 and is usually adequate. Only on low-memory systems (e.g a 8GB RAM machine for processing Eiger 16M data collected with 0.1° oscillation range, at DELPHI=5 and MAXIMUM_NUMBER_OF_JOBS=1) should this be set to 0, to conserve memory and avoid slow processing due to thrashing, or even killed XDS processes. If the cache size of a process exceeds 8GB, XDS will print a warning, and in that case the user has to explicitly include a NUMBER_OF_IMAGES_IN_CACHE=<desired number> line in XDS.INP, to confirm that actually so much memory should be used. '''Typically, you don't specify NUMBER_OF_IMAGES_IN_CACHE'''.
+# XDS with the MAXIMUM_NUMBER_OF_JOBS and CLUSTER_NODES keywords can use [[Performance#Cluster|several machines]]. This requires some setup as explained at the bottom of [http://www.mpimf-heidelberg.mpg.de/~kabsch/xds/html_doc/downloading.html].
+# Hyperthreading (SMT), if available, is often beneficial. A "virtual" core has only about 20% performance of a "physical" core but it comes at no cost - you just have to switch it on in the BIOS of the machine.
 # some overcommitting of resources (i.e. MAXIMUM_NUMBER_OF_PROCESSORS * MAXIMUM_NUMBER_OF_JOBS > number of cores) may be beneficial; you'll have to play with these two parameters since this depends on the actual hardware. If MAXIMUM_NUMBER_OF_PROCESSORS * MAXIMUM_NUMBER_OF_JOBS is >4096 (the default in RHEL), you may have to adjust the maxproc limit of your shell; in bash: <code>ulimit -u unlimited</code>.
 # the next thing to consider is [http://www.mpimf-heidelberg.mpg.de/~kabsch/xds/html_doc/xds_parameters.html#DELPHI= DELPHI] together with [http://www.mpimf-heidelberg.mpg.de/~kabsch/xds/html_doc/xds_parameters.html#OSCILLATION_RANGE= OSCILLATION_RANGE]: if DELPHI (the rotation range of a ''batch'' of frames) is an integer multiple of MAXIMUM_NUMBER_OF_PROCESSORS * OSCILLATION_RANGE that would be good because it nicely balances the usage of the threads. For this purpose, you may want to change (if possible, raise) the value of DELPHI (default is 5 degrees). If you are doing fine-slicing then mis-balancing of threads is not an issue - but for those users who want to collect 1° frames (which I think is not the best way nowadays ...) it should be a consideration. Additional consideration: the total number of frames should be an integer multiple of the intended number of frames in a batch. Example: 360 frames of 0.5° can be processed on a 8-core machine optimally by specifying DELPHI=4, since then there are 8 frames in a batch and the complete dataset has 45 batches. For weak data one should consider raising DELPHI to 12; that would give 15 batches. A trick: if you want to use DELPHI=8 in this situation then just specify DATA_RANGE=1 368 (pretending 23 batches of 8°) instead of DATA_RANGE=1 360 . XDS will complain about the missing 8 frames, but that has no adverse effects except that no FRAME.cbf will be produced. All of this doesn't matter for a single dataset, but for mass processing of datasets it does make a difference.
 # performance-wise, I/O also plays a role because as soon as you run 24 or so processes, a single Gigabit ethernet connection may be limiting. OTOH shell-level parallelization smoothes the load.
-# REFINE(INTEGRATE)= ! (empty list) makes INTEGRATE go much faster through the frames, since frames are processed less often when analyzing a batch of frames, and no geometry refinement takes place.
+# REFINE(INTEGRATE)= ! (empty list) makes INTEGRATE go much faster through the frames, since frames are processed less often when analyzing a batch of frames, and no geometry refinement takes place. TRUSTED_REGION=0 X results in fast processing if X is less than 1.4142 because that reduces the number of pixels of a rectangular detector that will be evaluated; X should of course be chosen such that it does not result in omission of useful data. For fast processing, the defaults for NUMBER_OF_PROFILE_GRID_POINTS_ALONG_ALPHA/BETA and NUMBER_OF_PROFILE_GRID_POINTS_ALONG_GAMMA should be used.
-# XDS with the MAXIMUM_NUMBER_OF_JOBS and CLUSTER_NODES keywords can use [[Performance#Cluster|several machines]]. This requires some setup as explained at the bottom of [http://www.mpimf-heidelberg.mpg.de/~kabsch/xds/html_doc/downloading.html].
-# Hyperthreading (SMT), if available, is often beneficial. A "virtual" core has only about 20% performance of a "physical" core but it comes at no cost - you just have to switch it on in the BIOS of the machine.
 == Cluster ==