Performance: Difference between revisions

(2 intermediate revisions by the same user not shown)

Line 13:

# some overcommitting of resources (i.e. MAXIMUM_NUMBER_OF_PROCESSORS * MAXIMUM_NUMBER_OF_JOBS > number of cores) may be beneficial; you'll have to play with these two parameters since this depends on the actual hardware. If MAXIMUM_NUMBER_OF_PROCESSORS * MAXIMUM_NUMBER_OF_JOBS is >4096 (the default in RHEL), you may have to adjust the maxproc limit of your shell; in bash: <code>ulimit -u unlimited</code>.

# the next thing to consider is [http://xds.mpimf-heidelberg.mpg.de/html_doc/xds_parameters.html#DELPHI= DELPHI] together with [http://xds.mpimf-heidelberg.mpg.de/html_doc/xds_parameters.html#OSCILLATION_RANGE= OSCILLATION_RANGE]: if DELPHI (the rotation range of a ''batch'' of frames) is an integer multiple of MAXIMUM_NUMBER_OF_PROCESSORS * OSCILLATION_RANGE that would be good because it nicely balances the usage of the threads. For this purpose, you may want to change (if possible, raise) the value of DELPHI (default is 5 degrees). If you are doing fine-slicing then mis-balancing of threads is not an issue - but for those users who want to collect 1° frames (which I think is not the best way nowadays ...) it should be a consideration. Additional consideration: the total number of frames should be an integer multiple of the intended number of frames in a batch. Example: 360 frames of 0.5° can be processed on a 8-core machine optimally by specifying DELPHI=4, since then there are 8 frames in a batch and the complete dataset has 45 batches. For weak data one should consider raising DELPHI to 12; that would give 15 batches. A trick: if you want to use DELPHI=8 in this situation then just specify DATA_RANGE=1 368 (pretending 23 batches of 8°) instead of DATA_RANGE=1 360 . XDS will complain about the missing 8 frames, but that has no adverse effects except that no FRAME.cbf will be produced. All of this doesn't matter for a single dataset, but for mass processing of datasets it does make a difference.

# performance-wise, I/O also plays a role because as soon as you run 24 or so processes, a single Gigabit ethernet connection may be limiting. OTOH shell-level parallelization smoothes the load.

# performance-wise, I/O also plays a role because as soon as you run 24 or so processes, a single Gigabit ethernet connection may be limiting. OTOH shell-level parallelization smoothes the load. With BUILTs of XDS before 20191211, you could experiment with an environment variable FORT_BUFFERED that when set to TRUE (in ~/.bashrc or better systemwide, with <code>export FORT_BUFFERED=TRUE</code>), will result in faster I/O. Attention: on some installations this has resulted in XDS crashes, so if you get crashes with error message <code>forrtl: severe (67): input statement requires too much data</code> then don't use this option! Since BUILT=20191211, a new version of the ifort compiler is used which should fix the underlying compiler bug. This and later BUILTs employ the option -assume buffered_io, and the environment variable is no longer needed for fast I/O.

# REFINE(INTEGRATE)= ! (empty list) makes INTEGRATE go much faster through the frames, since frames are processed less often when analyzing a batch of frames, and no geometry refinement takes place. TRUSTED_REGION=0 X results in fast processing if X is less than 1.4142 because that reduces the number of pixels of a rectangular detector that will be evaluated; X should of course be chosen such that it does not result in omission of useful data. For fast processing, the defaults for NUMBER_OF_PROFILE_GRID_POINTS_ALONG_ALPHA/BETA and NUMBER_OF_PROFILE_GRID_POINTS_ALONG_GAMMA should be used.

== Cluster ==

If a cluster of computers is available that allow login ~~- without asking for a password - by~~ <code>ssh</code> and that have NFS-mounted the relevant directories under the same paths, one can use the [http://xds.mpimf-heidelberg.mpg.de/html_doc/xds_parameters.html#CLUSTER_NODES= CLUSTER_NODES=] keyword in XDS.INP.

If a cluster of computers is available that allow login with <code>ssh</code> without asking for a password, and that have NFS-mounted the relevant directories under the same paths, one can use the [http://xds.mpimf-heidelberg.mpg.de/html_doc/xds_parameters.html#CLUSTER_NODES= CLUSTER_NODES=] keyword in XDS.INP. Attention: if <code>CLUSTER_NODES=a b c</code> then task 1 will run on node b, task 2 on node c, and task 3 on node a. This is due to the logic in the current (BUILT=20191015) version of <code>forkxds</code> (part of XDS package).

If the other computers are not reachable by <code>ssh</code>, but coupled with a batch queueing system, then the forkxds script of the XDS distribution has to be modified: the node names are not relevant, and the <code>ssh</code> invocation has to be replaced by a <code>qsub</code> invocation - see [[Cluster Installation]].

@@ Line 13: / Line 13: @@
 # some overcommitting of resources (i.e. MAXIMUM_NUMBER_OF_PROCESSORS * MAXIMUM_NUMBER_OF_JOBS > number of cores) may be beneficial; you'll have to play with these two parameters since this depends on the actual hardware. If MAXIMUM_NUMBER_OF_PROCESSORS * MAXIMUM_NUMBER_OF_JOBS is >4096 (the default in RHEL), you may have to adjust the maxproc limit of your shell; in bash: <code>ulimit -u unlimited</code>.
 # the next thing to consider is [http://xds.mpimf-heidelberg.mpg.de/html_doc/xds_parameters.html#DELPHI= DELPHI] together with [http://xds.mpimf-heidelberg.mpg.de/html_doc/xds_parameters.html#OSCILLATION_RANGE= OSCILLATION_RANGE]: if DELPHI (the rotation range of a ''batch'' of frames) is an integer multiple of MAXIMUM_NUMBER_OF_PROCESSORS * OSCILLATION_RANGE that would be good because it nicely balances the usage of the threads. For this purpose, you may want to change (if possible, raise) the value of DELPHI (default is 5 degrees). If you are doing fine-slicing then mis-balancing of threads is not an issue - but for those users who want to collect 1° frames (which I think is not the best way nowadays ...) it should be a consideration. Additional consideration: the total number of frames should be an integer multiple of the intended number of frames in a batch. Example: 360 frames of 0.5° can be processed on a 8-core machine optimally by specifying DELPHI=4, since then there are 8 frames in a batch and the complete dataset has 45 batches. For weak data one should consider raising DELPHI to 12; that would give 15 batches. A trick: if you want to use DELPHI=8 in this situation then just specify DATA_RANGE=1 368 (pretending 23 batches of 8°) instead of DATA_RANGE=1 360 . XDS will complain about the missing 8 frames, but that has no adverse effects except that no FRAME.cbf will be produced. All of this doesn't matter for a single dataset, but for mass processing of datasets it does make a difference.
-# performance-wise, I/O also plays a role because as soon as you run 24 or so processes, a single Gigabit ethernet connection may be limiting. OTOH shell-level parallelization smoothes the load.
+# performance-wise, I/O also plays a role because as soon as you run 24 or so processes, a single Gigabit ethernet connection may be limiting. OTOH shell-level parallelization smoothes the load. With BUILTs of XDS before 20191211, you could experiment with an environment variable FORT_BUFFERED that when set to TRUE (in ~/.bashrc or better systemwide, with <code>export FORT_BUFFERED=TRUE</code>), will result in faster I/O. Attention: on some installations this has resulted in XDS crashes, so if you get crashes with error message <code>forrtl: severe (67): input statement requires too much data</code> then don't use this option! Since BUILT=20191211, a new version of the ifort compiler is used which should fix the underlying compiler bug. This and later BUILTs employ the option -assume buffered_io, and the environment variable is no longer needed for fast I/O.
 # REFINE(INTEGRATE)= ! (empty list) makes INTEGRATE go much faster through the frames, since frames are processed less often when analyzing a batch of frames, and no geometry refinement takes place. TRUSTED_REGION=0 X results in fast processing if X is less than 1.4142 because that reduces the number of pixels of a rectangular detector that will be evaluated; X should of course be chosen such that it does not result in omission of useful data. For fast processing, the defaults for NUMBER_OF_PROFILE_GRID_POINTS_ALONG_ALPHA/BETA and NUMBER_OF_PROFILE_GRID_POINTS_ALONG_GAMMA should be used.
 == Cluster ==
-If a cluster of computers is available that allow login - without asking for a password - by <code>ssh</code> and that have NFS-mounted the relevant directories under the same paths, one can use the [http://xds.mpimf-heidelberg.mpg.de/html_doc/xds_parameters.html#CLUSTER_NODES= CLUSTER_NODES=] keyword in XDS.INP.
+If a cluster of computers is available that allow login with <code>ssh</code> without asking for a password, and that have NFS-mounted the relevant directories under the same paths, one can use the [http://xds.mpimf-heidelberg.mpg.de/html_doc/xds_parameters.html#CLUSTER_NODES= CLUSTER_NODES=] keyword in XDS.INP. Attention: if <code>CLUSTER_NODES=a b c</code> then task 1 will run on node b, task 2 on node c, and task 3 on node a. This is due to the logic in the current (BUILT=20191015) version of <code>forkxds</code> (part of XDS package).
 If the other computers are not reachable by <code>ssh</code>, but coupled with a batch queueing system, then the forkxds script of the XDS distribution has to be modified: the node names are not relevant, and the <code>ssh</code> invocation has to be replaced by a <code>qsub</code> invocation - see [[Cluster Installation]].