Performance
Considerations
In order of decreasing effect:
- XDS scales well (i.e. the wallclock time for data processing goes down when the number of available cores is increased) in the COLSPOT, IDXREF, INTEGRATE and CORRECT steps when using the MAXIMUM_NUMBER_OF_PROCESSORS keyword. This triggers program-level parallelization, using OpenMP threads.
- the program scales very well in the COLSPOT and INTEGRATE steps when using the MAXIMUM_NUMBER_OF_JOBS keyword. This triggers shell-level parallelization. There is a slight penalty associated with high values of MAXIMUM_NUMBER_OF_JOBS:
- in INTEGRATE, geometry refinement results are not transferred between JOBs: see Pathologies;
- in COLSPOT, the phi values at the borders between JOBs are less accurate (in particular if the mosaicity is high), and the same reflection may be listed twice in SPOT.XDS if it extends over the border between JOBs. The latter effect may be mitigated by having as many SPOT_RANGEs as JOBs, and leaving gaps between the SPOT_RANGEs; see Problems#IDXREF_produces_too_long_axes.
- combining these two keywords gives the highest performance in my experience (see 2VB1#XDS_processing for an example). As a rough guide, I'd choose them to be approximately equal; an even number for MAXIMUM_NUMBER_OF_PROCESSORS should be chosen because that fits better with usual hardware. If in doubt, use a lower number for MAXIMUM_NUMBER_OF_JOBS than for MAXIMUM_NUMBER_OF_PROCESSORS.
- some overcommitting of resources (i.e. MAXIMUM_NUMBER_OF_PROCESSORS * MAXIMUM_NUMBER_OF_JOBS > number of cores) is beneficial; you'll have to play with these two parameters.
- the next thing to consider is DELPHI together with OSCILLATION_RANGE: if DELPHI (the rotation range covered by one batch of frames) is an integer multiple of MAXIMUM_NUMBER_OF_PROCESSORS * OSCILLATION_RANGE, the threads are nicely balanced. For this purpose, you may want to change (if possible, raise) the value of DELPHI (default is 5 degrees). With fine-slicing, thread imbalance is not an issue - but for those users who want to collect 1° frames (which I think is not the best way nowadays ...) it should be a consideration. An additional consideration: the total number of frames should be an integer multiple of the intended number of frames in a batch. Example: 360 frames of 0.5° can be processed optimally on an 8-core machine by specifying DELPHI=4, since then there are 8 frames in a batch and the complete dataset has 45 batches. For weak data one should consider raising DELPHI to 12; that would give 15 batches. A trick: if you want to use DELPHI=8 in this situation, just specify DATA_RANGE=1 368 (pretending 23 batches of 8°) instead of DATA_RANGE=1 360. XDS will complain about the missing 8 frames, but that has no adverse effects except that no FRAME.cbf will be produced. None of this matters much for a single dataset, but for mass processing of datasets it does make a difference (see the example XDS.INP lines after this list).
- performance-wise, I/O also plays a role: as soon as you run 24 or so processes, a single Gigabit Ethernet connection may become limiting. On the other hand, shell-level parallelization smooths the load.
- REFINE(INTEGRATE)= ! (empty list) makes INTEGRATE go much faster through the frames, since frames are processed less often when analyzing a batch of frames, and no geometry refinement takes place.
- XDS with the MAXIMUM_NUMBER_OF_JOBS keyword can use several machines. This requires some setup as explained at the bottom of [1].
- Hyperthreading (SMT), if available on Intel CPUs, is beneficial. A "virtual" core has only about 20% of the performance of a "physical" core, but it comes at no cost - you just have to switch it on in the BIOS of the machine.
- The 64-bit binaries generally are a bit faster than the 32-bit binaries (but that's not specific for XDS). The latter are no longer distributed anyway.
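As an illustration of how these keywords combine - a sketch only; the numbers match the 8-core, 360 × 0.5° scenario from the DELPHI item above and have to be adapted to the actual hardware and dataset - the relevant XDS.INP lines might look like this:

MAXIMUM_NUMBER_OF_PROCESSORS= 8   ! OpenMP threads per job (matches the 8 cores)
MAXIMUM_NUMBER_OF_JOBS= 2         ! shell-level jobs; some overcommitting is beneficial
OSCILLATION_RANGE= 0.5
DELPHI= 4                         ! 4 / (8 * 0.5) = 1, so threads are balanced; 8 frames per batch, 45 batches
DATA_RANGE= 1 360
! REFINE(INTEGRATE)=              ! uncomment to skip geometry refinement in INTEGRATE (faster; see above)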
Cluster
In a cluster of computers, one has to modify the forkcolspot and forkintegrate scripts (which are part of the XDS distribution) as shown in forkcolspot_cluster and forkintegrate_cluster. The computers called "node1" to "node4" in these example scripts have to be replaced with actual computer names that are reachable by ssh and have the relevant directories NFS-mounted under the same paths.
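Before running, it may help to verify that every node is reachable by ssh without a password prompt and sees the data directory under the same path. A minimal sketch (the host names node1 ... node4 are placeholders, and the check assumes the current working directory is the processing directory):

for h in node1 node2 node3 node4
do
   ssh "$h" "ls -d $PWD" || echo "$h: cannot see $PWD"
done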
Multi-socket machines
Multi-socket machines consist of several nodes, each comprising several CPUs and some amount of memory. The nodes are connected by specialized hardware (sometimes called interconnect or bus) that transports data between the nodes. Typically, node-local memory is faster to read and write than memory on a different node. This NUMA (non-uniform memory access) setup has consequences for performance when running XDS jobs.
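The layout of a particular machine (number of NUMA nodes, and which CPU cores and how much memory belong to each node) can be inspected with standard tools, e.g.:

numactl --hardware              # nodes with their CPUs and memory
lscpu | grep -iE 'socket|numa'  # sockets, cores per socket, NUMA node summary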
In particular, good performance is obtained if MAXIMUM_NUMBER_OF_JOBS is chosen as the number of nodes, and MAXIMUM_NUMBER_OF_PROCESSORS is chosen as the number of CPU cores (physical + virtual) of each socket. One then has to take care that each job ends up on its own socket. The following scripts do this. Please note that numactl has to be installed.
#!/bin/bash
# forkcolspot
#
#   enables multi-tasking by splitting the COLSPOT step of
#   xds into independent jobs. Each job is carried out by the
#   Fortran program mcolspot or mcolspot_par started by this
#   script as a background process with a different set of
#   input parameters.
#
#   'forkcolspot' is called by xds or xds_par in the COLSPOT
#   step using the Fortran instruction
#   CALL SYSTEM('forkcolspot ntask maxcpu'),
#      ntask  ::total number of jobs
#      maxcpu ::maximum number of processors used by each job
#
#   Clearly, this can only work if forkcolspot, mcolspot, and
#   mcolspot_par are correctly installed in the search path
#   for executables.
#
#   W.Kabsch and K.Rohm   Version Februar 2005
#   NOTE: No blanks allowed adjacent to the = signs !!!
#   K.Diederichs 3/2016 NUMA affinity added

#export KMP_AFFINITY="verbose"
maxnode=`numactl -H|awk '/available/{print $2-1}'`
#echo highest node is $maxnode

ntask=$1   #total number of jobs
maxcpu=$2  #maximum number of processors used by each job
           #maxcpu=1: use 'mcolspot'     (single processor)
           #maxcpu>1: use 'mcolspot_par' (openmp version)

pids=""    #list of background process ID's
itask=1
inode=0    # initialize inode
while test $itask -le $ntask
do
 # KD modification: which node?
   let inode=$inode+1
   if [ $inode -gt $maxnode ]
   then
      let inode=0
   fi
 #end modification
   if [ $maxcpu -gt 1 ]
   then
      echo "$itask" | numactl --cpunodebind=$inode mcolspot_par &
   else
      echo "$itask" | mcolspot &
   fi
   pids="$pids $!"          #append id of the background process just started
   itask=`expr $itask + 1`
done
trap "kill -15 $pids" 2 15  # 2:Control-C; 15:kill
wait                        #wait for all background processes issued by this shell
rm -f mcolspot.tmp          #this temporary file was generated by xds
#!/bin/bash
# forkintegrate
#
#   enables multi-tasking by splitting the INTEGRATE step of
#   xds into independent jobs. Each job is carried out by the
#   Fortran program mintegrate or mintegrate_par started by
#   this script as a background process with a different set
#   of input parameters.
#
#   'forkintegrate' is called by xds (or xds_par) in the
#   INTEGRATE step using the Fortran instruction
#   CALL SYSTEM('forkintegrate fframe ni ntask niba0 maxcpu'),
#      fframe ::id number of the first data image
#      ni     ::number of images in the data set
#      ntask  ::total number of jobs
#      niba0  ::minimum number of images in a batch
#      maxcpu ::maximum number of processors used by each job
#
#   Clearly, this can only work if forkintegrate, mintegrate,
#   and mintegrate_par are correctly installed in the search
#   path for executables.
#
#   W.Kabsch and K.Rohm   Version Februar 2005
#   NOTE: No blanks allowed adjacent to the = signs !!!
#   K.Diederichs 3/2016 NUMA affinity added

#export KMP_AFFINITY="verbose"
maxnode=`numactl -H|awk '/available/{print $2-1}'`
#echo highest node is $maxnode

fframe=$1  #id number of the first image
ni=$2      #number of images in the data set
ntask=$3   #total number of jobs
niba0=$4   #minimum number of images in a batch
maxcpu=$5  #maximum number of processors used by each job
           #maxcpu=1: use 'mintegrate'     (single processor)
           #maxcpu>1: use 'mintegrate_par' (openmp version)

minitask=$(($ni / $ntask))  #minimum number of images in a job
mtask=$(($ni % $ntask))     #number of jobs with minitask+1 images
pids=""                     #list of background process ID's
nba=0
litask=0
itask=1
inode=0    # initialize inode
while test $itask -le $ntask
do
 # KD modification: which node?
   let inode=$inode+1
   if [ $inode -gt $maxnode ]
   then
      let inode=0
   fi
 #end modification
   if [ $itask -gt $mtask ]
   then
      nitask=$minitask
   else
      nitask=$(($minitask + 1))
   fi
   fitask=`expr $litask + 1`
   litask=`expr $litask + $nitask`
   if [ $nitask -lt $niba0 ]
   then
      n=$nitask
   else
      n=$niba0
   fi
   if [ $n -lt 1 ]
   then
      n=1
   fi
   nbatask=$(($nitask / $n))
   nba=`expr $nba + $nbatask`
   image1=$(($fframe + $fitask - 1))  #id number of the first image
   if [ $maxcpu -gt 1 ]
   then
      echo "$image1 $nitask $itask $nbatask" | numactl --cpunodebind=$inode mintegrate_par &
   else
      echo "$image1 $nitask $itask $nbatask" | mintegrate &
   fi
   pids="$pids $!"          #append id of the background process just started
   itask=`expr $itask + 1`
done
trap "kill -15 $pids" 2 15  # 2:Control-C; 15:kill
wait                        #wait for all background processes issued by this shell
rm -f mintegrate.tmp        #this temporary file was generated by xds
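The modified scripts only take effect if the shell finds them before the ones from the XDS distribution. A quick check - assuming, as an example, that they were saved in ~/bin and that ~/bin precedes the XDS directory in $PATH:

chmod +x ~/bin/forkcolspot ~/bin/forkintegrate
which forkcolspot    # should print ~/bin/forkcolspot, not the distributed version
which forkintegrate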
As an alternative to numactl, one may use taskset or KMP_AFFINITY.
If likwid were used instead of numactl, one would have much better control over affinity groups.
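As a sketch of these alternatives (the core lists 0-7 and S0:0-7 are placeholders and must match the actual socket layout of the machine), the numactl call inside the loop could be replaced by, for example:

# with taskset (explicit core list):
echo "$itask" | taskset -c 0-7 mcolspot_par &
# with likwid-pin (cores 0-7 of socket 0):
echo "$itask" | likwid-pin -c S0:0-7 mcolspot_par &
# or, instead of pinning each job, let the Intel OpenMP runtime place the threads (affects all jobs alike):
export KMP_AFFINITY="granularity=core,compact"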
In my tests on a 4-socket machine, the difference between runs with the original scripts and the NUMA-aware ones was a reduction of wallclock time by about 8%. With a 2-socket machine, I saw a <1% effect. But this will depend very much on the specific hardware.
Multi-socket machines in a cluster
In this case, I'd suggest modifying e.g. forkcolspot_cluster so that it does not run mcolspot_par directly on the remote machine, but instead runs a script on that machine that checks the number of NUMA nodes and starts mcolspot_par on the right node.
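A minimal sketch of such a wrapper - the name run_mcolspot_on_node is made up; it would be invoked on the remote machine (e.g. via ssh from a modified forkcolspot_cluster) in place of mcolspot_par, reads the task number from stdin exactly as mcolspot_par does, and binds the job to one of the local NUMA nodes:

#!/bin/bash
# run_mcolspot_on_node (hypothetical wrapper around mcolspot_par)
read itask                                          # task number, passed on stdin by the fork script
nnodes=`numactl -H | awk '/available/{print $2}'`   # number of NUMA nodes on this machine
inode=$(( itask % nnodes ))                         # spread consecutive tasks over the nodes
echo "$itask" | numactl --cpunodebind=$inode mcolspot_par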