Performance: Difference between revisions

@@ Line 20: / Line 20: @@
 If a cluster of computers is available that allows password-less login by <code>ssh</code> and has the relevant directories NFS-mounted under the same paths, one can use the [http://xds.mpimf-heidelberg.mpg.de/html_doc/xds_parameters.html#CLUSTER_NODES= CLUSTER_NODES=] keyword in XDS.INP.
-If the other computers are not reachable by <code>ssh</code>, but coupled with a batch queueing system, then the forkxds script of the XDS distribution has to be modified: the node names are not relevant, and the <code>ssh</code> invocation has to be replaced by a <code>qsub</code> invocation. An example script will be available soon.
+If the other computers are not reachable by <code>ssh</code>, but coupled with a batch queueing system, then the forkxds script of the XDS distribution has to be modified: the node names are not relevant, and the <code>ssh</code> invocation has to be replaced by a <code>qsub</code> invocation.
-== Multi-socket machines ==
-Multi-socket machines consist of several nodes, each comprising several CPUs and some amount of memory. The nodes are connected by specialized hardware (sometimes called interconnect or bus) that transports data between the nodes. Typically, node-local memory is faster to read and write than memory on a different node. This NUMA (non-uniform memory access) setup affects the performance of XDS jobs.
-In particular, good performance is obtained if MAXIMUM_NUMBER_OF_JOBS is set to the number of nodes and MAXIMUM_NUMBER_OF_PROCESSORS to the number of CPU cores (physical + virtual) of each socket. One then has to make sure that each job ends up on its own socket. The following scripts do this. Please note that <tt>numactl</tt> has to be installed.
-<pre>
-#!/bin/bash
-#                      forkcolspot
-#
-# enables  multi-tasking by splitting the COLSPOT step of
-# xds into independent jobs. Each job is carried out by the
-# Fortran program mcolspot or mcolspot_par started by this
-# script as a background process with a different set of
-# input parameters.
-#
-# 'forkcolspot' is called by xds or xds_par in the COLSPOT
-# step using the Fortran instruction
-# CALL SYSTEM('forkcolspot ntask maxcpu'),
-#    ntask  ::total number of jobs
-#   maxcpu  ::maximum number of processors used by each job
-#
-# Clearly, this can only work if forkcolspot, mcolspot, and
-# mcolspot_par are correctly installed in the search path
-# for executables.
-#
-# W.Kabsch and K.Rohm     Version Februar 2005
-# NOTE: No blanks allowed adjacent to the = signs !!!
-# K.Diederichs 3/2016 NUMA affinity added
-#export KMP_AFFINITY="verbose"
-maxnode=`numactl -H|awk '/available/{print $2-1}'`
-#echo highest node is $maxnode
-ntask=$1  #total number of jobs
-maxcpu=$2 #maximum number of processors used by each job
-	   #maxcpu=1: use 'mcolspot' (single processor)
-	   #maxcpu>1: use 'mcolspot_par' (openmp version)
-pids=""                    #list of background process ID's
-itask=1
-inode=0   # initialize inode
-while test $itask -le $ntask
-do
-# KD modification: which node?
-   let inode=$inode+1
-   if [ $inode -gt $maxnode ]
-      then let inode=0
-   fi
-#end modification
-   if [ $maxcpu -gt 1 ]
-      then echo "$itask" | numactl --cpunodebind=$inode mcolspot_par &
-      else echo "$itask" | mcolspot     &
-   fi
-   pids="$pids $!"  #append id of the background process just started
-   itask=`expr $itask + 1`
-done
-trap "kill -15 $pids" 2 15  # 2:Control-C; 15:kill
-wait  #wait for all background processes issued by this shell
-rm -f mcolspot.tmp  #this temporary file was generated by xds
-</pre>
-<pre>
-#!/bin/bash
-#                      forkintegrate
-#
-# enables  multi-tasking by splitting the INTEGRATE step of
-# xds into independent jobs. Each job is carried out by the
-# Fortran program mintegrate or mintegrate_par started by
-# this script as a background process with a different set
-# of input parameters.
-#
-# 'forkintegrate' is called by xds (or xds_par) in the
-# INTEGRATE step using the Fortran instruction
-# CALL SYSTEM('forkintegrate fframe ni ntask niba0 maxcpu'),
-#    fframe ::id number of the first data image
-#    ni     ::number of images in the data set
-#    ntask  ::total number of jobs
-#    niba0  ::minimum number of images in a batch
-#    maxcpu ::maximum number of processors used by each job
-#
-# Clearly, this can only work if forkintegrate, mintegrate,
-# and mintegrate_par are correctly installed in the search
-# path for executables.
-#
-# W.Kabsch and K.Rohm     Version Februar 2005
-# NOTE: No blanks allowed adjacent to the = signs !!!
-# K.Diederichs 3/2016 NUMA affinity added
-#export KMP_AFFINITY="verbose"
-maxnode=`numactl -H|awk '/available/{print $2-1}'`
-#echo highest node is $maxnode
-fframe=$1 #id number of the first image
-ni=$2     #number of images in the data set
-ntask=$3  #total number of jobs
-niba0=$4  #minimum number of images in a batch
-maxcpu=$5 #maximum number of processors used by each job
-	   #maxcpu=1: use 'mintegrate' (single processor)
-	   #maxcpu>1: use 'mintegrate_par' (openmp version)
-minitask=$(($ni / $ntask)) #minimum number of images in a job
-mtask=$(($ni % $ntask))    #number of jobs with minitask+1 images
-pids=""                    #list of background process ID's
-nba=0
-litask=0
-itask=1
-inode=0   # initialize inode
-while test $itask -le $ntask
-do
-# KD modification: which node?
-   let inode=$inode+1
-   if [ $inode -gt $maxnode ]
-      then let inode=0
-   fi
-#end modification
-   if [ $itask -gt $mtask ]
-      then nitask=$minitask
-      else nitask=$(($minitask + 1))
-   fi
-   fitask=`expr $litask + 1`
-   litask=`expr $litask + $nitask`
-   if [ $nitask -lt $niba0 ]
-      then n=$nitask
-      else n=$niba0
-   fi
-   if [ $n -lt 1 ]
-      then n=1
-   fi
-   nbatask=$(($nitask / $n))
-   nba=`expr $nba + $nbatask`
-   image1=$(($fframe + $fitask - 1)) #id number of the first image
-   if [ $maxcpu -gt 1 ]
-      then echo "$image1 $nitask $itask $nbatask" | numactl --cpunodebind=$inode mintegrate_par &
-      else echo "$image1 $nitask $itask $nbatask" | mintegrate     &
-   fi
-   pids="$pids $!"  #append id of the background process just started
-   itask=`expr $itask + 1`
-done
-trap "kill -15 $pids" 2 15  # 2:Control-C; 15:kill
-wait  #wait for all background processes issued by this shell
-rm -f mintegrate.tmp  #this temporary file was generated by xds
-</pre>
-As an alternative to <tt>numactl</tt>, one may use <tt>taskset</tt> or <tt>KMP_AFFINITY</tt>.
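For the <tt>taskset</tt> route, a minimal sketch could derive the CPU list of one node from <tt>numactl -H</tt> output and pass it to <tt>taskset -c</tt>. The helper name <tt>cpus_of_node</tt> is hypothetical and not part of XDS; it assumes the usual <tt>node N cpus: ...</tt> lines in the <tt>numactl -H</tt> output.

```shell
#!/bin/bash
# Hypothetical helper (not part of the XDS distribution):
# reads `numactl -H` style output on stdin and prints the CPUs of
# node $1 as a comma-separated list suitable for `taskset -c`.
cpus_of_node() {
    awk -v n="$1" '$1 == "node" && $2 == n && $3 == "cpus:" {
        list = $4
        for (i = 5; i <= NF; i++) list = list "," $i   # join CPU ids with commas
        print list
    }'
}

# Possible use inside forkcolspot, replacing the numactl call (assumption):
#   cpulist=$(numactl -H | cpus_of_node "$inode")
#   echo "$itask" | taskset -c "$cpulist" mcolspot_par &
```

Note that <tt>taskset</tt> only pins the CPUs; where the memory ends up is left to the default allocation policy of the kernel.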
-Using <tt>[https://github.com/RRZE-HPC/likwid/wiki likwid]</tt> instead of <tt>numactl</tt> would give much finer control over affinity groups.
-In my tests on a 4-socket machine, the difference between runs with the original scripts and the NUMA-aware ones was a reduction of wallclock time by about 8%. With a 2-socket machine, I saw a <1% effect. But this will depend very much on the specific hardware.
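To pick the two XDS.INP values discussed in this section, the node count and the number of CPUs per node can be read from <tt>numactl -H</tt>. The following is only a sketch: the function name <tt>suggest_xds_settings</tt> is hypothetical, and it assumes that all nodes have the same number of CPUs as node 0.

```shell
#!/bin/bash
# Hypothetical helper (not part of the XDS distribution):
# reads `numactl -H` style output on stdin and prints suggested
# XDS.INP settings: one job per NUMA node, one thread per CPU of a node.
suggest_xds_settings() {
    awk '
        /^available:/   { nodes = $2 }        # "available: 4 nodes (0-3)"
        /^node 0 cpus:/ { cores = NF - 3 }    # fields after "node 0 cpus:"
        END {
            if (nodes < 1 || cores < 1) exit 1   # could not parse the input
            printf "MAXIMUM_NUMBER_OF_JOBS=%d\n", nodes
            printf "MAXIMUM_NUMBER_OF_PROCESSORS=%d\n", cores
        }
    '
}

# Typical use (requires numactl to be installed):
#   numactl -H | suggest_xds_settings
```

The printed lines can be pasted into XDS.INP; whether they are the best choice still depends on the specific hardware.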
-== Multi-socket machines in a cluster ==
-For a cluster of multi-socket machines, I'd suggest modifying e.g. <tt>forkcolspot_cluster</tt> so that it does not run <tt>mcolspot_par</tt> directly on the remote machine, but rather runs a script on that machine that checks the number of nodes and runs <tt>mcolspot_par</tt> on the right node.
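Such a remote-side script might look like the sketch below. It is an assumption, not a tested part of the XDS distribution: the function names are hypothetical, the task number is assumed to be passed as the first argument in addition to being piped to <tt>mcolspot_par</tt> on stdin, and the node is chosen round-robin in the same way as in the scripts above.

```shell
#!/bin/bash
# Hypothetical wrapper to be run on the remote machine in place of a
# direct call to mcolspot_par.

# Map a task number to a NUMA node, round-robin over nnodes nodes.
node_for_task() {
    local itask=$1 nnodes=$2
    echo $(( itask % nnodes ))
}

# Look up the node count locally and start mcolspot_par on the chosen node.
# stdin (the task number written by the fork script) is passed through.
run_on_numa_node() {
    local itask=$1 nnodes
    nnodes=$(numactl -H | awk '/available/{print $2}')
    numactl --cpunodebind="$(node_for_task "$itask" "$nnodes")" mcolspot_par
}

# Assumed invocation from the modified forkcolspot_cluster:
#   echo "$itask" | ssh "$host" "cd $PWD && /path/to/wrapper $itask"
# with the wrapper ending in:  run_on_numa_node "$1"
```

Because every remote machine determines its own node count, machines with different numbers of sockets can be mixed in one cluster.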
 == processing compressed data ==