Performance: Difference between revisions

6,267 bytes removed ,  5 April 2018
If a cluster of computers is available that allow login - without asking for a password - by <code>ssh</code> and that have NFS-mounted the relevant directories under the same paths, one can use the [http://xds.mpimf-heidelberg.mpg.de/html_doc/xds_parameters.html#CLUSTER_NODES= CLUSTER_NODES=] keyword in XDS.INP.  


If the other computers are not reachable by <code>ssh</code> but are attached to a batch queueing system, the forkxds script of the XDS distribution has to be modified: the node names become irrelevant, and the <code>ssh</code> invocation has to be replaced by a <code>qsub</code> invocation.
 
== Multi-socket machines ==
 
Multi-socket machines consist of several nodes, each comprising several CPU cores and some amount of memory. The nodes are connected by specialized hardware (sometimes called interconnect or bus) that transports data between them. Typically, node-local memory is faster to read and write than memory on a different node. This NUMA (non-uniform memory access) architecture has consequences for performance when running XDS jobs.
 
In particular, good performance is obtained if MAXIMUM_NUMBER_OF_JOBS is set to the number of nodes, and MAXIMUM_NUMBER_OF_PROCESSORS to the number of CPU cores (physical + virtual) of each node. One then has to make sure that each job ends up on its own node. The following scripts do this. Please note that <tt>numactl</tt> has to be installed.
<pre>
#!/bin/bash
#                      forkcolspot
#
# enables  multi-tasking by splitting the COLSPOT step of
# xds into independent jobs. Each job is carried out by the
# Fortran program mcolspot or mcolspot_par started by this
# script as a background process with a different set of
# input parameters.
#
# 'forkcolspot' is called by xds or xds_par in the COLSPOT
# step using the Fortran instruction
# CALL SYSTEM('forkcolspot ntask maxcpu'),
#    ntask  ::total number of jobs
#  maxcpu  ::maximum number of processors used by each job
#
# Clearly, this can only work if forkcolspot, mcolspot, and
# mcolspot_par are correctly installed in the search path
# for executables.
#
# W.Kabsch and K.Rohm    Version Februar 2005
# NOTE: No blanks allowed adjacent to the = signs !!!
 
# K.Diederichs 3/2016 NUMA affinity added
#export KMP_AFFINITY="verbose"
maxnode=`numactl -H|awk '/available/{print $2-1}'`
#echo highest node is $maxnode
 
ntask=$1  #total number of jobs
maxcpu=$2 #maximum number of processors used by each job
  #maxcpu=1: use 'mcolspot' (single processor)
  #maxcpu>1: use 'mcolspot_par' (openmp version)
 
pids=""                    #list of background process ID's
itask=1
inode=0  # initialize inode
while test $itask -le $ntask
do
# KD modification: which node?
  let inode=$inode+1
  if [ $inode -gt $maxnode ]
      then let inode=0
  fi
#end modification
  if [ $maxcpu -gt 1 ]
      then echo "$itask" | numactl --cpunodebind=$inode mcolspot_par &
      else echo "$itask" | mcolspot    &
  fi
  pids="$pids $!"  #append id of the background process just started
 
  itask=`expr $itask + 1`
done
trap "kill -15 $pids" 2 15  # 2:Control-C; 15:kill
wait  #wait for all background processes issued by this shell
rm -f mcolspot.tmp  #this temporary file was generated by xds
</pre>
 
<pre>
#!/bin/bash
#                      forkintegrate
#
# enables  multi-tasking by splitting the INTEGRATE step of
# xds into independent jobs. Each job is carried out by the
# Fortran program mintegrate or mintegrate_par started by
# this script as a background process with a different set
# of input parameters.
#
# 'forkintegrate' is called by xds (or xds_par) in the
# INTEGRATE step using the Fortran instruction
# CALL SYSTEM('forkintegrate fframe ni ntask niba0 maxcpu'),
#    fframe ::id number of the first data image
#    ni    ::number of images in the data set
#    ntask  ::total number of jobs
#    niba0  ::minimum number of images in a batch
#    maxcpu ::maximum number of processors used by each job
#
# Clearly, this can only work if forkintegrate, mintegrate,
# and mintegrate_par are correctly installed in the search
# path for executables.
#
# W.Kabsch and K.Rohm    Version Februar 2005
# NOTE: No blanks allowed adjacent to the = signs !!!
 
# K.Diederichs 3/2016 NUMA affinity added
#export KMP_AFFINITY="verbose"
maxnode=`numactl -H|awk '/available/{print $2-1}'`
#echo highest node is $maxnode
 
 
fframe=$1 #id number of the first image
ni=$2    #number of images in the data set
ntask=$3  #total number of jobs
niba0=$4  #minimum number of images in a batch
maxcpu=$5 #maximum number of processors used by each job
  #maxcpu=1: use 'mintegrate' (single processor)
  #maxcpu>1: use 'mintegrate_par' (openmp version)
 
minitask=$(($ni / $ntask)) #minimum number of images in a job
mtask=$(($ni % $ntask))    #number of jobs with minitask+1 images
pids=""                    #list of background process ID's
nba=0
litask=0
itask=1
inode=0  # initialize inode
while test $itask -le $ntask
do
# KD modification: which node?
  let inode=$inode+1
  if [ $inode -gt $maxnode ]
      then let inode=0
  fi
#end modification
  if [ $itask -gt $mtask ]
      then nitask=$minitask
      else nitask=$(($minitask + 1))
  fi
  fitask=`expr $litask + 1`
  litask=`expr $litask + $nitask`
  if [ $nitask -lt $niba0 ]
      then n=$nitask
      else n=$niba0
  fi
  if [ $n -lt 1 ]
      then n=1
  fi
  nbatask=$(($nitask / $n))
  nba=`expr $nba + $nbatask`
  image1=$(($fframe + $fitask - 1)) #id number of the first image
 
  if [ $maxcpu -gt 1 ]
      then echo "$image1 $nitask $itask $nbatask" | numactl --cpunodebind=$inode mintegrate_par &
      else echo "$image1 $nitask $itask $nbatask" | mintegrate    &
  fi
  pids="$pids $!"  #append id of the background process just started
 
  itask=`expr $itask + 1`
done
trap "kill -15 $pids" 2 15  # 2:Control-C; 15:kill
wait  #wait for all background processes issued by this shell
rm -f mintegrate.tmp  #this temporary file was generated by xds
</pre>
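The choice of MAXIMUM_NUMBER_OF_JOBS and MAXIMUM_NUMBER_OF_PROCESSORS described above can be illustrated with a small helper. This is only a sketch: the function name <tt>suggest_xds_settings</tt> is made up for this example, and it assumes that the logical CPUs are evenly distributed over the NUMA nodes. For a hypothetical machine with 4 nodes and 64 logical CPUs it prints 4 jobs with 16 processors each.

```shell
#!/bin/bash
# Sketch only: print suggested XDS.INP settings for a NUMA machine.
# 'suggest_xds_settings' is a hypothetical helper, not part of XDS.
# Assumption: logical CPUs are evenly distributed over the NUMA nodes.

# usage: suggest_xds_settings NODES LOGICAL_CPUS
suggest_xds_settings() {
  local nodes=$1 cpus=$2
  echo "MAXIMUM_NUMBER_OF_JOBS=$nodes"
  echo "MAXIMUM_NUMBER_OF_PROCESSORS=$((cpus / nodes))"
}

# Query the hardware only if numactl is installed and reports NUMA nodes.
if command -v numactl >/dev/null 2>&1; then
  nodes=$(numactl -H | awk '/^available:/{print $2}')
  if [ -n "$nodes" ] && [ "$nodes" -gt 0 ] 2>/dev/null; then
    suggest_xds_settings "$nodes" "$(nproc)"
  fi
fi
```

The two printed lines can be pasted into XDS.INP. Whether counting the virtual (hyperthreading) cores pays off should be checked on the actual hardware.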
 
As an alternative to <tt>numactl</tt>, one may use <tt>taskset</tt> or <tt>KMP_AFFINITY</tt>.
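A <tt>taskset</tt> variant might look like the following sketch. It assumes a Linux system that exposes the CPU list of each node under <tt>/sys/devices/system/node</tt>. The helper name <tt>cpus_of_node</tt> is made up for this example and is not part of the XDS distribution.

```shell
#!/bin/bash
# Sketch only: look up the CPUs of a NUMA node for use with taskset.
# 'cpus_of_node' is a hypothetical helper name.
# Assumption: Linux sysfs provides nodeN/cpulist (e.g. "0-7,32-39").

# usage: cpus_of_node NODE [SYSFS_ROOT]
cpus_of_node() {
  local root=${2:-/sys/devices/system/node}
  cat "$root/node$1/cpulist"
}

# Inside the loop of forkcolspot, the numactl line could then become:
#   echo "$itask" | taskset -c "$(cpus_of_node "$inode")" mcolspot_par &
```

Unlike <tt>numactl</tt>, <tt>taskset</tt> only restricts the CPUs a job may run on. Memory placement is then left to the default policy of the kernel.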
 
Using <tt>[https://github.com/RRZE-HPC/likwid/wiki likwid]</tt> instead of <tt>numactl</tt> would give much better control of affinity groups.
 
In my tests on a 4-socket machine, the NUMA-aware scripts reduced the wallclock time by about 8% compared with the original scripts. On a 2-socket machine, I saw an effect of less than 1%. This will depend very much on the specific hardware.
 
 
== Multi-socket machines in a cluster ==
 
In that case, I'd suggest modifying e.g. <tt>forkcolspot_cluster</tt> so that it does not run <tt>mcolspot_par</tt> directly on the remote machine, but instead runs a script on that machine which checks the number of nodes and starts <tt>mcolspot_par</tt> on the right node.
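A minimal sketch of such a remote-side script is given below. The function names <tt>node_for_task</tt> and <tt>run_on_node</tt> are hypothetical, as is the idea of passing the task number and program name as arguments. How <tt>forkcolspot_cluster</tt> would call it is not shown here and would have to be adapted.

```shell
#!/bin/bash
# Sketch only: hypothetical wrapper to be run on each cluster machine.
# Assumptions: numactl is installed on the remote machine, and the task
# number and program name (e.g. mcolspot_par) are passed as arguments.

# usage: node_for_task ITASK NNODES
# Round-robin mapping of the 1-based task number to a node (0..NNODES-1).
node_for_task() {
  echo $(( ($1 - 1) % $2 ))
}

# usage: run_on_node ITASK PROGRAM
run_on_node() {
  local itask=$1 prog=$2 nnodes
  nnodes=$(numactl -H | awk '/^available:/{print $2}')
  echo "$itask" | numactl --cpunodebind="$(node_for_task "$itask" "$nnodes")" "$prog"
}

# Run only when called with exactly two arguments.
if [ "$#" -eq 2 ]; then
  run_on_node "$1" "$2"
fi
```

Mapping by the global task number is the simplest choice, but it can put two jobs on the same node if the task numbers arriving at one machine are not consecutive. A per-machine counter would avoid that.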


== processing compressed data ==