Performance: Difference between revisions

6,222 bytes added ,  19 March 2016
cluster and NUMA
# Hyperthreading (SMT), if available on Intel CPUs, is beneficial. A "virtual" core delivers only about 20% of the performance of a "physical" core, but it comes at no cost: you just have to switch it on in the BIOS of the machine.
# The 64-bit binaries are generally a bit faster than the 32-bit binaries (but that's not specific to XDS). The latter are no longer distributed anyway.
== Cluster ==
In a cluster of computers, one has to modify the <tt>forkcolspot</tt> and <tt>forkintegrate</tt> scripts (which are part of the XDS distribution) as shown in [http://xds.mpimf-heidelberg.mpg.de/html_doc/forkcolspot_cluster forkcolspot_cluster] and [http://xds.mpimf-heidelberg.mpg.de/html_doc/forkintegrate_cluster forkintegrate_cluster]. The placeholder host names "node1" to "node4" in these example scripts have to be replaced with the names of actual computers that are reachable by ssh and have the relevant directories NFS-mounted under the same paths.
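Before editing the scripts, it may help to verify that every node is reachable by ssh and sees the data directory under the same path. The following is a minimal sketch, not part of the XDS distribution: the function name <tt>check_nodes</tt> and the variable <tt>REMOTE_SHELL</tt> are made up for illustration, and it assumes key-based (passwordless) ssh login and directory paths without blanks.

```shell
#!/bin/bash
# Sketch only: check_nodes and REMOTE_SHELL are illustrative names,
# not part of the XDS distribution.
# Assumes key-based ssh login and directory paths without blanks.
check_nodes() {
  # $1: directory that must exist on every node; remaining args: host names
  local dir=$1; shift
  local node failed=0
  for node in "$@"; do
    # Word splitting of the default ssh command is intentional here.
    if ${REMOTE_SHELL:-ssh -o BatchMode=yes -o ConnectTimeout=5} "$node" test -d "$dir"
      then echo "$node: ok"
      else echo "$node: FAILED"; failed=1
    fi
  done
  return $failed
}
# Usage, with the placeholder names of the example scripts:
#   check_nodes "$PWD" node1 node2 node3 node4
```

A node reported as FAILED is either not reachable without a password prompt or does not have the directory mounted under that path.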
== Multi-socket machines ==
Multi-socket machines consist of several nodes, each comprising several CPU cores and some amount of memory. The nodes are connected by specialized hardware (sometimes called interconnect or bus) that transports data between the nodes. Typically, node-local memory is faster to read and write than memory on a different node. This NUMA (non-uniform memory access) setup has consequences for performance when running XDS jobs.
In particular, good performance is obtained if MAXIMUM_NUMBER_OF_JOBS is set to the number of nodes, and MAXIMUM_NUMBER_OF_PROCESSORS is set to the number of CPU cores (physical + virtual) of each node. One then has to take care that each job ends up on its own node. The following scripts do this. Please note that <tt>numactl</tt> has to be installed.
<pre>
#!/bin/bash
#                      forkcolspot
#
# enables  multi-tasking by splitting the COLSPOT step of
# xds into independent jobs. Each job is carried out by the
# Fortran program mcolspot or mcolspot_par started by this
# script as a background process with a different set of
# input parameters.
#
# 'forkcolspot' is called by xds or xds_par in the COLSPOT
# step using the Fortran instruction
# CALL SYSTEM('forkcolspot ntask maxcpu'),
#    ntask  ::total number of jobs
#  maxcpu  ::maximum number of processors used by each job
#
# Clearly, this can only work if forkcolspot, mcolspot, and
# mcolspot_par are correctly installed in the search path
# for executables.
#
# W.Kabsch and K.Rohm    Version Februar 2005
# NOTE: No blanks allowed adjacent to the = signs !!!
# K.Diederichs 3/2016 NUMA affinity added
#export KMP_AFFINITY="verbose"
maxnode=`numactl -H|awk '/available/{print $2-1}'`
#echo highest node is $maxnode
ntask=$1  #total number of jobs
maxcpu=$2 #maximum number of processors used by each job
  #maxcpu=1: use 'mcolspot' (single processor)
  #maxcpu>1: use 'mcolspot_par' (openmp version)
pids=""                    #list of background process ID's
itask=1
inode=0  # initialize inode
while test $itask -le $ntask
do
# KD modification: which node?
  let inode=$inode+1
  if [ $inode -gt $maxnode ]
      then let inode=0
  fi
#end modification
  if [ $maxcpu -gt 1 ]
      then echo "$itask" | numactl --cpunodebind=$inode mcolspot_par &
      else echo "$itask" | mcolspot    &
  fi
  pids="$pids $!"  #append id of the background process just started
  itask=`expr $itask + 1`
done
trap "kill -15 $pids" 2 15  # 2:Control-C; 15:kill
wait  #wait for all background processes issued by this shell
rm -f mcolspot.tmp  #this temporary file was generated by xds
</pre>
<pre>
#!/bin/bash
#                      forkintegrate
#
# enables  multi-tasking by splitting the INTEGRATE step of
# xds into independent jobs. Each job is carried out by the
# Fortran program mintegrate or mintegrate_par started by
# this script as a background process with a different set
# of input parameters.
#
# 'forkintegrate' is called by xds (or xds_par) in the
# INTEGRATE step using the Fortran instruction
# CALL SYSTEM('forkintegrate fframe ni ntask niba0 maxcpu'),
#    fframe ::id number of the first data image
#    ni    ::number of images in the data set
#    ntask  ::total number of jobs
#    niba0  ::minimum number of images in a batch
#    maxcpu ::maximum number of processors used by each job
#
# Clearly, this can only work if forkintegrate, mintegrate,
# and mintegrate_par are correctly installed in the search
# path for executables.
#
# W.Kabsch and K.Rohm    Version Februar 2005
# NOTE: No blanks allowed adjacent to the = signs !!!
# K.Diederichs 3/2016 NUMA affinity added
#export KMP_AFFINITY="verbose"
maxnode=`numactl -H|awk '/available/{print $2-1}'`
#echo highest node is $maxnode
fframe=$1 #id number of the first image
ni=$2    #number of images in the data set
ntask=$3  #total number of jobs
niba0=$4  #minimum number of images in a batch
maxcpu=$5 #maximum number of processors used by each job
  #maxcpu=1: use 'mintegrate' (single processor)
  #maxcpu>1: use 'mintegrate_par' (openmp version)
minitask=$(($ni / $ntask)) #minimum number of images in a job
mtask=$(($ni % $ntask))    #number of jobs with minitask+1 images
pids=""                    #list of background process ID's
nba=0
litask=0
itask=1
inode=0  # initialize inode
while test $itask -le $ntask
do
# KD modification: which node?
  let inode=$inode+1
  if [ $inode -gt $maxnode ]
      then let inode=0
  fi
#end modification
  if [ $itask -gt $mtask ]
      then nitask=$minitask
      else nitask=$(($minitask + 1))
  fi
  fitask=`expr $litask + 1`
  litask=`expr $litask + $nitask`
  if [ $nitask -lt $niba0 ]
      then n=$nitask
      else n=$niba0
  fi
  if [ $n -lt 1 ]
      then n=1
  fi
  nbatask=$(($nitask / $n))
  nba=`expr $nba + $nbatask`
  image1=$(($fframe + $fitask - 1)) #id number of the first image
  if [ $maxcpu -gt 1 ]
      then echo "$image1 $nitask $itask $nbatask" | numactl --cpunodebind=$inode mintegrate_par &
      else echo "$image1 $nitask $itask $nbatask" | mintegrate    &
  fi
  pids="$pids $!"  #append id of the background process just started
  itask=`expr $itask + 1`
done
trap "kill -15 $pids" 2 15  # 2:Control-C; 15:kill
wait  #wait for all background processes issued by this shell
rm -f mintegrate.tmp  #this temporary file was generated by xds
</pre>
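The values recommended above for MAXIMUM_NUMBER_OF_JOBS and MAXIMUM_NUMBER_OF_PROCESSORS can be read off the output of <tt>numactl -H</tt>. The following sketch does this with <tt>awk</tt>; the function name <tt>xds_numa_settings</tt> is made up for illustration, and it assumes that all nodes have the same number of CPUs as node 0 and that the output has the usual "available: N nodes" and "node 0 cpus: ..." lines.

```shell
#!/bin/bash
# Sketch only: xds_numa_settings is an illustrative name.
# Reads 'numactl -H' output on stdin and prints suggested XDS.INP lines.
# Assumes every node has as many logical CPUs as node 0.
xds_numa_settings() {
  awk '
    /^available:/   { jobs = $2 }        # "available: 2 nodes (0-1)"
    /^node 0 cpus:/ { procs = NF - 3 }   # fields after "node 0 cpus:"
    END {
      printf "MAXIMUM_NUMBER_OF_JOBS=%d\n", jobs
      printf "MAXIMUM_NUMBER_OF_PROCESSORS=%d\n", procs
    }'
}
# Usage:
#   numactl -H | xds_numa_settings
```

The CPUs listed by <tt>numactl -H</tt> are logical CPUs, so hyperthreads are counted, which matches the "physical + virtual" recommendation.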
The scripts could be modified to use <tt>[https://github.com/RRZE-HPC/likwid/wiki likwid]</tt> instead of <tt>numactl</tt> which would allow for better control of affinity groups. Alternatively, one may use <tt>taskset</tt> or <tt>KMP_AFFINITY</tt>.
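As a rough sketch of the <tt>taskset</tt> variant: on Linux, the kernel lists the logical CPUs of each node in <tt>/sys/devices/system/node/nodeN/cpulist</tt>, in a format that <tt>taskset -c</tt> accepts. The helper name <tt>node_cpus</tt> and the variable <tt>NODE_SYSFS</tt> below are made up for illustration and are not part of XDS. Like <tt>--cpunodebind</tt>, <tt>taskset</tt> only restricts the CPUs a job may run on; it does not set a memory policy.

```shell
#!/bin/bash
# Sketch only: node_cpus and NODE_SYSFS are illustrative names.
node_cpus() {
  # $1: NUMA node number; prints that node's logical CPUs, e.g. "0-7,16-23"
  cat "${NODE_SYSFS:-/sys/devices/system/node}/node$1/cpulist"
}
# Possible use in forkintegrate, replacing the numactl call:
#   echo "$image1 $nitask $itask $nbatask" | taskset -c "$(node_cpus $inode)" mintegrate_par &
```

Because <tt>taskset</tt> replaces itself with the command it starts, <tt>$!</tt> should still refer to the XDS job, so the <tt>trap</tt> line of the scripts keeps working.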