# Hyperthreading (SMT), if available on Intel CPUs, is beneficial. A "virtual" core has only about 20% of the performance of a "physical" core, but it comes at no cost - you just have to switch it on in the BIOS of the machine.
# The 64-bit binaries are generally a bit faster than the 32-bit binaries (but that is not specific to XDS); the latter are no longer distributed anyway.
== Cluster ==
In a cluster of computers, one has to modify the <tt>forkcolspot</tt> and <tt>forkintegrate</tt> scripts (which are part of the XDS distribution) as shown in [http://xds.mpimf-heidelberg.mpg.de/html_doc/forkcolspot_cluster forkcolspot_cluster] and [http://xds.mpimf-heidelberg.mpg.de/html_doc/forkintegrate_cluster forkintegrate_cluster]. The names of the computers, called "node1" to "node4" in these example scripts, have to be replaced with actual computer names that are reachable by ssh and have the relevant directories NFS-mounted under the same paths.
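The essence of the cluster modification is to distribute jobs over a list of hosts in round-robin fashion. The following is a minimal, hypothetical sketch (the host names, task count, and the modulo-based assignment are illustrative, not taken from the distributed scripts; the real scripts pipe each job's input to the worker program via ssh):

<pre>
#!/bin/bash
# Hypothetical sketch: round-robin assignment of jobs to hosts.
# Assumes the working directory is NFS-mounted under the same
# path on every host; "node1".."node4" are placeholder names.
nodes=(node1 node2 node3 node4)
ntask=8
itask=1
while test $itask -le $ntask
do
   host=${nodes[$(( (itask - 1) % ${#nodes[@]} ))]}
   echo "task $itask -> $host"
   # real script (sketch): echo "$itask" | ssh $host "cd $PWD; mcolspot_par" &
   itask=$(( itask + 1 ))
done
</pre>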
== Multi-socket machines ==
Multi-socket machines consist of several nodes, each comprising several CPUs and some amount of memory. The nodes are connected by specialized hardware (sometimes called interconnect or bus) that transports data between the nodes. Typically, node-local memory is faster to read and write than memory on a different node. This NUMA (non-uniform memory access) setup has consequences for the performance when such machines are used for running XDS jobs.
In particular, good performance is obtained if MAXIMUM_NUMBER_OF_JOBS is chosen as the number of nodes, and MAXIMUM_NUMBER_OF_PROCESSORS is chosen as the number of CPU cores (physical + virtual) of each socket. One then has to take care that each job ends up on its own socket. The following scripts do this. Please note that <tt>numactl</tt> has to be installed.
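For example, on a hypothetical machine with four sockets, each with 16 physical cores plus 16 virtual (SMT) cores, the corresponding XDS.INP lines would be (the numbers are illustrative; adjust them to the output of <tt>numactl -H</tt> on your machine):

<pre>
! illustrative values for a hypothetical 4-socket machine with
! 16 physical + 16 virtual cores per socket
MAXIMUM_NUMBER_OF_JOBS=4
MAXIMUM_NUMBER_OF_PROCESSORS=32
</pre>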
<pre>
#!/bin/bash
# forkcolspot
#
# enables multi-tasking by splitting the COLSPOT step of
# xds into independent jobs. Each job is carried out by the
# Fortran program mcolspot or mcolspot_par started by this
# script as a background process with a different set of
# input parameters.
#
# 'forkcolspot' is called by xds or xds_par in the COLSPOT
# step using the Fortran instruction
# CALL SYSTEM('forkcolspot ntask maxcpu'),
#    ntask  ::total number of jobs
#    maxcpu ::maximum number of processors used by each job
#
# Clearly, this can only work if forkcolspot, mcolspot, and
# mcolspot_par are correctly installed in the search path
# for executables.
#
# W.Kabsch and K.Rohm   Version February 2005
# NOTE: No blanks allowed adjacent to the = signs !!!
# K.Diederichs 3/2016 NUMA affinity added
#export KMP_AFFINITY="verbose"
maxnode=`numactl -H|awk '/available/{print $2-1}'`
#echo highest node is $maxnode
ntask=$1   #total number of jobs
maxcpu=$2  #maximum number of processors used by each job
           #maxcpu=1: use 'mcolspot'     (single processor)
           #maxcpu>1: use 'mcolspot_par' (openmp version)
pids=""    #list of background process ID's
itask=1
inode=0    #initialize inode
while test $itask -le $ntask
do
   # KD modification: which node?
   let inode=$inode+1
   if [ $inode -gt $maxnode ]
      then let inode=0
   fi
   #end modification
   if [ $maxcpu -gt 1 ]
      then echo "$itask" | numactl --cpunodebind=$inode mcolspot_par &
      else echo "$itask" | mcolspot &
   fi
   pids="$pids $!" #append id of the background process just started
   itask=`expr $itask + 1`
done
trap "kill -15 $pids" 2 15   # 2:Control-C; 15:kill
wait   #wait for all background processes issued by this shell
rm -f mcolspot.tmp   #this temporary file was generated by xds
</pre>
<pre>
#!/bin/bash
# forkintegrate
#
# enables multi-tasking by splitting the INTEGRATE step of
# xds into independent jobs. Each job is carried out by the
# Fortran program mintegrate or mintegrate_par started by
# this script as a background process with a different set
# of input parameters.
#
# 'forkintegrate' is called by xds (or xds_par) in the
# INTEGRATE step using the Fortran instruction
# CALL SYSTEM('forkintegrate fframe ni ntask niba0 maxcpu'),
#    fframe ::id number of the first data image
#    ni     ::number of images in the data set
#    ntask  ::total number of jobs
#    niba0  ::minimum number of images in a batch
#    maxcpu ::maximum number of processors used by each job
#
# Clearly, this can only work if forkintegrate, mintegrate,
# and mintegrate_par are correctly installed in the search
# path for executables.
#
# W.Kabsch and K.Rohm   Version February 2005
# NOTE: No blanks allowed adjacent to the = signs !!!
# K.Diederichs 3/2016 NUMA affinity added
#export KMP_AFFINITY="verbose"
maxnode=`numactl -H|awk '/available/{print $2-1}'`
#echo highest node is $maxnode
fframe=$1  #id number of the first image
ni=$2      #number of images in the data set
ntask=$3   #total number of jobs
niba0=$4   #minimum number of images in a batch
maxcpu=$5  #maximum number of processors used by each job
           #maxcpu=1: use 'mintegrate'     (single processor)
           #maxcpu>1: use 'mintegrate_par' (openmp version)
minitask=$(($ni / $ntask))  #minimum number of images in a job
mtask=$(($ni % $ntask))     #number of jobs with minitask+1 images
pids=""    #list of background process ID's
nba=0
litask=0
itask=1
inode=0    #initialize inode
while test $itask -le $ntask
do
   # KD modification: which node?
   let inode=$inode+1
   if [ $inode -gt $maxnode ]
      then let inode=0
   fi
   #end modification
   if [ $itask -gt $mtask ]
      then nitask=$minitask
      else nitask=$(($minitask + 1))
   fi
   fitask=`expr $litask + 1`
   litask=`expr $litask + $nitask`
   if [ $nitask -lt $niba0 ]
      then n=$nitask
      else n=$niba0
   fi
   if [ $n -lt 1 ]
      then n=1
   fi
   nbatask=$(($nitask / $n))
   nba=`expr $nba + $nbatask`
   image1=$(($fframe + $fitask - 1))  #id number of the first image
   if [ $maxcpu -gt 1 ]
      then echo "$image1 $nitask $itask $nbatask" | numactl --cpunodebind=$inode mintegrate_par &
      else echo "$image1 $nitask $itask $nbatask" | mintegrate &
   fi
   pids="$pids $!" #append id of the background process just started
   itask=`expr $itask + 1`
done
trap "kill -15 $pids" 2 15   # 2:Control-C; 15:kill
wait   #wait for all background processes issued by this shell
rm -f mintegrate.tmp   #this temporary file was generated by xds
</pre>
The scripts could be modified to use <tt>[https://github.com/RRZE-HPC/likwid/wiki likwid]</tt> instead of <tt>numactl</tt>, which would allow better control of affinity groups. Alternatively, one may use <tt>taskset</tt> or <tt>KMP_AFFINITY</tt>.
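As a hypothetical illustration of the <tt>taskset</tt> route: instead of binding by node number, one binds each job to the CPU range owned by that node. The CPU numbering below is an assumption (a contiguous range per node); the actual layout must be checked with <tt>numactl -H</tt> or <tt>lscpu</tt>:

<pre>
#!/bin/bash
# Hypothetical sketch: derive the taskset CPU range for one node.
# Assumes each node owns a contiguous block of cpus_per_node CPUs,
# e.g. node 0 -> CPUs 0-7, node 1 -> CPUs 8-15 ("numactl -H" shows
# the real mapping, which may interleave CPUs instead).
inode=1
cpus_per_node=8                      # placeholder; machine-dependent
first=$(( inode * cpus_per_node ))
last=$(( first + cpus_per_node - 1 ))
echo "taskset -c ${first}-${last} mcolspot_par"   # command that would replace the numactl call
</pre>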