Eiger: Difference between revisions

Eiger (view source)

Revision as of 15:48, 22 February 2017

1,706 bytes added , 22 February 2017

→‎Xeon Phi (Knights Landing, KNL)

Kay

@@ Line 48: / Line 48: @@
 === Xeon Phi (Knights Landing, KNL) ===
-The benchmark was run on a single KNL7210 processor (256 cores) set to quadrant mode and using the MCDRAM as cache. The environment variable OMP_PROC_BIND was set to false (if this is not done, the scheduler seems to put all threads on one core). XDS was compiled with the -xMIC-AVX512 option of ifort. This gives
+The benchmark was run on a single KNL7210 processor (64 cores, 256 hardware threads) set to quadrant mode and using the MCDRAM as cache. The environment variable OMP_PROC_BIND was set to false (if this is not done, the scheduler seems to put all threads on one core). XDS was compiled with the -xMIC-AVX512 option of ifort. These benchmarks were performed with a "warm" operating system cache, meaning that the first run of each type was discarded because it had to read all data from disk.
+Deviating from the Xeon benchmark setup (above), BACKGROUND_RANGE was set to a more realistic value of 1 50 (instead of 1 9). The INIT numbers are therefore not comparable.
+This gives
   COLSPOT:         elapsed wall-clock time       48.3 sec
   INTEGRATE: total elapsed wall-clock time       61.2 sec
 when run with MAXIMUM_NUMBER_OF_JOBS=16 and MAXIMUM_NUMBER_OF_PROCESSORS=16. These parameters, as well as the KNL setup, could still be optimized.
-Update Feb 21, 2017 using XDS BUILT=20161205
+Update Feb 21, 2017 using XDS BUILT=20161205, and the CentOS-7.3 default kernel 3.10.0-514.6.1.el7:
   INIT:            elapsed wall-clock time       33.4 sec
   COLSPOT:         elapsed wall-clock time       49.3 sec
   INTEGRATE: total elapsed wall-clock time       59.8 sec
-Now using Dectris' library (v. 20170215) with <code>LIB=/usr/local/lib64/dectris-neggia.so</code>:
+Using, instead of the H5ToXds script, a pre-release library that makes use of the <code>LIB=</code> [http://homes.mpimf-heidelberg.mpg.de/~kabsch/xds/html_doc/xds_parameters.html#LIB= option] of XDS:
   INIT:            elapsed wall-clock time       30.4 sec
   COLSPOT:         elapsed wall-clock time       40.7 sec
@@ Line 65: / Line 68: @@
   COLSPOT:         elapsed wall-clock time       40.0 sec
   INTEGRATE: total elapsed wall-clock time       51.3 sec
-This was running with a 8GB/8GB split MCDRAM. The same run, but with 8 JOBS and 32 PROCESSORS, takes
+This was running with an 8GB/8GB split (''hybrid'') MCDRAM. The same run, but with 8 JOBS and 32 PROCESSORS, takes
   INIT.LP:         elapsed wall-clock time       25.3 sec
   COLSPOT:         elapsed wall-clock time       40.1 sec
   INTEGRATE: total elapsed wall-clock time       53.1 sec
+Back to 16 JOBS and 16 PROCESSORS, but with MCDRAM in ''flat'' mode and <code>numactl --preferred=1 xds_par</code> (thus using all 16GB for arrays, and nothing for cache):
+ INIT.LP:         elapsed wall-clock time       29.5 sec
+ COLSPOT:         elapsed wall-clock time       38.6 sec
+ INTEGRATE: total elapsed wall-clock time       53.2 sec
+Now setting the KNL to SNC4 mode, and the MCDRAM to cache (using it in flat mode is impractical because the --preferred option of numactl accepts only a single node, and determining the correct node requires scripting):
+ INIT.LP:         elapsed wall-clock time       29.6 sec
+ COLSPOT.LP:      elapsed wall-clock time       37.8 sec
+ INTEGRATE: total elapsed wall-clock time       49.6 sec
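The scripting mentioned in the SNC4 remark above could start from something like the following. This is only a minimal sketch, not what was used for the benchmarks: it assumes that in flat mode the MCDRAM shows up as NUMA nodes that have memory but no CPUs, and that <code>numactl --hardware</code> prints one <code>node N cpus:</code> line per node. The function name <code>list_cpuless_nodes</code> is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the IDs of NUMA nodes that have no CPUs.
# Assumption: on a KNL in flat mode these CPU-less nodes are the MCDRAM.
# Reads text in the format of `numactl --hardware` from stdin.
list_cpuless_nodes() {
    # A line "node 4 cpus:" with nothing after the colon has exactly 3 fields.
    awk '$1 == "node" && $3 == "cpus:" && NF == 3 { print $2 }'
}

# Usage on a real machine (not executed here):
#   numactl --hardware | list_cpuless_nodes
```

Each printed node ID is a candidate value for <code>--preferred=</code>. Which of these nodes is nearest to the cores a given job runs on still has to be worked out, for example from the distance table that <code>numactl --hardware</code> prints.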
+Conclusions: since INIT benefits from more PROCESSORs, one could run XDS twice for fastest turnaround: the first run with JOB=XYCORR INIT and a high number of processors (99 is the maximum), the second run with JOB=COLSPOT IDXREF DEFPIX INTEGRATE CORRECT and an optimized JOBS/PROCESSORS combination. The SNC4 mode is indeed fastest. To do better than the cache mode of the MCDRAM, one needs to adapt the forkcolspot and forkintegrate scripts; see [[Performance]].
-Conclusion: since INIT benefits from more PROCESSORs, one could run XDS twice for fastest turnaround; the first run with JOBS=XYCORR INIT and a high number of processors (99 is maximum). The second run with JOB=COLSPOT IDXREF DEFPIX INTEGRATE CORRECT, and an optimized JOBS/PROCESSORS combination.
+For comparison, if these data are stored as CBFs, COLSPOT and INTEGRATE take 34.8 and 45.2 seconds, respectively, in SNC4 mode. However, with a cold cache (i.e. when data are read for the first time), the HDF5 files have an advantage because they are a factor of 2.5 smaller, due to the better compression.
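The two-run approach suggested in the conclusions could be scripted roughly as follows. This is a sketch under assumptions: the helper names <code>set_xds_param</code> and <code>run_two_pass</code> are invented here, XDS.INP is assumed to already contain uncommented JOB= and MAXIMUM_NUMBER_OF_... lines, and the 16/16 values for the second run are simply the combination used in the benchmarks above, not a tuned optimum.

```shell
#!/bin/sh
# Hypothetical helper: replace the value of an existing KEYWORD= line.
#   $1 = input file, $2 = keyword without '=', $3 = new value
# Lines commented out with '!' are left alone; a missing keyword is not added.
set_xds_param() {
    tmp=$(mktemp) || return 1
    sed "s|^[[:space:]]*$2=.*|$2= $3|" "$1" > "$tmp" && mv "$tmp" "$1"
}

# Hypothetical driver: run xds_par twice in the current directory.
run_two_pass() {
    export OMP_PROC_BIND=false   # as set for the benchmarks on this page

    # First run: XYCORR and INIT with a high number of processors.
    set_xds_param XDS.INP JOB "XYCORR INIT"
    set_xds_param XDS.INP MAXIMUM_NUMBER_OF_PROCESSORS 99
    xds_par || return 1

    # Second run: remaining steps with a JOBS/PROCESSORS combination.
    set_xds_param XDS.INP JOB "COLSPOT IDXREF DEFPIX INTEGRATE CORRECT"
    set_xds_param XDS.INP MAXIMUM_NUMBER_OF_JOBS 16
    set_xds_param XDS.INP MAXIMUM_NUMBER_OF_PROCESSORS 16
    xds_par
}

# Usage (not executed here): cd into the processing directory, then
#   run_two_pass
```

Editing XDS.INP in place keeps the sketch short; keeping two separate input files and copying the right one to XDS.INP before each run would work equally well.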
 == Troubleshooting ==
