..
   Copyright
   Markus Wittmann, 2016-2018
   RRZE, University of Erlangen-Nuremberg, Germany
   markus.wittmann -at- fau.de or hpc -at- rrze.fau.de

   Viktor Haag, 2016
   LSS, University of Erlangen-Nuremberg, Germany

   Michael Hussnaetter, 2017-2018
   University of Erlangen-Nuremberg, Germany
   michael.hussnaetter -at- fau.de

   This file is part of the Lattice Boltzmann Benchmark Kernels (LbmBenchKernels).

   LbmBenchKernels is free software: you can redistribute it and/or modify
   it under the terms of the GNU General Public License as published by
   the Free Software Foundation, either version 3 of the License, or
   (at your option) any later version.

   LbmBenchKernels is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
   GNU General Public License for more details.

   You should have received a copy of the GNU General Public License
   along with LbmBenchKernels. If not, see <http://www.gnu.org/licenses/>.

| 29 | .. title:: LBM Benchmark Kernels Documentation |
| 30 | |
| 31 | |
| 32 | **LBM Benchmark Kernels Documentation** |
| 33 | |
| 34 | .. sectnum:: |
| 35 | .. contents:: |
| 36 | |
| 37 | Introduction |
| 38 | ============ |
| 39 | |
| 40 | The lattice Boltzmann (LBM) benchmark kernels are a collection of LBM kernel |
| 41 | implementations. |
| 42 | |
**AS SUCH THE LBM BENCHMARK KERNELS ARE NOT A FULLY EQUIPPED CFD SOLVER AND
SOLELY SERVE THE PURPOSE OF STUDYING POSSIBLE PERFORMANCE OPTIMIZATIONS AND/OR
EXPERIMENTS.**
| 46 | |
| 47 | Currently all kernels utilize a D3Q19 discretization and the |
| 48 | two-relaxation-time (TRT) collision operator [ginzburg-2008]_. |
| 49 | All operations are carried out in double or single precision arithmetic. |
| 50 | |
| 51 | Compilation |
| 52 | =========== |
| 53 | |
The benchmark framework currently supports only Linux systems and the GCC and
Intel compilers. Every other configuration probably requires adjustments inside
the code and the makefiles. Furthermore, some code might be platform specific
or at least POSIX specific.
| 58 | |
The benchmark can be built via ``make`` from the ``src`` subdirectory. This will
generate one binary which hosts all implemented benchmark kernels.
| 61 | |
| 62 | Binaries are located under the ``bin`` subdirectory and will have different names |
| 63 | depending on compiler and build configuration. |
| 64 | |
Compilation can target debug or release builds. For both build types
verification can additionally be enabled, which increases the runtime and
hence is not suited for benchmarking.
| 68 | |
| 69 | |
| 70 | Debug and Verification |
| 71 | ---------------------- |
| 72 | |
| 73 | :: |
| 74 | |
| 75 | make BUILD=debug BENCHMARK=off |
| 76 | |
Running ``make`` with ``BUILD=debug`` builds the debug version of
the benchmark kernels, where no optimizations are performed, line numbers and
debug symbols are included, and ``DEBUG`` is defined. The resulting
binary can be found in the ``bin`` subdirectory and is named
``lbmbenchk-linux-<compiler>-debug``.
| 82 | |
Specifying ``BENCHMARK=off`` turns on verification
(``VERIFICATION=on``), statistics (``STATISTICS=on``), and VTK output
(``VTK_OUTPUT=on``).
| 86 | |
Please note that the generated binary will therefore
exhibit poor performance.
| 89 | |
| 90 | |
| 91 | Release and Verification |
| 92 | ------------------------ |
| 93 | |
Verification with the debug builds can be extremely slow. Hence verification
capabilities can also be built into release builds: ::
| 96 | |
| 97 | make BENCHMARK=off |
| 98 | |
| 99 | |
| 100 | Benchmarking |
| 101 | ------------ |
| 102 | |
To generate a binary for benchmarking, run ``make`` with ::
| 104 | |
| 105 | make |
| 106 | |
By default ``BENCHMARK=on`` and ``BUILD=release`` are set, where
``BUILD=release`` turns optimizations on and ``BENCHMARK=on`` disables
verification, statistics, and VTK output.
| 110 | |
See the Options Summary section below for further options that can be
applied, e.g. ``TARCH``, as well as the Benchmarking section.
| 113 | |
| 114 | Compilers |
| 115 | --------- |
| 116 | |
Currently only the GCC and Intel compilers under Linux are supported. The
configuration can be chosen via ``CONFIG=linux-gcc`` or
``CONFIG=linux-intel``.
| 120 | |
| 121 | |
| 122 | Floating Point Precision |
| 123 | ------------------------ |
| 124 | |
By default double precision data types are used for storing PDFs and floating
point constants. Furthermore, this is the default for the intrinsic kernels.
With the ``PRECISION=sp`` variable this can be changed to single precision. ::

  make PRECISION=sp    # build for single precision kernels

  make PRECISION=dp    # build for double precision kernels (default)
| 132 | |
| 133 | |
| 134 | Cleaning |
| 135 | -------- |
| 136 | |
| 137 | For each configuration and build (debug/release) a subdirectory under the |
| 138 | ``src/obj`` directory is created where the dependency and object files are |
| 139 | stored. |
| 140 | With :: |
| 141 | |
| 142 | make CONFIG=... BUILD=... clean |
| 143 | |
a specific combination is selected and cleaned, whereas with ::
| 145 | |
| 146 | make clean-all |
| 147 | |
| 148 | all object and dependency files are deleted. |
| 149 | |
| 150 | |
| 151 | Options Summary |
| 152 | --------------- |
| 153 | |
| 154 | Options that can be specified when building the suite with make: |
| 155 | |
| 156 | ============= ======================= ============ ========================================================== |
| 157 | name values default description |
| 158 | ============= ======================= ============ ========================================================== |
BENCHMARK     on, off                 on           If enabled, disables VERIFICATION, STATISTICS, and VTK_OUTPUT. If disabled, enables these three options.
| 160 | BUILD debug, release release debug: no optimization, debug symbols, DEBUG defined. release: optimizations enabled. |
| 161 | CONFIG linux-gcc, linux-intel linux-intel Select GCC or Intel compiler. |
| 162 | ISA avx512, avx, sse avx Determines which ISA extension is used for macro definitions of the intrinsics. This is *not* the architecture the compiler generates code for. |
| 163 | OPENMP on, off on OpenMP, i.e. threading support. |
PRECISION     dp, sp                  dp           Floating point precision used for data types, arithmetic, and intrinsics.
STATISTICS    on, off                 off          View statistics, like density etc., during the simulation.
| 166 | TARCH -- -- Via TARCH the architecture the compiler generates code for can be overridden. The value depends on the chosen compiler. |
| 167 | VERIFICATION on, off off Turn verification on/off. |
| 168 | VTK_OUTPUT on, off off Enable/Disable VTK file output. |
| 169 | ============= ======================= ============ ========================================================== |
| 170 | |
**Suboptions for** ``ISA=avx512``
| 172 | |
| 173 | ============================== ======== ======== ====================== |
| 174 | name values default description |
| 175 | ============================== ======== ======== ====================== |
| 176 | ADJ_LIST_MEM_TYPE HBM - Determines memory location of adjacency list array, DRAM or HBM. |
| 177 | PDF_MEM_TYPE HBM - Determines memory location of PDF array, DRAM or HBM. |
| 178 | SOFTWARE_PREFETCH_LOOKAHEAD_L1 int >= 0 0 Software prefetch lookahead of elements into L1 cache, value is multiplied by vector size (``VSIZE``). |
| 179 | SOFTWARE_PREFETCH_LOOKAHEAD_L2 int >= 0 0 Software prefetch lookahead of elements into L2 cache, value is multiplied by vector size (``VSIZE``). |
| 180 | ============================== ======== ======== ====================== |
| 181 | |
Please note that these options require AVX-512 PF support of the target processor.
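
Both lookahead values are given in units of vector-size chunks. The following
is a minimal sketch of the resulting prefetch distance; the ``VSIZE`` of 8
(8 doubles per AVX-512 vector) is our assumption here:

```python
# Prefetch distance in lattice nodes = lookahead * vector size (VSIZE).
# Assumption: VSIZE = 8, i.e. 8 double precision values per 64 byte
# AVX-512 vector.

VSIZE = 8

def prefetch_distance(lookahead):
    """Number of elements the software prefetch runs ahead of the
    current loop index."""
    return lookahead * VSIZE

print(prefetch_distance(2))  # lookahead of 2 -> 16 elements ahead
```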
| 183 | |
| 184 | Invocation |
| 185 | ========== |
| 186 | |
Running the binary will print, alongside the GPL licence header, a line like the following: ::
| 188 | |
| 189 | LBM Benchmark Kernels 0.1, compiled Jul 5 2017 21:59:22, type: verification |
| 190 | |
if verification was enabled during compilation or ::
| 192 | |
| 193 | LBM Benchmark Kernels 0.1, compiled Jul 5 2017 21:59:22, type: benchmark |
| 194 | |
if verification was disabled during compilation.
| 196 | |
| 197 | Command Line Parameters |
| 198 | ----------------------- |
| 199 | |
Running the binary with ``-h`` lists all available parameters: ::
| 201 | |
| 202 | Usage: |
| 203 | ./lbmbenchk -list |
| 204 | ./lbmbenchk |
| 205 | [-dims XxYxZ] [-geometry box|channel|pipe|blocks[-<block size>]] [-iterations <iterations>] [-lattice-dump-ascii] |
      [-rho-in <density>] [-rho-out <density>] [-omega <omega>] [-kernel <kernel>]
| 207 | [-periodic-x] |
| 208 | [-t <number of threads>] |
| 209 | [-pin core{,core}*] |
| 210 | [-verify] |
| 211 | -- <kernel specific parameters> |
| 212 | |
| 213 | -list List available kernels. |
| 214 | |
| 215 | -dims XxYxZ Specify geometry dimensions. |
| 216 | |
| 217 | -geometry blocks-<block size> |
      Geometry with blocks of size <block size> regularly laid out.
| 219 | |
| 220 | |
If an option is specified multiple times the last one overrides the previous
ones. This also holds true for ``-verify``, which sets geometry dimensions,
iterations, etc., which can afterwards be overridden, e.g.: ::

  $ bin/lbmbenchk-linux-intel-release-dp -verify -dims 32x32x32
| 226 | |
| 227 | Kernel specific parameters can be obtained via selecting the specific kernel |
| 228 | and passing ``-h`` as parameter: :: |
| 229 | |
| 230 | $ bin/lbmbenchk-linux-intel-release-dp -kernel kernel-name -- -h |
| 231 | ... |
| 232 | Kernel parameters: |
| 233 | [-blk <n>] [-blk-[xyz] <n>] |
| 234 | |
| 235 | |
| 236 | A list of all available kernels can be obtained via ``-list``: :: |
| 237 | |
| 238 | $ ../bin/lbmbenchk-linux-gcc-debug-dp -list |
| 239 | Lattice Boltzmann Benchmark Kernels (LbmBenchKernels) Copyright (C) 2016, 2017 LSS, RRZE |
| 240 | This program comes with ABSOLUTELY NO WARRANTY; for details see LICENSE. |
| 241 | This is free software, and you are welcome to redistribute it under certain conditions. |
| 242 | |
| 243 | LBM Benchmark Kernels 0.1, compiled Jul 5 2017 21:59:22, type: verification |
| 244 | Available kernels to benchmark: |
| 245 | list-aa-pv-soa |
| 246 | list-aa-ria-soa |
| 247 | list-aa-soa |
| 248 | list-aa-aos |
| 249 | list-pull-split-nt-1s-soa |
| 250 | list-pull-split-nt-2s-soa |
| 251 | list-push-soa |
| 252 | list-push-aos |
| 253 | list-pull-soa |
| 254 | list-pull-aos |
| 255 | push-soa |
| 256 | push-aos |
| 257 | pull-soa |
| 258 | pull-aos |
| 259 | blk-push-soa |
| 260 | blk-push-aos |
| 261 | blk-pull-soa |
| 262 | blk-pull-aos |
| 263 | |
| 264 | Kernels |
| 265 | ------- |
| 266 | |
| 267 | The following list shortly describes available kernels: |
| 268 | |
- **push-soa/push-aos/pull-soa/pull-aos**:
  Unoptimized kernels (but stream and collide are already fused) using two
  grids as source and destination. They implement push/pull semantics as well
  as structure of arrays (soa) or array of structures (aos) data layout.
| 273 | |
| 274 | - **blk-push-soa/blk-push-aos/blk-pull-soa/blk-pull-aos**: |
| 275 | The same as the unoptimized kernels without the blk prefix, except that they support |
| 276 | spatial blocking, i.e. loop blocking of the three loops used to iterate over |
| 277 | the lattice. Here manual work sharing for OpenMP is used. |
| 278 | |
- **aa-aos/aa-soa**:
  Straightforward implementation of the AA pattern on the full array with
  blocking support. Manual work sharing for OpenMP is used. The domain is
  partitioned only along the x dimension.
| 282 | |
| 283 | - **aa-vec-soa/aa-vec-sl-soa**: |
| 284 | Optimized AA kernel with intrinsics on full array. aa-vec-sl-soa uses only |
| 285 | one loop for iterating over the lattice instead of three nested ones. |
| 286 | |
- **list-push-soa/list-push-aos/list-pull-soa/list-pull-aos**:
  The same as the unoptimized kernels without the list prefix, but with
  indirect addressing. Here only a 1D vector is used to store the fluid
  nodes, omitting the obstacles. An adjacency list is used to recover the
  neighborhood associations.
| 291 | |
- **list-pull-split-nt-1s-soa/list-pull-split-nt-2s-soa**:
  Optimized variants of list-pull-soa. Chunks of the lattice are processed at
  once. Postcollision values are written back via nontemporal stores in 18
  (1s) or 9 (2s) loops.
| 296 | |
- **list-aa-aos/list-aa-soa**:
  Unoptimized implementation of the AA pattern for the 1D vector with
  adjacency list. Array of structures (aos) and structure of arrays (soa)
  data layouts are supported.
| 301 | |
- **list-aa-ria-soa**:
  Implementation of the AA pattern with intrinsics for the 1D vector with
  adjacency list. Furthermore, it contains a vectorized even time step and
  run length coding to reduce the loop balance of the odd time step.
| 306 | |
- **list-aa-pv-soa**:
  All optimizations of list-aa-ria-soa. Additionally, the odd time step is
  partially vectorized.
| 310 | |
| 311 | |
Note that all array of structures (aos) kernels might require blocking
(depending on the domain size) to reach the performance of their structure of
arrays (soa) counterparts.
| 315 | |
The following table summarizes the properties of the kernels. Here **D** means
direct addressing, i.e. full array, **I** means indirect addressing, i.e. 1D
vector with adjacency list, **x** means supported, whereas **--** means
unsupported. The loop balance B_l is computed for the D3Q19 model with
**double precision** floating point for the PDFs (8 byte) and 4 byte integers
for the index (adjacency list).
As list-aa-ria-soa and list-aa-pv-soa support run length coding, their
effective loop balance depends on the geometry. The effective loop balance is
printed during each run.
| 324 | |
| 325 | |
| 326 | ====================== =========== =========== ===== ======== ======== ============ |
| 327 | kernel name prop. step data layout addr. parallel blocking B_l [B/FLUP] |
| 328 | ====================== =========== =========== ===== ======== ======== ============ |
| 329 | push-soa OS SoA D x -- 456 |
| 330 | push-aos OS AoS D x -- 456 |
| 331 | pull-soa OS SoA D x -- 456 |
| 332 | pull-aos OS AoS D x -- 456 |
| 333 | blk-push-soa OS SoA D x x 456 |
| 334 | blk-push-aos OS AoS D x x 456 |
| 335 | blk-pull-soa OS SoA D x x 456 |
| 336 | blk-pull-aos OS AoS D x x 456 |
| 337 | aa-soa AA SoA D x x 304 |
| 338 | aa-aos AA AoS D x x 304 |
| 339 | aa-vec-soa AA SoA D x x 304 |
| 340 | aa-vec-sl-soa AA SoA D x x 304 |
| 341 | list-push-soa OS SoA I x x 528 |
| 342 | list-push-aos OS AoS I x x 528 |
| 343 | list-pull-soa OS SoA I x x 528 |
| 344 | list-pull-aos OS AoS I x x 528 |
| 345 | list-pull-split-nt-1s OS SoA I x x 376 |
| 346 | list-pull-split-nt-2s OS SoA I x x 376 |
| 347 | list-aa-soa AA SoA I x x 340 |
| 348 | list-aa-aos AA AoS I x x 340 |
| 349 | list-aa-ria-soa AA SoA I x x 304-342 |
| 350 | list-aa-pv-soa AA SoA I x x 304-342 |
| 351 | ====================== =========== =========== ===== ======== ======== ============ |
| 352 | |
| 353 | Benchmarking |
| 354 | ============ |
| 355 | |
Correct benchmarking is a nontrivial task. Whenever benchmark results are to
be created, make sure the binary was compiled with:

- ``BENCHMARK=on`` (default if not overridden) and
- ``BUILD=release`` (default if not overridden) and
- the correct ISA for macros (i.e. intrinsics), selected via ``ISA``, and
- ``TARCH`` specifying the architecture the compiler generates code for.
| 363 | |
| 364 | Intel Compiler |
| 365 | -------------- |
| 366 | |
For the Intel compiler one can specify, depending on the target ISA extension:
| 368 | |
| 369 | - SSE: ``TARCH=-xSSE4.2`` |
| 370 | - AVX: ``TARCH=-xAVX`` |
| 371 | - AVX2 and FMA: ``TARCH=-xCORE-AVX2,-fma`` |
| 372 | - AVX512: ``TARCH=-xCORE-AVX512`` |
| 373 | - KNL: ``TARCH=-xMIC-AVX512`` |
| 374 | |
| 375 | Compiling for an architecture supporting AVX (Sandy Bridge, Ivy Bridge): :: |
| 376 | |
| 377 | make ISA=avx TARCH=-xAVX |
| 378 | |
| 379 | |
| 380 | Compiling for an architecture supporting AVX2 (Haswell, Broadwell): :: |
| 381 | |
| 382 | make ISA=avx TARCH=-xCORE-AVX2,-fma |
| 383 | |
WARNING: ``ISA`` is here still set to ``avx``, as currently the FMA intrinsics
are not implemented. This might change in the future.
| 386 | |
| 387 | |
| 388 | .. TODO: add isa=avx512 and add docu for knl |
| 389 | |
| 390 | .. TODO: kein prefetching wenn AVX-512 PF nicht unterstuetz wird |
| 391 | |
| 392 | Compiling for an architecture supporting AVX-512 (Skylake): :: |
| 393 | |
| 394 | make ISA=avx512 TARCH=-xCORE-AVX512 |
| 395 | |
Please note that for the AVX-512 gather kernels software prefetching for the
gather instructions is disabled by default.
| 398 | To enable it set ``SOFTWARE_PREFETCH_LOOKAHEAD_L1`` and/or |
| 399 | ``SOFTWARE_PREFETCH_LOOKAHEAD_L2`` to a value greater than ``0`` during |
| 400 | compilation. Note that this requires AVX-512 PF support from the target |
| 401 | processor. |
| 402 | |
| 403 | Compiling for MIC architecture KNL supporting AVX-512 and AVX-512 PF:: |
| 404 | |
| 405 | make ISA=avx512 TARCH=-xMIC-AVX512 |
| 406 | |
| 407 | or optionally with software prefetch enabled:: |
| 408 | |
| 409 | make ISA=avx512 TARCH=-xMIC-AVX512 SOFTWARE_PREFETCH_LOOKAHEAD_L1=<value> SOFTWARE_PREFETCH_LOOKAHEAD_L2=<value> |
| 410 | |
| 411 | |
| 412 | |
| 413 | |
| 414 | Pinning |
| 415 | ------- |
| 416 | |
During benchmarking pinning should be used via the ``-pin`` parameter. Running
a benchmark with 10 threads and pinning them to the first 10 cores works like: ::
| 419 | |
| 420 | $ bin/lbmbenchk-linux-intel-release-dp ... -t 10 -pin $(seq -s , 0 9) |
| 421 | |
| 422 | |
| 423 | General Remarks |
| 424 | --------------- |
| 425 | |
Things the binary does not check or control:
| 427 | |
- transparent huge pages: when allocating memory, small 4 KiB pages might be
  replaced with larger ones. This is in general a good thing, but whether
  this really happens depends on the system settings (check e.g. the status
  of ``/sys/kernel/mm/transparent_hugepage/enabled``).
  Currently ``madvise(MADV_HUGEPAGE)`` is used for allocations which are
  aligned to a 4 KiB page, which should be the case for the lattices.
  This should result in huge pages unless THP is disabled on the machine.
  (NOTE: madvise() is used if ``HAVE_HUGE_PAGES`` is defined, which is
  currently hard coded in ``Memory.c``).
| 437 | |
| 438 | - CPU/core frequency: For reproducible results the frequency of all cores |
| 439 | should be fixed. |
| 440 | |
| 441 | - NUMA placement policy: The benchmark assumes a first touch policy, which |
| 442 | means the memory will be placed at the NUMA domain the touching core is |
| 443 | associated with. If a different policy is in place or the NUMA domain to be |
| 444 | used is already full memory might be allocated in a remote domain. Accesses |
| 445 | to remote domains typically have a higher latency and lower bandwidth. |
| 446 | |
- System load: interference with other applications, especially on desktop
  systems, should be avoided.
| 449 | |
| 450 | - Padding: For SoA based kernels the number of (fluid) nodes is automatically |
| 451 | adjusted so that no cache or TLB thrashing should occur. The parameters are |
| 452 | optimized for current Intel based systems. For more details look into the |
| 453 | padding section. |
| 454 | |
- CPU dispatcher function: the compiler might add different versions of a
  function for different ISA extensions. Make sure the code you think is
  executed is actually the code being executed.
| 458 | |
| 459 | Padding |
| 460 | ------- |
| 461 | |
With correct padding cache and TLB thrashing can be avoided. For this purpose
the number of (fluid) nodes used in the data layout is artificially increased.
| 464 | |
| 465 | Currently automatic padding is active for kernels which support it. It can be |
| 466 | controlled via the kernel parameter (i.e. parameter after the ``--``) |
| 467 | ``-pad``. Supported values are ``auto`` (default), ``no`` (to disable padding), |
| 468 | or a manual padding. |
| 469 | |
| 470 | Automatic padding tries to avoid cache and TLB thrashing and pads for a 32 |
| 471 | entry (huge pages) TLB with 8 sets and a 512 set (L2) cache. This reflects the |
| 472 | parameters of current Intel based processors. |
| 473 | |
Manual padding is done via a padding string and has the format
``mod_1+offset_1(,mod_n+offset_n)``, which specifies numbers of bytes.
SoA data layouts can exhibit TLB thrashing. Therefore we want to distribute the
19 pages of one lattice (36 with two lattices) we are concurrently accessing
over as many sets in the TLB as possible.
This is controlled by the distance between the accessed pages, which is the
number of (fluid) nodes in between them and can be adjusted by adding further
(fluid) nodes.
We want the distance d (in bytes) between two accessed pages to be e.g.
**d % (PAGE_SIZE * TLB_SETS) = PAGE_SIZE**.
This would distribute the pages evenly over the sets. Hereby **PAGE_SIZE * TLB_SETS**
would be our ``mod_1`` and **PAGE_SIZE** (after the =) our ``offset_1``.
Measurements show that with only a quarter or half of a page size as offset
higher performance is achieved, which is what automatic padding does.
On top of this padding further paddings can be added. They are just appended
to the padding string, separated by commas.
| 490 | |
A zero modulus in the padding string has a special meaning. Here the
corresponding offset is just added to the number of nodes. A padding string
like ``-pad 0+16`` would add a static padding of two nodes (one node = 8 byte).
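
To illustrate the arithmetic behind the modulus/offset rule, the following
Python sketch (a hypothetical helper, not part of the suite) computes how many
nodes of padding realize the condition described above, assuming 4 KiB pages,
8 TLB sets, and 8 byte nodes:

```python
# Hypothetical sketch of the padding arithmetic described above.
# Assumptions: 4 KiB pages, 8 TLB sets, one node = 8 byte (double
# precision PDF component).

PAGE_SIZE = 4096
TLB_SETS = 8
NODE_BYTES = 8

def padding_nodes(n_nodes, offset=PAGE_SIZE // 4):
    """Number of nodes to append so that the byte distance
    d = (n_nodes + pad) * NODE_BYTES between two concurrently accessed
    pages fulfills d % (PAGE_SIZE * TLB_SETS) == offset."""
    mod = PAGE_SIZE * TLB_SETS
    pad_bytes = (offset - n_nodes * NODE_BYTES) % mod
    return pad_bytes // NODE_BYTES

n = 500 * 100 * 100                       # e.g. a 500x100x100 lattice
pad = padding_nodes(n)
d = (n + pad) * NODE_BYTES
print(pad, d % (PAGE_SIZE * TLB_SETS))    # distance now hits the offset
```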
| 494 | |
| 495 | |
| 496 | Geometries |
| 497 | ========== |
| 498 | |
| 499 | TODO: supported geometries: channel, pipe, blocks, fluid |
| 500 | |
| 501 | |
| 502 | Performance Results |
| 503 | =================== |
| 504 | |
This section lists performance values measured on several machines for
different kernels and geometries with **double precision** floating point
data/arithmetic.
The **RFM** column denotes the expected performance as predicted by the
Roofline performance model [williams-2008]_.
For the performance prediction of each kernel a memory bandwidth benchmark is
used which mimics the kernel's memory access pattern and the kernel's loop
balance (see [kernels]_ for details).
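
The prediction itself is a simple bandwidth-over-loop-balance quotient. As a
hedged sketch (which bandwidth benchmark is matched with which kernel is our
assumption here; the suite's model may differ in details):

```python
# Roofline estimate for bandwidth-bound kernels:
# performance [MFLUP/s] = memory bandwidth / loop balance.

def roofline_mflups(bandwidth_gb_s, loop_balance_b_flup):
    """Predicted performance in million fluid lattice updates per second."""
    return bandwidth_gb_s * 1e9 / loop_balance_b_flup / 1e6

# Example with the Ivy Bridge numbers below: in-place AA kernels
# (B_l = 304 B/FLUP) together with the update-19 bandwidth of 37.4 GB/s.
print(round(roofline_mflups(37.4, 304)))  # about 123 MFLUP/s
```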
| 512 | |
| 513 | Machine Specifications |
| 514 | ---------------------- |
| 515 | |
| 516 | **Ivy Bridge, Intel Xeon E5-2660 v2** |
| 517 | |
| 518 | - Ivy Bridge architecture, AVX |
| 519 | - 10 cores, 2.2 GHz |
| 520 | - SMT enabled |
- memory bandwidth:
| 522 | |
| 523 | - copy-19 32.7 GB/s |
| 524 | - copy-19-nt-sl 35.6 GB/s |
| 525 | - update-19 37.4 GB/s |
| 526 | |
| 527 | **Haswell, Intel Xeon E5-2695 v3** |
| 528 | |
| 529 | - Haswell architecture, AVX2, FMA |
| 530 | - 14 cores, 2.3 GHz |
| 531 | - 2 x 7 cores in cluster-on-die (CoD) mode enabled |
| 532 | - SMT enabled |
| 533 | - memory bandwidth: |
| 534 | |
| 535 | - copy-19 47.3 GB/s |
| 536 | - copy-19-nt-sl 47.1 GB/s |
| 537 | - update-19 44.0 GB/s |
| 538 | |
| 539 | |
| 540 | **Broadwell, Intel Xeon E5-2630 v4** |
| 541 | |
| 542 | - Broadwell architecture, AVX2, FMA |
| 543 | - 10 cores, 2.2 GHz |
| 544 | - SMT disabled |
| 545 | - memory bandwidth: |
| 546 | |
| 547 | - copy-19 48.0 GB/s |
| 548 | - copy-nt-sl-19 48.2 GB/s |
| 549 | - update-19 51.1 GB/s |
| 550 | |
| 551 | **Skylake, Intel Xeon Gold 6148** |
| 552 | |
| 553 | NOTE: currently we only use AVX2 intrinsics. |
| 554 | |
| 555 | - Skylake server architecture, AVX2, AVX512, 2 FMA units |
| 556 | - 20 cores, 2.4 GHz |
| 557 | - SMT enabled |
| 558 | - memory bandwidth: |
| 559 | |
| 560 | - copy-19 89.7 GB/s |
| 561 | - copy-19-nt-sl 92.4 GB/s |
| 562 | - update-19 93.6 GB/s |
| 563 | |
| 564 | **Zen, AMD EPYC 7451** |
| 565 | |
| 566 | - Zen architecture, AVX2, FMA |
| 567 | - 24 cores, 2.3 GHz |
| 568 | - SMT enabled |
| 569 | - memory bandwidth: |
| 570 | |
| 571 | - copy-19 111.9 GB/s |
| 572 | - copy-19-nt-sl 111.7 GB/s |
| 573 | - update-19 109.2 GB/s |
| 574 | |
| 575 | **Zen, AMD Ryzen 7 1700X** |
| 576 | |
| 577 | - Zen architecture, AVX2, FMA |
| 578 | - 8 cores, 3.4 GHz |
| 579 | - SMT enabled |
| 580 | - memory bandwidth: |
| 581 | |
| 582 | - copy-19 27.2 GB/s |
| 583 | - copy-19-nt-sl 27.1 GB/s |
| 584 | - update-19 26.1 GB/s |
| 585 | |
| 586 | Single Socket Results |
| 587 | --------------------- |
| 588 | |
| 589 | - Geometry dimensions are for all measurements 500x100x100 nodes. |
| 590 | - Note the **different scaling on the y axis** of the plots! |
| 591 | |
| 592 | .. |perf_emmy_dp| image:: images/benchmark-emmy-dp.png |
| 593 | :scale: 50 % |
| 594 | .. |perf_emmy_sp| image:: images/benchmark-emmy-sp.png |
| 595 | :scale: 50 % |
| 596 | .. |perf_hasep1_dp| image:: images/benchmark-hasep1-dp.png |
| 597 | :scale: 50 % |
| 598 | .. |perf_hasep1_sp| image:: images/benchmark-hasep1-sp.png |
| 599 | :scale: 50 % |
| 600 | .. |perf_meggie_dp| image:: images/benchmark-meggie-dp.png |
| 601 | :scale: 50 % |
| 602 | .. |perf_meggie_sp| image:: images/benchmark-meggie-sp.png |
| 603 | :scale: 50 % |
| 604 | .. |perf_skylakesp2_dp| image:: images/benchmark-skylakesp2-dp.png |
| 605 | :scale: 50 % |
| 606 | .. |perf_skylakesp2_sp| image:: images/benchmark-skylakesp2-sp.png |
| 607 | :scale: 50 % |
| 608 | .. |perf_summitridge1_dp| image:: images/benchmark-summitridge1-dp.png |
| 609 | :scale: 50 % |
| 610 | .. |perf_summitridge1_sp| image:: images/benchmark-summitridge1-sp.png |
| 611 | :scale: 50 % |
| 612 | .. |perf_naples1_dp| image:: images/benchmark-naples1-dp.png |
| 613 | :scale: 50 % |
| 614 | .. |perf_naples1_sp| image:: images/benchmark-naples1-sp.png |
| 615 | :scale: 50 % |
| 616 | |
| 617 | .. list-table:: |
| 618 | |
| 619 | * - Ivy Bridge, Intel Xeon E5-2660 v2, Double Precision |
| 620 | * - |perf_emmy_dp| |
| 621 | * - Ivy Bridge, Intel Xeon E5-2660 v2, Single Precision |
| 622 | * - |perf_emmy_sp| |
| 623 | * - Haswell, Intel Xeon E5-2695 v3, Double Precision |
| 624 | * - |perf_hasep1_dp| |
| 625 | * - Haswell, Intel Xeon E5-2695 v3, Single Precision |
| 626 | * - |perf_hasep1_sp| |
| 627 | * - Broadwell, Intel Xeon E5-2630 v4, Double Precision |
| 628 | * - |perf_meggie_dp| |
| 629 | * - Broadwell, Intel Xeon E5-2630 v4, Single Precision |
| 630 | * - |perf_meggie_sp| |
| 631 | * - Skylake, Intel Xeon Gold 6148, Double Precision, **NOTE: currently we only use AVX2 intrinsics.** |
| 632 | * - |perf_skylakesp2_dp| |
| 633 | * - Skylake, Intel Xeon Gold 6148, Single Precision, **NOTE: currently we only use AVX2 intrinsics.** |
| 634 | * - |perf_skylakesp2_sp| |
| 635 | * - Zen, AMD Ryzen 7 1700X, Double Precision |
| 636 | * - |perf_summitridge1_dp| |
| 637 | * - Zen, AMD Ryzen 7 1700X, Single Precision |
| 638 | * - |perf_summitridge1_sp| |
| 639 | * - Zen, AMD EPYC 7451, Double Precision |
| 640 | * - |perf_naples1_dp| |
| 641 | * - Zen, AMD EPYC 7451, Single Precision |
| 642 | * - |perf_naples1_sp| |
| 643 | |
| 644 | |
| 645 | Licence |
| 646 | ======= |
| 647 | |
| 648 | The Lattice Boltzmann Benchmark Kernels are licensed under GPLv3. |
| 649 | |
| 650 | |
| 651 | Acknowledgements |
| 652 | ================ |
| 653 | |
| 654 | This work was funded by BMBF, grant no. 01IH15003A (project SKAMPY). |
| 655 | |
| 656 | This work was funded by KONWHIR project OMI4PAPS. |
| 657 | |
| 658 | |
| 659 | Bibliography |
| 660 | ============ |
| 661 | |
| 662 | .. [ginzburg-2008] |
| 663 | I. Ginzburg, F. Verhaeghe, and D. d'Humières. |
| 664 | Two-relaxation-time lattice Boltzmann scheme: About parametrization, velocity, pressure and mixed boundary conditions. |
| 665 | Commun. Comput. Phys., 3(2):427-478, 2008. |
| 666 | |
| 667 | .. [williams-2008] |
| 668 | S. Williams, A. Waterman, and D. Patterson. |
| 669 | Roofline: an insightful visual performance model for multicore architectures. |
| 670 | Commun. ACM, 52(4):65-76, Apr 2009. doi:10.1145/1498765.1498785 |
| 671 | |
| 672 | |
| 673 | .. |datetime| date:: %Y-%m-%d %H:%M |
| 674 | |
| 675 | Document was generated at |datetime|. |
| 676 | |