doc/main.rst

   1
   2 | Copyright
   3 |   Markus Wittmann, 2016-2018
   4 |   RRZE, University of Erlangen-Nuremberg, Germany
   5 |   markus.wittmann -at- fau.de or hpc -at- rrze.fau.de
   6 |
   7 |   Viktor Haag, 2016
   8 |   LSS, University of Erlangen-Nuremberg, Germany
   9 |
  10 | This file is part of the Lattice Boltzmann Benchmark Kernels (LbmBenchKernels).
  11 |
  12 | LbmBenchKernels is free software: you can redistribute it and/or modify
  13 | it under the terms of the GNU General Public License as published by
  14 | the Free Software Foundation, either version 3 of the License, or
  15 | (at your option) any later version.
  16 |
  17 | LbmBenchKernels is distributed in the hope that it will be useful,
  18 | but WITHOUT ANY WARRANTY; without even the implied warranty of
  19 | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
  20 | GNU General Public License for more details.
  21 |
  22 | You should have received a copy of the GNU General Public License
  23 | along with LbmBenchKernels.  If not, see <http://www.gnu.org/licenses/>.
  24
  25 .. title:: LBM Benchmark Kernels Documentation
  26
  27
  28 **LBM Benchmark Kernels Documentation**
  29
  30 .. sectnum::
  31 .. contents::
  32
  33 Introduction
  34 ============
  35
  36 The lattice Boltzmann (LBM) benchmark kernels are a collection of LBM kernel
  37 implementations.
  38
  39 **AS SUCH THE LBM BENCHMARK KERNELS ARE NO FULLY EQUIPPED CFD SOLVER AND SOLELY
  40 SERVES THE PURPOSE OF STUDYING POSSIBLE PERFORMANCE OPTIMIZATIONS AND/OR
  41 EXPERIMENTS.**
  42
  43 Currently all kernels utilize a D3Q19 discretization and the
  44 two-relaxation-time (TRT) collision operator [ginzburg-2008]_.
  45 All operations are carried out in double or single precision arithmetic.
  46
  47 Compilation
  48 ===========
  49
  50 The benchmark framework currently supports only Linux systems and the GCC and
  51 Intel compilers. Every other configuration probably requires adjustment inside
  52 the code and the makefiles. Furthermore some code might be platform or at least
  53 POSIX specific.
  54
  55 The benchmark can be build via ``make`` from the ``src`` subdirectory. This will
  56 generate one binary which hosts all implemented benchmark kernels.
  57
  58 Binaries are located under the ``bin`` subdirectory and will have different names
  59 depending on compiler and build configuration.
  60
  61 Compilation can target debug or release builds. Combined with both build types
  62 verification can be enabled, which increases the runtime and hence is not
  63 suited for benchmarking.
  64
  65
  66 Debug and Verification
  67 ----------------------
  68
  69 ::
  70
  71   make BUILD=debug BENCHMARK=off
  72
  73 Running ``make`` with ``BUILD=debug`` builds the debug version of
  74 the benchmark kernels, where no optimizations are performed,  line numbers and
  75 debug symbols are included as well as ``DEBUG`` will be defined.  The resulting
  76 binary will be found in the ``bin`` subdirectory and named
  77 ``lbmbenchk-linux-<compiler>-debug``.
  78
  79 Specifying ``BENCHMARK=off`` turns on verification
  80 (``VERIFICATION=on``), statistics (``STATISTICS=on``), and VTK output
  81 (``VTK_OUTPUT=on``) enabled.
  82
  83 Please note that the generated binary will therefore
  84 exhibit a poor performance.
  85
  86
  87 Release and Verification
  88 ------------------------
  89
  90 Verification with the debug builds can be extremely slow. Hence verification
  91 capabilities can be build with release builds: ::
  92
  93   make BENCHMARK=off
  94
  95
  96 Benchmarking
  97 ------------
  98
  99 To generate a binary for benchmarking run make with ::
 100
 101   make
 102
 103 As default ``BENCHMARK=on`` and ``BUILD=release`` is set, where
 104 ``BUILD=release`` turns optimizations on and ``BENCHMARK=on`` disables
 105 verfification, statistics, and VTK output.
 106
 107 See Options Summary below for further description of options which can be
 108 applied, e.g. TARCH as well as the Benchmarking section.
 109
 110 Compilers
 111 ---------
 112
 113 Currently only the GCC and Intel compiler under Linux are supported. Between
 114 both configuration can be chosen via ``CONFIG=linux-gcc`` or
 115 ``CONFIG=linux-intel``.
 116
 117
 118 Floating Point Precision
 119 ------------------------
 120
 121 As default double precision data types are used for storing PDFs and floating
 122 point constants. Furthermore, this is the default for the intrincis kernels.
 123 With the ``PRECISION=sp`` variable this can be changed to single precision. ::
 124
 125   make PRECISION=sp   # build for single precision kernels
 126
 127   make PRECISION=dp   # build for double precision kernels (defalt)
 128
 129
 130 Cleaning
 131 --------
 132
 133 For each configuration and build (debug/release) a subdirectory under the
 134 ``src/obj`` directory is created where the dependency and object files are
 135 stored.
 136 With ::
 137
 138   make CONFIG=... BUILD=... clean
 139
 140 a specific combination is select and cleaned, whereas with ::
 141
 142   make clean-all
 143
 144 all object and dependency files are deleted.
 145
 146
 147 Options Summary
 148 ---------------
 149
 150 Options that can be specified when building the suite with make:
 151
 152 ============= ======================= ============ ==========================================================
 153 name          values                  default      description
 154 ============= ======================= ============ ==========================================================
 155 BENCHMARK     on, off                 on           If enabled, disables VERIFICATION, STATISTICS, VTK_OUTPUT. If disabled enables the three former options.
 156 BUILD         debug, release          release      debug: no optimization, debug symbols, DEBUG defined. release: optimizations enabled.
 157 CONFIG        linux-gcc, linux-intel  linux-intel  Select GCC or Intel compiler.
 158 ISA           avx, sse                avx          Determines which ISA extension is used for macro definitions of the intrinsics. This is *not* the architecture the compiler generates code for.
 159 OPENMP        on, off                 on           OpenMP, i.\,e.\. threading support.
 160 PRECISION     dp, sp                  dp           Floating point precision used for data type, arithmetic, and intrincics.
 161 STATISTICS    on, off                 off          View statistics, like density etc, during simulation.
 162 TARCH         --                      --           Via TARCH the architecture the compiler generates code for can be overridden. The value depends on the chosen compiler.
 163 VERIFICATION  on, off                 off          Turn verification on/off.
 164 VTK_OUTPUT    on, off                 off          Enable/Disable VTK file output.
 165 ============= ======================= ============ ==========================================================
 166
 167 Invocation
 168 ==========
 169
 170 Running the binary will print among the GPL licence header a line like the following: ::
 171
 172   LBM Benchmark Kernels 0.1, compiled Jul  5 2017 21:59:22, type: verification
 173
 174 if verfication was enabled during compilation or ::
 175
 176   LBM Benchmark Kernels 0.1, compiled Jul  5 2017 21:59:22, type: benchmark
 177
 178 if verfication was disabled during compilation.
 179
 180 Command Line Parameters
 181 -----------------------
 182
 183 Running the binary with ``-h`` list all available parameters: ::
 184
 185   Usage:
 186   ./lbmbenchk -list
 187   ./lbmbenchk
 188       [-dims XxYyZ] [-geometry box|channel|pipe|blocks[-<block size>]] [-iterations <iterations>] [-lattice-dump-ascii]
 189       [-rho-in <density>] [-rho-out <density] [-omega <omega>] [-kernel <kernel>]
 190       [-periodic-x]
 191       [-t <number of threads>]
 192       [-pin core{,core}*]
 193       [-verify]
 194       -- <kernel specific parameters>
 195
 196   -list           List available kernels.
 197
 198   -dims XxYxZ     Specify geometry dimensions.
 199
 200   -geometry blocks-<block size>
 201                   Geometetry with blocks of size <block size> regularily layout out.
 202
 203
 204 If an option is specified multiple times the last one overrides previous ones.
 205 This holds also true for ``-verify`` which sets geometry dimensions,
 206 iterations, etc, which can afterward be override, e.g.: ::
 207
 208   $ bin/lbmbenchk-linux-intel-release-dp -verfiy -dims 32x32x32
 209
 210 Kernel specific parameters can be obtained via selecting the specific kernel
 211 and passing ``-h`` as parameter: ::
 212
 213   $ bin/lbmbenchk-linux-intel-release-dp -kernel kernel-name -- -h
 214   ...
 215   Kernel parameters:
 216   [-blk <n>] [-blk-[xyz] <n>]
 217
 218
 219 A list of all available kernels can be obtained via ``-list``: ::
 220
 221   $ ../bin/lbmbenchk-linux-gcc-debug-dp -list
 222   Lattice Boltzmann Benchmark Kernels (LbmBenchKernels) Copyright (C) 2016, 2017 LSS, RRZE
 223   This program comes with ABSOLUTELY NO WARRANTY; for details see LICENSE.
 224   This is free software, and you are welcome to redistribute it under certain conditions.
 225
 226   LBM Benchmark Kernels 0.1, compiled Jul  5 2017 21:59:22, type: verification
 227   Available kernels to benchmark:
 228      list-aa-pv-soa
 229      list-aa-ria-soa
 230      list-aa-soa
 231      list-aa-aos
 232      list-pull-split-nt-1s-soa
 233      list-pull-split-nt-2s-soa
 234      list-push-soa
 235      list-push-aos
 236      list-pull-soa
 237      list-pull-aos
 238      push-soa
 239      push-aos
 240      pull-soa
 241      pull-aos
 242      blk-push-soa
 243      blk-push-aos
 244      blk-pull-soa
 245      blk-pull-aos
 246
 247 Kernels
 248 -------
 249
 250 The following list shortly describes available kernels:
 251
 252 - **push-soa/push-aos/pull-soa/pull-aos**:
 253   Unoptimized kernels (but stream/collide are already fused) using two grids as
 254   source and destination. Implement push/pull semantics as well structure of
 255   arrays (soa) or array of structures (aos) layout.
 256
 257 - **blk-push-soa/blk-push-aos/blk-pull-soa/blk-pull-aos**:
 258   The same as the unoptimized kernels without the blk prefix, except that they support
 259   spatial blocking, i.e. loop blocking of the three loops used to iterate over
 260   the lattice. Here manual work sharing for OpenMP is used.
 261
 262 - **aa-aos/aa-soa**:
 263   Straight forward implementation of AA pattern on full array with blocking support.
 264   Manual work sharing for OpenMP is used. Domain is partitioned only along the x dimension.
 265
 266 - **aa-vec-soa/aa-vec-sl-soa**:
 267   Optimized AA kernel with intrinsics on full array. aa-vec-sl-soa uses only
 268   one loop for iterating over the lattice instead of three nested ones.
 269
 270 - **list-push-soa/list-push-aos/list-pull-soa/list-pull-aos**:
 271   The same as the unoptimized kernels without the list prefix, but for indirect addressing.
 272   Here only a 1D vector of is used to store the fluid nodes, omitting the
 273   obstacles. An adjacency list is used to recover the neighborhood associations.
 274
 275 - **list-pull-split-nt-1s-soa/list-pull-split-nt-2s-soa**:
 276   Optimized variant of list-pull-soa. Chunks of the lattice are processed as
 277   once. Postcollision values are written back via nontemporal stores in 18 (1s)
 278   or 9 (2s) loops.
 279
 280 - **list-aa-aos/list-aa-soa**:
 281   Unoptimized implementation of the AA pattern for the 1D vector with adjacency
 282   list. Supported are array of structures (aos) and structure of arrays (soa)
 283   data layout is supported.
 284
 285 - **list-aa-ria-soa**:
 286   Implementation of AA pattern with intrinsics for the 1D vector with adjacency
 287   list. Furthermore it contains a vectorized even time step and run length
 288   coding to reduce the loop balance of the odd time step.
 289
 290 - **list-aa-pv-soa**:
 291   All optimizations of list-aa-ria-soa. Additional with partial vectorization
 292   of the odd time step.
 293
 294
 295 Note that all array of structures (aos) kernels might require blocking
 296 (depending on the domain size) to reach the performance of their structure of
 297 arrays (soa) counter parts.
 298
 299 The following table summarizes the properties of the kernels. Here **D** means
 300 direct addressing, i.e. full array, **I** means indirect addressing, i.e. 1D
 301 vector with adjacency list, **x** means supported, whereas **--** means unsupported.
 302 The loop balance B_l is computed for D3Q19 model with **double precision** floating
 303 point for PDFs (8 byte) and 4 byte integers for the index (adjacency list).
 304 As list-aa-ria-soa and list-aa-pv-soa support run length coding their effective
 305 loop balance depends on the geometry. The effective loop balance is printed
 306 during each run.
 307
 308
 309 ====================== =========== =========== ===== ======== ======== ============
 310 kernel name            prop. step  data layout addr. parallel blocking B_l [B/FLUP]
 311 ====================== =========== =========== ===== ======== ======== ============
 312 push-soa               OS          SoA         D     x         --      456
 313 push-aos               OS          AoS         D     x         --      456
 314 pull-soa               OS          SoA         D     x         --      456
 315 pull-aos               OS          AoS         D     x         --      456
 316 blk-push-soa           OS          SoA         D     x         x       456
 317 blk-push-aos           OS          AoS         D     x         x       456
 318 blk-pull-soa           OS          SoA         D     x         x       456
 319 blk-pull-aos           OS          AoS         D     x         x       456
 320 aa-soa                 AA          SoA         D     x         x       304
 321 aa-aos                 AA          AoS         D     x         x       304
 322 aa-vec-soa             AA          SoA         D     x         x       304
 323 aa-vec-sl-soa          AA          SoA         D     x         x       304
 324 list-push-soa          OS          SoA         I     x         x       528
 325 list-push-aos          OS          AoS         I     x         x       528
 326 list-pull-soa          OS          SoA         I     x         x       528
 327 list-pull-aos          OS          AoS         I     x         x       528
 328 list-pull-split-nt-1s  OS          SoA         I     x         x       376
 329 list-pull-split-nt-2s  OS          SoA         I     x         x       376
 330 list-aa-soa            AA          SoA         I     x         x       340
 331 list-aa-aos            AA          AoS         I     x         x       340
 332 list-aa-ria-soa        AA          SoA         I     x         x       304-342
 333 list-aa-pv-soa         AA          SoA         I     x         x       304-342
 334 ====================== =========== =========== ===== ======== ======== ============
 335
 336 Benchmarking
 337 ============
 338
 339 Correct benchmarking is a nontrivial task. Whenever benchmark results should be
 340 created make sure the binary was compiled with:
 341
 342 - ``BENCHMARK=on`` (default if not overriden) and
 343 - ``BUILD=release`` (default if not overriden) and
 344 - the correct ISA for macros is used, selected via ``ISA`` and
 345 - use ``TARCH`` to specify the architecture the compiler generates code for.
 346
 347 Intel Compiler
 348 --------------
 349
 350 For the Intel compiler one can specify depending on the target ISA extension:
 351
 352 - AVX:          ``TARCH=-xAVX``
 353 - AVX2 and FMA: ``TARCH=-xCORE-AVX2,-fma``
 354 - AVX512:       ``TARCH=-xCORE-AVX512``
 355 - KNL:          ``TARCH=-xMIC-AVX512``
 356
 357 Compiling for an architecture supporting AVX (Sandy Bridge, Ivy Bridge): ::
 358
 359   make ISA=avx TARCH=-xAVX
 360
 361
 362 Compiling for an architecture supporting AVX2 (Haswell, Broadwell): ::
 363
 364   make ISA=avx TARCH=-xCORE-AVX2,-fma
 365
 366 WARNING: ISA is here still set to ``avx`` as currently we have the FMA intrinsics not
 367 implemented. This might change in the future.
 368
 369
 370 Compiling for an architecture supporting AVX-512 (Skylake): ::
 371
 372   make ISA=avx TARCH=-xCORE-AVX512
 373
 374 WARNING: ISA is here still set to ``avx`` as currently we have no implementation for the
 375 AVX512 intrinsics. This might change in the future.
 376
 377
 378 Pinning
 379 -------
 380
 381 During benchmarking pinning should be used via the ``-pin`` parameter. Running
 382 a benchmark with 10 threads and pin them to the first 10 cores works like ::
 383
 384   $ bin/lbmbenchk-linux-intel-release-dp ... -t 10 -pin $(seq -s , 0 9)
 385
 386
 387 General Remarks
 388 ---------------
 389
 390 Things the binary does nor check or control:
 391
 392 - transparent huge pages: when allocating memory small 4 KiB pages might be
 393   replaced with larger ones. This is in general a good thing, but if this is
 394   really the case, depends on the system settings (check e.g. the status of
 395   ``/sys/kernel/mm/transparent_hugepage/enabled``).
 396   Currently ``madvise(MADV_HUGEPAGE)`` is used for allocations which are aligned to
 397   a 4 KiB page, which should be the case for the lattices.
 398   This should result in huge pages except THP is disabled on the machine.
 399   (NOTE: madvise() is used if ``HAVE_HUGE_PAGES`` is defined, which is currently
 400   hard coded defined in ``Memory.c``).
 401
 402 - CPU/core frequency: For reproducible results the frequency of all cores
 403   should be fixed.
 404
 405 - NUMA placement policy: The benchmark assumes a first touch policy, which
 406   means the memory will be placed at the NUMA domain the touching core is
 407   associated with. If a different policy is in place or the NUMA domain to be
 408   used is already full memory might be allocated in a remote domain. Accesses
 409   to remote domains typically have a higher latency and lower bandwidth.
 410
 411 - System load: interference with other application, especially on desktop
 412   systems should be avoided.
 413
 414 - Padding: For SoA based kernels the number of (fluid) nodes is automatically
 415   adjusted so that no cache or TLB thrashing should occur. The parameters are
 416   optimized for current Intel based systems. For more details look into the
 417   padding section.
 418
 419 - CPU dispatcher function: the compiler might add different versions of a
 420   function for different ISA extensions. Make sure the code you might think is
 421   executed is actually the code which is executed.
 422
 423 Padding
 424 -------
 425
 426 With correct padding cache and TLB thrashing can be avoided. Therefore the
 427 number of (fluid) nodes used in the data layout is artificially increased.
 428
 429 Currently automatic padding is active for kernels which support it. It can be
 430 controlled via the kernel parameter (i.e. parameter after the ``--``)
 431 ``-pad``. Supported values are ``auto`` (default), ``no`` (to disable padding),
 432 or a manual padding.
 433
 434 Automatic padding tries to avoid cache and TLB thrashing and pads for a 32
 435 entry (huge pages) TLB with 8 sets and a 512 set (L2) cache. This reflects the
 436 parameters of current Intel based processors.
 437
 438 Manual padding is done via a padding string and has the format
 439 ``mod_1+offset_1(,mod_n+offset_n)``, which specifies numbers of bytes.
 440 SoA data layouts can exhibit TLB thrashing. Therefore we want to distribute the
 441 19 pages with one lattice (36 with two lattices) we are concurrently accessing
 442 over as much sets in the TLB as possible.
 443 This is controlled by the distance between the accessed pages, which is the
 444 number of (fluid) nodes in between them and can be adjusted by adding further
 445 (fluid) nodes.
 446 We want the distance d (in bytes) between two accessed pages to be e.g.
 447 **d % (PAGE_SIZE * TLB_SETS) = PAGE_SIZE**.
 448 This would distribute the pages evenly over the sets. Hereby **PAGE_SIZE * TLB_SETS**
 449 would be our ``mod_1`` and **PAGE_SIZE** (after the =) our ``offset_1``.
 450 Measurements show that with only a quarter of half of a page size as offset
 451 higher performance is achieved, which is done by automatic padding.
 452 On top of this padding more paddings can be added. They are just added to the
 453 padding string and are separated by commas.
 454
 455 A zero modulus in the padding string has a special meaning. Here the
 456 corresponding offset is just added to the number of nodes. A padding string
 457 like ``-pad 0+16`` would at a static padding of two nodes (one node = 8 b).
 458
 459
 460 Geometries
 461 ==========
 462
 463 TODO: supported geometries: channel, pipe, blocks, fluid
 464
 465
 466 Performance Results
 467 ===================
 468
 469 The sections lists performance values measured on several machines for
 470 different kernels and geometries and **double precision** floating point data/arithmetic.
 471 The **RFM** column denotes the expected performance as predicted by the
 472 Roofline performance model [williams-2008]_.
 473 For performance prediction of each kernel a memory bandwidth benchmark is used
 474 which mimics the kernels memory access pattern and the kernel's loop balance
 475 (see [kernels]_ for details).
 476
 477 Machine Specifications
 478 ----------------------
 479
 480 **Ivy Bridge, Intel Xeon E5-2660 v2**
 481
 482 - Ivy Bridge architecture, AVX
 483 - 10 cores, 2.2 GHz
 484 - SMT enabled
 485 - memoy bandwidth:
 486
 487   - copy-19             32.7 GB/s
 488   - copy-19-nt-sl       35.6 GB/s
 489   - update-19           37.4 GB/s
 490
 491 **Haswell, Intel Xeon E5-2695 v3**
 492
 493 - Haswell architecture, AVX2, FMA
 494 - 14 cores, 2.3 GHz
 495 - 2 x 7 cores in cluster-on-die (CoD) mode enabled
 496 - SMT enabled
 497 - memory bandwidth:
 498
 499   - copy-19              47.3 GB/s
 500   - copy-19-nt-sl        47.1 GB/s
 501   - update-19            44.0 GB/s
 502
 503
 504 **Broadwell, Intel Xeon E5-2630 v4**
 505
 506 - Broadwell architecture, AVX2, FMA
 507 - 10 cores, 2.2 GHz
 508 - SMT disabled
 509 - memory bandwidth:
 510
 511   - copy-19              48.0 GB/s
 512   - copy-nt-sl-19        48.2 GB/s
 513   - update-19            51.1 GB/s
 514
 515 **Skylake, Intel Xeon Gold 6148**
 516
 517 NOTE: currently we only use AVX2 intrinsics.
 518
 519 - Skylake server architecture, AVX2, AVX512, 2 FMA units
 520 - 20 cores, 2.4 GHz
 521 - SMT enabled
 522 - memory bandwidth:
 523
 524   - copy-19              89.7 GB/s
 525   - copy-19-nt-sl        92.4 GB/s
 526   - update-19            93.6 GB/s
 527
 528 **Zen, AMD EPYC 7451**
 529
 530 - Zen architecture, AVX2, FMA
 531 - 24 cores, 2.3 GHz
 532 - SMT enabled
 533 - memory bandwidth:
 534
 535   - copy-19              111.9 GB/s
 536   - copy-19-nt-sl        111.7 GB/s
 537   - update-19            109.2 GB/s
 538
 539 **Zen, AMD Ryzen 7 1700X**
 540
 541 - Zen architecture, AVX2, FMA
 542 - 8 cores, 3.4 GHz
 543 - SMT enabled
 544 - memory bandwidth:
 545
 546   - copy-19              27.2 GB/s
 547   - copy-19-nt-sl        27.1 GB/s
 548   - update-19            26.1 GB/s
 549
 550 Single Socket Results
 551 ---------------------
 552
 553 - Geometry dimensions are for all measurements 500x100x100 nodes.
 554 - Note the **different scaling on the y axis** of the plots!
 555
 556 .. |perf_emmy_dp| image:: images/benchmark-emmy-dp.png
 557    :scale: 50 %
 558 .. |perf_emmy_sp| image:: images/benchmark-emmy-sp.png
 559    :scale: 50 %
 560 .. |perf_hasep1_dp| image:: images/benchmark-hasep1-dp.png
 561    :scale: 50 %
 562 .. |perf_hasep1_sp| image:: images/benchmark-hasep1-sp.png
 563    :scale: 50 %
 564 .. |perf_meggie_dp| image:: images/benchmark-meggie-dp.png
 565    :scale: 50 %
 566 .. |perf_meggie_sp| image:: images/benchmark-meggie-sp.png
 567    :scale: 50 %
 568 .. |perf_skylakesp2_dp| image:: images/benchmark-skylakesp2-dp.png
 569    :scale: 50 %
 570 .. |perf_skylakesp2_sp| image:: images/benchmark-skylakesp2-sp.png
 571    :scale: 50 %
 572 .. |perf_summitridge1_dp| image:: images/benchmark-summitridge1-dp.png
 573    :scale: 50 %
 574 .. |perf_summitridge1_sp| image:: images/benchmark-summitridge1-sp.png
 575    :scale: 50 %
 576 .. |perf_naples1_dp| image:: images/benchmark-naples1-dp.png
 577    :scale: 50 %
 578 .. |perf_naples1_sp| image:: images/benchmark-naples1-sp.png
 579    :scale: 50 %
 580
 581 .. list-table::
 582
 583   * - Ivy Bridge, Intel Xeon E5-2660 v2, Double Precision
 584   * - |perf_emmy_dp|
 585   * - Ivy Bridge, Intel Xeon E5-2660 v2, Single Precision
 586   * - |perf_emmy_sp|
 587   * - Haswell, Intel Xeon E5-2695 v3, Double Precision
 588   * - |perf_hasep1_dp|
 589   * - Haswell, Intel Xeon E5-2695 v3, Single Precision
 590   * - |perf_hasep1_sp|
 591   * - Broadwell, Intel Xeon E5-2630 v4, Double Precision
 592   * - |perf_meggie_dp|
 593   * - Broadwell, Intel Xeon E5-2630 v4, Single Precision
 594   * - |perf_meggie_sp|
 595   * - Skylake, Intel Xeon Gold 6148, Double Precision, **NOTE: currently we only use AVX2 intrinsics.**
 596   * - |perf_skylakesp2_dp|
 597   * - Skylake, Intel Xeon Gold 6148, Single Precision, **NOTE: currently we only use AVX2 intrinsics.**
 598   * - |perf_skylakesp2_sp|
 599   * - Zen, AMD Ryzen 7 1700X, Double Precision
 600   * - |perf_summitridge1_dp|
 601   * - Zen, AMD Ryzen 7 1700X, Single Precision
 602   * - |perf_summitridge1_sp|
 603   * - Zen, AMD EPYC 7451, Double Precision
 604   * - |perf_naples1_dp|
 605   * - Zen, AMD EPYC 7451, Single Precision
 606   * - |perf_naples1_sp|
 607
 608
 609 Licence
 610 =======
 611
 612 The Lattice Boltzmann Benchmark Kernels are licensed under GPLv3.
 613
 614
 615 Acknowledgements
 616 ================
 617
 618 This work was funded by BMBF, grant no. 01IH15003A (project SKAMPY).
 619
 620 This work was funded by KONWHIR project OMI4PAPS.
 621
 622
 623 Bibliography
 624 ============
 625
 626 .. [ginzburg-2008]
 627  I. Ginzburg, F. Verhaeghe, and D. d'Humières.
 628  Two-relaxation-time lattice Boltzmann scheme: About parametrization, velocity, pressure and mixed boundary conditions.
 629  Commun. Comput. Phys., 3(2):427-478, 2008.
 630
 631 .. [williams-2008]
 632  S. Williams, A. Waterman, and D. Patterson.
 633  Roofline: an insightful visual performance model for multicore architectures.
 634  Commun. ACM, 52(4):65-76, Apr 2009. doi:10.1145/1498765.1498785
 635
 636
 637 .. |datetime| date:: %Y-%m-%d %H:%M
 638
 639 Document was generated at |datetime|.
 640