doc/main.rst

   1 .. # --------------------------------------------------------------------------
   2    #
   3    # Copyright
   4    #   Markus Wittmann, 2016-2017
   5    #   RRZE, University of Erlangen-Nuremberg, Germany
   6    #   markus.wittmann -at- fau.de or hpc -at- rrze.fau.de
   7    #
   8    #   Viktor Haag, 2016
   9    #   LSS, University of Erlangen-Nuremberg, Germany
  10    #
  11    #  This file is part of the Lattice Boltzmann Benchmark Kernels (LbmBenchKernels).
  12    #
  13    #  LbmBenchKernels is free software: you can redistribute it and/or modify
  14    #  it under the terms of the GNU General Public License as published by
  15    #  the Free Software Foundation, either version 3 of the License, or
  16    #  (at your option) any later version.
  17    #
  18    #  LbmBenchKernels is distributed in the hope that it will be useful,
  19    #  but WITHOUT ANY WARRANTY; without even the implied warranty of
  20    #  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
  21    #  GNU General Public License for more details.
  22    #
  23    #  You should have received a copy of the GNU General Public License
  24    #  along with LbmBenchKernels.  If not, see <http://www.gnu.org/licenses/>.
  25    #
  26    # --------------------------------------------------------------------------
  27
  28 .. title:: LBM Benchmark Kernels Documentation
  29
  30
  31 ===================================
  32 LBM Benchmark Kernels Documentation
  33 ===================================
  34
  35 .. sectnum::
  36 .. contents::
  37
  38 Introduction
  39 ============
  40
  41 The lattice Boltzmann (LBM) benchmark kernels are a collection of LBM kernel
  42 implementations.
  43
  44 **AS SUCH, THE LBM BENCHMARK KERNELS ARE NOT A FULLY EQUIPPED CFD SOLVER AND
  45 SOLELY SERVE THE PURPOSE OF STUDYING POSSIBLE PERFORMANCE OPTIMIZATIONS AND/OR
  46 EXPERIMENTS.**
  47
  48 Currently all kernels utilize a D3Q19 discretization and the
  49 two-relaxation-time (TRT) collision operator [ginzburg-2008]_.
  50 All operations are carried out in double precision arithmetic.
  51
  52 Compilation
  53 ===========
  54
  55 The benchmark framework currently supports only Linux systems and the GCC and
  56 Intel compilers. Any other configuration will probably require adjustments to
  57 the code and the makefiles. Furthermore, some code might be platform-specific
  58 or at least POSIX-specific.
  59
  60 The benchmark can be built via ``make`` from the ``src`` subdirectory. This will
  61 generate one binary which hosts all implemented benchmark kernels.
  62
  63 Binaries are located under the ``bin`` subdirectory and will have different names
  64 depending on compiler and build configuration.
  65
  66 Compilation can target debug or release builds. Verification can be enabled
  67 with either build type, but it increases the runtime and hence is not
  68 suited for benchmarking.
  69
  70
  71 Debug and Verification
  72 ----------------------
  73
  74 ::
  75
  76   make BUILD=debug BENCHMARK=off
  77
  78 Running ``make`` with ``BUILD=debug`` builds the debug version of
  79 the benchmark kernels: no optimizations are performed, line numbers and
  80 debug symbols are included, and ``DEBUG`` is defined. The resulting
  81 binary will be found in the ``bin`` subdirectory and named
  82 ``lbmbenchk-linux-<compiler>-debug``.
  83
  84 Specifying ``BENCHMARK=off`` turns on verification
  85 (``VERIFICATION=on``), statistics (``STATISTICS=on``), and VTK output
  86 (``VTK_OUTPUT=on``).
  87
  88 Please note that the generated binary will therefore
  89 exhibit poor performance.
  90
  91
  92 Release and Verification
  93 ------------------------
  94
  95 Verification with debug builds can be extremely slow. Hence verification
  96 capabilities can also be built into release builds: ::
  97
  98   make BENCHMARK=off
  99
 100
 101 Benchmarking
 102 ------------
 103
 104 To generate a binary for benchmarking run make with ::
 105
 106   make
 107
 108 By default ``BENCHMARK=on`` and ``BUILD=release`` are set, where
 109 ``BUILD=release`` turns optimizations on and ``BENCHMARK=on`` disables
 110 verification, statistics, and VTK output.
 111
 112 See the Options Summary below for a description of further options, e.g.
 113 ``TARCH``, and see the Benchmarking section.
 114
 115 Compilers
 116 ---------
 117
 118 Currently only the GCC and Intel compilers under Linux are supported. The
 119 compiler is chosen via ``CONFIG=linux-gcc`` or
 120 ``CONFIG=linux-intel``.
 121
 122
 123 Cleaning
 124 --------
 125
 126 For each configuration and build (debug/release) a subdirectory under the
 127 ``src/obj`` directory is created where the dependency and object files are
 128 stored.
 129 With ::
 130
 131   make CONFIG=... BUILD=... clean
 132
 133 a specific combination is selected and cleaned, whereas with ::
 134
 135   make clean-all
 136
 137 all object and dependency files are deleted.
 138
 139
 140 Options Summary
 141 ---------------
 142
 143 Options that can be specified when building the suite with make:
 144
 145 ============= ======================= ============ ==========================================================
 146 name          values                  default      description
 147 ============= ======================= ============ ==========================================================
 148 BENCHMARK     on, off                 on           If on, disables VERIFICATION, STATISTICS, and VTK_OUTPUT. If off, enables these three options.
 149 BUILD         debug, release          release      debug: no optimization, debug symbols, DEBUG defined. release: optimizations enabled.
 150 CONFIG        linux-gcc, linux-intel  linux-intel  Select GCC or Intel compiler.
 151 ISA           avx, sse                avx          Determines which ISA extension is used for macro definitions of the intrinsics. This is *not* the architecture the compiler generates code for.
 152 OPENMP        on, off                 on           OpenMP, i.e. threading support.
 153 STATISTICS    on, off                 off          View statistics, like density etc, during simulation.
 154 TARCH         --                      --           Via TARCH the architecture the compiler generates code for can be overridden. The value depends on the chosen compiler.
 155 VERIFICATION  on, off                 off          Turn verification on/off.
 156 VTK_OUTPUT    on, off                 off          Enable/Disable VTK file output.
 157 ============= ======================= ============ ==========================================================
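The options above are plain ``make`` variables, so a build can be scripted. The following is a minimal sketch and not part of LbmBenchKernels: the helper names ``make_command`` and ``binary_name`` are hypothetical, and the binary name pattern is only inferred from the names that appear in this document (e.g. ``lbmbenchk-linux-gcc-debug``).

```python
# Hypothetical helpers, not part of LbmBenchKernels.
# Allowed values are copied from the options table; TARCH is compiler specific
# and therefore passed through without any check.
ALLOWED = {
    "BENCHMARK": {"on", "off"},
    "BUILD": {"debug", "release"},
    "CONFIG": {"linux-gcc", "linux-intel"},
    "ISA": {"avx", "sse"},
    "OPENMP": {"on", "off"},
    "STATISTICS": {"on", "off"},
    "VERIFICATION": {"on", "off"},
    "VTK_OUTPUT": {"on", "off"},
}


def make_command(**options):
    """Return the argument list for a make call with the given options."""
    argv = ["make"]
    for name, value in options.items():
        if name != "TARCH" and value not in ALLOWED.get(name, ()):
            raise ValueError(f"unsupported value {value!r} for {name}")
        argv.append(f"{name}={value}")
    return argv


def binary_name(config="linux-intel", build="release"):
    """Binary path, assuming the pattern bin/lbmbenchk-<config>-<build>."""
    return f"bin/lbmbenchk-{config}-{build}"


if __name__ == "__main__":
    print(" ".join(make_command(CONFIG="linux-gcc", BUILD="debug", BENCHMARK="off")))
    print(binary_name("linux-gcc", "debug"))
```

The list returned by ``make_command`` could be handed to ``subprocess.run`` from within the ``src`` subdirectory.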
 158
 159 Invocation
 160 ==========
 161
 162 Running the binary prints, along with the GPL license header, a line like the following: ::
 163
 164   LBM Benchmark Kernels 0.1, compiled Jul  5 2017 21:59:22, type: verification
 165
 166 if verification was enabled during compilation or ::
 167
 168   LBM Benchmark Kernels 0.1, compiled Jul  5 2017 21:59:22, type: benchmark
 169
 170 if verification was disabled during compilation.
 171
 172 Command Line Parameters
 173 -----------------------
 174
 175 Running the binary with ``-h`` lists all available parameters: ::
 176
 177   Usage:
 178   ./lbmbenchk -list
 179   ./lbmbenchk
 180       [-dims XxYyZ] [-geometry box|channel|pipe|blocks[-<block size>]] [-iterations <iterations>] [-lattice-dump-ascii]
 181       [-rho-in <density>] [-rho-out <density] [-omega <omega>] [-kernel <kernel>]
 182       [-periodic-x]
 183       [-t <number of threads>]
 184       [-pin core{,core}*]
 185       [-verify]
 186       -- <kernel specific parameters>
 187
 188   -list           List available kernels.
 189
 190   -dims XxYxZ     Specify geometry dimensions.
 191
 192   -geometry blocks-<block size>
 193                   Geometetry with blocks of size <block size> regularily layout out.
 194
 195
 196 If an option is specified multiple times, the last one overrides previous ones.
 197 This also holds true for ``-verify``, which sets geometry dimensions,
 198 iterations, etc., which can afterwards be overridden, e.g.: ::
 199
 200   $ bin/lbmbenchk-linux-intel-release -verify -dims 32x32x32
 201
 202 Kernel specific parameters can be obtained by selecting the specific kernel
 203 and passing ``-h`` as parameter: ::
 204
 205   $ bin/lbmbenchk-linux-intel-release -kernel kernel-name -- -h
 206   ...
 207   Kernel parameters:
 208   [-blk <n>] [-blk-[xyz] <n>]
 209
 210
 211 A list of all available kernels can be obtained via ``-list``: ::
 212
 213   $ ../bin/lbmbenchk-linux-gcc-debug -list
 214   Lattice Boltzmann Benchmark Kernels (LbmBenchKernels) Copyright (C) 2016, 2017 LSS, RRZE
 215   This program comes with ABSOLUTELY NO WARRANTY; for details see LICENSE.
 216   This is free software, and you are welcome to redistribute it under certain conditions.
 217
 218   LBM Benchmark Kernels 0.1, compiled Jul  5 2017 21:59:22, type: verification
 219   Available kernels to benchmark:
 220      list-aa-pv-soa
 221      list-aa-ria-soa
 222      list-aa-soa
 223      list-aa-aos
 224      list-pull-split-nt-1s-soa
 225      list-pull-split-nt-2s-soa
 226      list-push-soa
 227      list-push-aos
 228      list-pull-soa
 229      list-pull-aos
 230      push-soa
 231      push-aos
 232      pull-soa
 233      pull-aos
 234      blk-push-soa
 235      blk-push-aos
 236      blk-pull-soa
 237      blk-pull-aos
 238
 239 Kernels
 240 -------
 241
 242 The following list briefly describes the available kernels:
 243
 244 - push-soa/push-aos/pull-soa/pull-aos:
 245   Unoptimized kernels (but stream/collide are already fused) using two grids as
 246   source and destination. Implement push/pull semantics as well as structure of
 247   arrays (soa) or array of structures (aos) layout.
 248
 249 - blk-push-soa/blk-push-aos/blk-pull-soa/blk-pull-aos:
 250   The same as the unoptimized kernels without the blk prefix, except that they support
 251   spatial blocking, i.e. loop blocking of the three loops used to iterate over
 252   the lattice. Here manual work sharing for OpenMP is used.
 253
 254 - list-push-soa/list-push-aos/list-pull-soa/list-pull-aos:
 255   The same as the unoptimized kernels without the list prefix, but for indirect addressing.
 256   Here only a 1D vector is used to store the fluid nodes, omitting the
 257   obstacles. An adjacency list is used to recover the neighborhood associations.
 258
 259 - list-pull-split-nt-1s-soa/list-pull-split-nt-2s-soa:
 260   Optimized variant of list-pull-soa. Chunks of the lattice are processed at
 261   once. Post-collision values are written back via nontemporal stores in 18 (1s)
 262   or 9 (2s) loops.
 263
 264 - list-aa-aos/list-aa-soa:
 265   Unoptimized implementation of the AA pattern for the 1D vector with adjacency
 266   list. Both array of structures (aos) and structure of arrays (soa)
 267   data layouts are supported.
 268
 269 - list-aa-ria-soa:
 270   Implementation of AA pattern with intrinsics for the 1D vector with adjacency
 271   list. Furthermore it contains a vectorized even time step and run length
 272   coding to reduce the loop balance of the odd time step.
 273
 274 - list-aa-pv-soa:
 275   Includes all optimizations of list-aa-ria-soa, with additional partial
 276   vectorization of the odd time step.
 277
 278
 279 Note that all array of structures (aos) kernels might require blocking
 280 (depending on the domain size) to reach the performance of their structure of
 281 arrays (soa) counterparts.
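To illustrate the difference between the two data layouts named above, the following sketch shows how a PDF could be located in a flat array for each of them. It is an illustration of the general AoS/SoA idea only; the function names are hypothetical and the actual index macros in the source code may differ (e.g. because of padding).

```python
N_PDFS = 19  # D3Q19: 19 particle distribution functions (PDFs) per node


def index_aos(node, direction):
    """Array of structures: the 19 PDFs of one node are stored contiguously."""
    return node * N_PDFS + direction


def index_soa(node, direction, n_nodes):
    """Structure of arrays: all nodes' PDFs of one direction are contiguous."""
    return direction * n_nodes + node


if __name__ == "__main__":
    n_nodes = 100
    # Distance between the same direction of two neighboring nodes:
    print(index_aos(1, 0) - index_aos(0, 0))                     # 19 elements
    print(index_soa(1, 0, n_nodes) - index_soa(0, 0, n_nodes))   # 1 element
```

With SoA a kernel sweeping over the nodes accesses 19 separate unit-stride streams, whereas with AoS all PDFs of a node share a few cache lines. This is one reason why the two layouts react differently to blocking and padding.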
 282
 283 The following table summarizes the properties of the kernels. Here **D** means
 284 direct addressing, i.e. full array, **I** means indirect addressing, i.e. 1D
 285 vector with adjacency list, **x** means supported, whereas **--** means unsupported.
 286 The loop balance B_l is computed for D3Q19 model with double precision floating
 287 point for PDFs (8 byte) and 4 byte integers for the index (adjacency list).
 288 As list-aa-ria-soa and list-aa-pv-soa support run length coding their effective
 289 loop balance depends on the geometry. The effective loop balance is printed
 290 during each run.
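Indirect addressing, as described above, stores only the fluid nodes in a 1D vector and keeps an adjacency list for the neighborhood. The following is a minimal sketch of that idea under simplifying assumptions: the function ``build_list`` is hypothetical, a missing neighbor is marked with ``-1``, and boundary handling such as bounce-back is left out. The real kernels may organize the adjacency list differently.

```python
def build_list(geometry, directions):
    """geometry[z][y][x] is 1 for a fluid node and 0 for an obstacle.

    Returns a mapping from coordinates to the position in the 1D vector and,
    for each fluid node, the vector positions of its neighbors.
    """
    nz, ny, nx = len(geometry), len(geometry[0]), len(geometry[0][0])

    # Enumerate fluid nodes only; obstacles get no slot in the 1D vector.
    node_id = {}
    for z in range(nz):
        for y in range(ny):
            for x in range(nx):
                if geometry[z][y][x] == 1:
                    node_id[(x, y, z)] = len(node_id)

    # One row per fluid node, one entry per direction (-1: no fluid neighbor).
    adjacency = []
    for (x, y, z) in node_id:
        adjacency.append([node_id.get((x + dx, y + dy, z + dz), -1)
                          for (dx, dy, dz) in directions])
    return node_id, adjacency


if __name__ == "__main__":
    # A single row of four nodes with one obstacle, two directions (+x, -x).
    geometry = [[[1, 0, 1, 1]]]
    ids, adj = build_list(geometry, [(1, 0, 0), (-1, 0, 0)])
    print(len(ids), adj)
```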
 291
 292
 293 ====================== =========== =========== ===== ======== ======== ============
 294 kernel name            prop. step  data layout addr. parallel blocking B_l [B/FLUP]
 295 ====================== =========== =========== ===== ======== ======== ============
 296 push-soa               OS          SoA         D     x         --      456
 297 push-aos               OS          AoS         D     x         --      456
 298 pull-soa               OS          SoA         D     x         --      456
 299 pull-aos               OS          AoS         D     x         --      456
 300 blk-push-soa           OS          SoA         D     x         x       456
 301 blk-push-aos           OS          AoS         D     x         x       456
 302 blk-pull-soa           OS          SoA         D     x         x       456
 303 blk-pull-aos           OS          AoS         D     x         x       456
 304 list-push-soa          OS          SoA         I     x         x       528
 305 list-push-aos          OS          AoS         I     x         x       528
 306 list-pull-soa          OS          SoA         I     x         x       528
 307 list-pull-aos          OS          AoS         I     x         x       528
 308 list-pull-split-nt-1s  OS          SoA         I     x         x       376
 309 list-pull-split-nt-2s  OS          SoA         I     x         x       376
 310 list-aa-soa            AA          SoA         I     x         x       340
 311 list-aa-aos            AA          AoS         I     x         x       340
 312 list-aa-ria-soa        AA          SoA         I     x         x       304-342
 313 list-aa-pv-soa         AA          SoA         I     x         x       304-342
 314 ====================== =========== =========== ===== ======== ======== ============
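The fixed loop balance values in the table can be reproduced with simple arithmetic. The sketch below shows one decomposition that matches the numbers. It rests on assumptions that this document does not state explicitly: regular stores to a second grid cause a write-allocate transfer, nontemporal stores and the in-place AA pattern do not, 18 adjacency entries are read per node, and the AA kernels read the adjacency list only in every second time step. The range given for the run length coded kernels is not modeled.

```python
PDFS = 19             # D3Q19
PDF_BYTES = 8         # double precision
INDEX_BYTES = 4       # one adjacency list entry
NEIGHBORS = PDFS - 1  # assumption: the rest PDF needs no index


def loop_balance(write_allocate, indirect, index_read_every=1):
    """Bytes transferred per fluid lattice node update (B/FLUP)."""
    traffic = 2 * PDFS * PDF_BYTES          # load and store every PDF once
    if write_allocate:
        traffic += PDFS * PDF_BYTES         # destination lines are read first
    if indirect:
        traffic += NEIGHBORS * INDEX_BYTES / index_read_every
    return traffic


if __name__ == "__main__":
    print(loop_balance(write_allocate=True, indirect=False))    # push/pull, blk-*
    print(loop_balance(write_allocate=True, indirect=True))     # list-push/pull
    print(loop_balance(write_allocate=False, indirect=True))    # list-pull-split-nt
    print(loop_balance(write_allocate=False, indirect=True,
                       index_read_every=2))                     # list-aa
```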
 315
 316 Benchmarking
 317 ============
 318
 319 Correct benchmarking is a nontrivial task. Whenever benchmark results are to be
 320 produced, make sure the binary was compiled with:
 321
 322 - ``BENCHMARK=on`` (default if not overridden),
 323 - ``BUILD=release`` (default if not overridden),
 324 - the correct ISA for the intrinsics macros, selected via ``ISA``, and
 325 - ``TARCH`` set to the architecture the compiler should generate code for.
 326
 327 Intel Compiler
 328 --------------
 329
 330 For the Intel compiler one can specify depending on the target ISA extension:
 331
 332 - AVX:          ``TARCH=-xAVX``
 333 - AVX2 and FMA: ``TARCH=-xCORE-AVX2,-fma``
 334 - AVX512:       ``TARCH=-xCORE-AVX512``
 335 - KNL:          ``TARCH=-xMIC-AVX512``
 336
 337 Compiling for an architecture supporting AVX (Sandy Bridge, Ivy Bridge): ::
 338
 339   make ISA=avx TARCH=-xAVX
 340
 341
 342 Compiling for an architecture supporting AVX2 (Haswell, Broadwell): ::
 343
 344   make ISA=avx TARCH=-xCORE-AVX2,-fma
 345
 346 WARNING: ISA is still set to ``avx`` here, as the FMA intrinsics are currently not
 347 implemented. This might change in the future.
 348
 349
 350 Compiling for an architecture supporting AVX-512 (Skylake): ::
 351
 352   make ISA=avx TARCH=-xCORE-AVX512
 353
 354 WARNING: ISA is still set to ``avx`` here, as there is currently no implementation for the
 355 AVX512 intrinsics. This might change in the future.
 356
 357
 358 Pinning
 359 -------
 360
 361 During benchmarking pinning should be used via the ``-pin`` parameter. Running
 362 a benchmark with 10 threads pinned to the first 10 cores works like ::
 363
 364   $ bin/lbmbenchk-linux-intel-release ... -t 10 -pin $(seq -s , 0 9)
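When benchmark runs are scripted, e.g. for a thread scaling study, the ``-t`` and ``-pin`` arguments can be generated instead of typed. The following is a small sketch with a hypothetical helper name. It assumes that the cores to be used are numbered consecutively, which depends on the machine's core numbering.

```python
def pin_arguments(n_threads, first_core=0):
    """Return the -t/-pin arguments for n_threads consecutive cores."""
    cores = ",".join(str(c) for c in range(first_core, first_core + n_threads))
    return ["-t", str(n_threads), "-pin", cores]


if __name__ == "__main__":
    for n in (1, 2, 4):
        print(" ".join(pin_arguments(n)))
```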
 365
 366
 367 General Remarks
 368 ---------------
 369
 370 Things the binary does not check or control:
 371
 372 - Transparent huge pages (THP): when allocating memory, small 4 KiB pages might be
 373   replaced with larger ones. This is in general a good thing, but whether it
 374   actually happens depends on the system settings (check e.g. the status of
 375   ``/sys/kernel/mm/transparent_hugepage/enabled``).
 376   Currently ``madvise(MADV_HUGEPAGE)`` is used for allocations which are aligned to
 377   a 4 KiB page, which should be the case for the lattices.
 378   This should result in huge pages unless THP is disabled on the machine.
 379   (NOTE: ``madvise()`` is used if ``HAVE_HUGE_PAGES`` is defined, which is currently
 380   hard-coded in ``Memory.c``).
 381
 382 - CPU/core frequency: For reproducible results the frequency of all cores
 383   should be fixed.
 384
 385 - NUMA placement policy: The benchmark assumes a first touch policy, which
 386   means the memory will be placed at the NUMA domain the touching core is
 387   associated with. If a different policy is in place or the NUMA domain to be
 388   used is already full memory might be allocated in a remote domain. Accesses
 389   to remote domains typically have a higher latency and lower bandwidth.
 390
 391 - System load: interference with other applications, especially on desktop
 392   systems, should be avoided.
 393
 394 - Padding: For SoA based kernels the number of (fluid) nodes is automatically
 395   adjusted so that no cache or TLB thrashing should occur. The parameters are
 396   optimized for current Intel based systems. For more details look into the
 397   padding section.
 398
 399 - CPU dispatcher function: the compiler might add different versions of a
 400   function for different ISA extensions. Make sure the code you expect to be
 401   executed is actually the code which is executed.
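Since the binary does not check the THP setting itself, it can be checked before a run. The sketch below reads the sysfs file mentioned in the list above; on Linux it contains all modes with the active one in brackets, e.g. ``always [madvise] never``. The function names are hypothetical.

```python
def active_thp_mode(text):
    """Return the bracketed (active) mode of the THP 'enabled' file content."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def read_thp_mode(path="/sys/kernel/mm/transparent_hugepage/enabled"):
    """Return the active THP mode, or None if the file cannot be read."""
    try:
        with open(path) as handle:
            return active_thp_mode(handle.read())
    except OSError:
        return None


if __name__ == "__main__":
    mode = read_thp_mode()
    if mode in ("always", "madvise"):
        print(f"THP mode '{mode}': madvise(MADV_HUGEPAGE) can take effect")
    else:
        print(f"THP mode '{mode}': huge pages are not expected")
```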
 402
 403 Padding
 404 -------
 405
 406 With correct padding cache and TLB thrashing can be avoided. Therefore the
 407 number of (fluid) nodes used in the data layout is artificially increased.
 408
 409 Currently automatic padding is active for kernels which support it. It can be
 410 controlled via the kernel parameter (i.e. parameter after the ``--``)
 411 ``-pad``. Supported values are ``auto`` (default), ``no`` (to disable padding),
 412 or a manual padding.
 413
 414 Automatic padding tries to avoid cache and TLB thrashing and pads for a 32
 415 entry (huge pages) TLB with 8 sets and a 512 set (L2) cache. This reflects the
 416 parameters of current Intel based processors.
 417
 418 Manual padding is done via a padding string of the format
 419 ``mod_1+offset_1(,mod_n+offset_n)``, where all values are numbers of bytes.
 420 SoA data layouts can exhibit TLB thrashing. Therefore we want to distribute the
 421 19 pages with one lattice (36 with two lattices) we are concurrently accessing
 422 over as many sets in the TLB as possible.
 423 This is controlled by the distance between the accessed pages, which is the
 424 number of (fluid) nodes in between them and can be adjusted by adding further
 425 (fluid) nodes.
 426 We want the distance d (in bytes) between two accessed pages to be e.g.
 427 **d % (PAGE_SIZE * TLB_SETS) = PAGE_SIZE**.
 428 This would distribute the pages evenly over the sets. Here **PAGE_SIZE * TLB_SETS**
 429 would be our ``mod_1`` and **PAGE_SIZE** (after the =) our ``offset_1``.
 430 Measurements show that with only a quarter or half of a page size as offset
 431 higher performance is achieved, which is what automatic padding does.
 432 On top of this padding more paddings can be added. They are just appended to the
 433 padding string and are separated by commas.
 434
 435 A zero modulus in the padding string has a special meaning. Here the
 436 corresponding offset is just added to the number of nodes. A padding string
 437 like ``-pad 0+16`` would add a static padding of two nodes (one node = 8 bytes).
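The effect of a padding string on the number of nodes can be sketched as follows. This is not the implementation used by the kernels: the function names are hypothetical, the entries are assumed to be applied one after another, and moduli and offsets are assumed to be multiples of the node size of 8 bytes.

```python
NODE_BYTES = 8  # one double precision value per node and direction


def parse_padding(pad):
    """Turn 'mod_1+offset_1,mod_2+offset_2' into [(mod_1, offset_1), ...]."""
    entries = []
    for item in pad.split(","):
        mod, offset = item.split("+")
        entries.append((int(mod), int(offset)))
    return entries


def padded_nodes(n_nodes, pad):
    """Smallest node count >= n_nodes with (nodes * 8) % mod == offset."""
    for mod, offset in parse_padding(pad):
        if mod == 0:
            # Zero modulus: the offset is a static padding in bytes.
            n_nodes += offset // NODE_BYTES
            continue
        if mod % NODE_BYTES or offset % NODE_BYTES:
            raise ValueError("mod and offset must be multiples of 8 bytes")
        missing_bytes = (offset - n_nodes * NODE_BYTES) % mod
        n_nodes += missing_bytes // NODE_BYTES
    return n_nodes


if __name__ == "__main__":
    page_size, tlb_sets = 4096, 8
    nodes = padded_nodes(5000000, f"{page_size * tlb_sets}+{page_size}")
    print(nodes, (nodes * NODE_BYTES) % (page_size * tlb_sets))
```

In this example the distance between the arrays of two directions becomes a multiple of ``PAGE_SIZE * TLB_SETS`` plus one page, which is the condition stated above.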
 438
 439
 440 Geometries
 441 ==========
 442
 443 TODO: supported geometries: channel, pipe, blocks, fluid
 444
 445
 446 Performance Results
 447 ===================
 448
 449 This section lists performance values in MFLUP/s measured on several machines for
 450 different kernels and geometries.
 451 The **RFM** column denotes the expected performance as predicted by the
 452 Roofline performance model [williams-2008]_.
 453 For the performance prediction of each kernel a memory bandwidth benchmark is used
 454 which mimics the kernel's memory access pattern and the kernel's loop balance
 455 (see the Kernels section for details).
 456
 457 Haswell, Intel Xeon E5-2695 v3
 458 ------------------------------
 459
 460 - Haswell architecture, AVX2, FMA
 461 - 14 cores, 2.3 GHz
 462 - 2 x 7 cores, cluster-on-die (CoD) mode enabled
 463 - SMT enabled
 464
 465 memory bandwidth:
 466
 467 - copy-19              47.3 GB/s
 468 - copy-19-nt-sl        47.1 GB/s
 469 - update-19            44.0 GB/s
 470
 471 geometry dimensions:  500x100x100
 472
 473 =========================    =========  =========  =========  =========  =========  =========  =========  =========  =========  =========  =========  =====
 474 kernel                            pipe   blocks-2   blocks-4   blocks-6   blocks-8  blocks-10  blocks-15  blocks-16  blocks-20  blocks-25  blocks-32  RFM
 475 =========================    =========  =========  =========  =========  =========  =========  =========  =========  =========  =========  =========  =====
 476 blk-push-aos                     58.82      49.85      57.34      59.90      61.37      62.17      65.30      64.00      67.54      64.46      69.69   104
 477 blk-push-soa                     32.32      33.46      34.02      34.64      35.06      35.04      36.31      35.44      37.20      35.14      37.95   104
 478 blk-pull-aos                     56.97      51.41      56.09      57.92      59.98      59.83      63.37      61.55      65.50      63.11      67.02   104
 479 blk-pull-soa                     49.29      46.23      47.50      51.97      51.27      49.52      55.23      53.13      54.50      49.79      57.90   104
 480 aa-aos                           91.35      66.14      76.80      84.76      83.63      91.36      93.46      92.62      93.91      92.25      92.93   145
 481 aa-soa                           75.51      65.68      70.94      71.36      73.83      75.46      74.84      79.48      83.28      77.70      82.72   145
 482 aa-vec-soa                       93.85      83.44      91.58      93.96      94.35      96.62     101.76      96.72     106.37     102.60     110.28   145
 483 list-push-aos                    80.29      80.97      80.95      81.10      81.37      82.44      81.77      81.49      80.72      81.93      80.93   83
 484 list-push-soa                    47.52      42.65      45.28      46.64      43.46      40.59      44.94      46.55      41.53      45.98      44.86   83
 485 list-pull-aos                    85.30      82.97      86.43      83.42      86.33      83.70      86.43      83.77      83.10      85.89      84.44   83
 486 list-pull-soa                    62.12      63.61      63.28      61.32      66.72      62.65      64.82      60.49      58.01      64.46      62.52   83
 487 list-pull-split-nt-1s-soa       121.35     113.77     115.29     113.54     117.00     116.46     114.78     114.54     110.83     112.67     117.85   125
 488 list-pull-split-nt-2s-soa       118.09     110.48     112.55     113.18     113.44     111.85     109.27     114.41     110.28     111.78     113.74   125
 489 list-aa-aos                     121.28     118.63     119.00     118.50     121.99     119.11     118.83     121.47     121.62     126.18     120.12   129
 490 list-aa-soa                     126.34     116.90     129.45     127.12     129.41     121.42     126.19     126.76     126.70     124.40     125.22   129
 491 list-aa-ria-soa                 133.68     121.82     126.04     128.46     131.15     132.25     128.78     133.50     126.69     124.40     130.37   145
 492 list-aa-pv-soa                  146.22     124.39     130.73     136.29     137.61     131.21     138.65     138.78     127.02     132.40     138.37   145
 493 =========================    =========  =========  =========  =========  =========  =========  =========  =========  =========  =========  =========  =====
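The RFM values in the table above can be reproduced by dividing a measured memory bandwidth by the loop balance of a kernel. The sketch below does this for the Haswell numbers. The pairing of kernels with bandwidth benchmarks is inferred from the table values and not stated in this document, and 1 GB/s is taken as 10^9 B/s.

```python
def roofline_mflups(bandwidth_gb_s, loop_balance):
    """Memory bound performance limit in MFLUP/s.

    bandwidth_gb_s: measured memory bandwidth in GB/s (10^9 B/s)
    loop_balance:   bytes transferred per fluid lattice node update (B/FLUP)
    """
    return bandwidth_gb_s * 1e9 / loop_balance / 1e6


if __name__ == "__main__":
    # (kernel group, bandwidth in GB/s, loop balance in B/FLUP), Haswell
    cases = [
        ("blk-*",              47.3, 456),  # copy-19
        ("list-push/pull",     44.0, 528),  # update-19
        ("list-pull-split-nt", 47.1, 376),  # copy-19-nt-sl
        ("list-aa",            44.0, 340),  # update-19
        ("aa, list-aa-ria/pv", 44.0, 304),  # update-19
    ]
    for name, bandwidth, balance in cases:
        print(f"{name:20s} {roofline_mflups(bandwidth, balance):6.1f} MFLUP/s")
```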
 494
 495
 496 Broadwell, Intel Xeon E5-2630 v4
 497 --------------------------------
 498
 499 - Broadwell architecture, AVX2, FMA
 500 - 10 cores, 2.2 GHz
 501 - SMT disabled
 502
 503 memory bandwidth:
 504
 505 - copy-19              48.0 GB/s
 506 - copy-19-nt-sl        48.2 GB/s
 507 - update-19            51.1 GB/s
 508
 509 geometry dimensions:  500x100x100
 510
 511 =========================   =========  =========  =========  =========  =========  =========  =========  =========  =========  =========  =========  =======
 512 kernel                           pipe   blocks-2   blocks-4   blocks-6   blocks-8  blocks-10  blocks-15  blocks-16  blocks-20  blocks-25  blocks-32  RFM
 513 =========================   =========  =========  =========  =========  =========  =========  =========  =========  =========  =========  =========  =======
 514 blk-push-aos                    55.75      47.62      54.57      57.10      58.49      59.00      61.72      60.56      64.05      61.10      66.03  105
 515 blk-push-soa                    30.06      31.09      32.13      32.54      32.74      32.72      33.81      33.19      34.90      33.21      35.75  105
 516 blk-pull-aos                    53.80      48.61      53.08      54.99      56.08      56.68      59.20      58.12      61.49      58.71      63.45  105
 517 blk-pull-soa                    46.96      46.61      48.84      49.70      50.33      50.46      52.36      51.39      54.20      51.61      55.71  105
 518 aa-aos                          91.40      66.99      78.47      83.38      86.62      88.62      92.98      91.54      97.08      94.93      98.90  168
 519 aa-soa                          83.01      69.96      75.85      77.72      79.01      79.29      82.38      80.11      85.70      83.91      87.69  168
 520 aa-vec-soa                     112.03      96.52     105.32     109.76     112.55     113.82     120.55     118.37     126.30     121.37     131.94  168
 521 list-push-aos                   75.13      74.18      75.20      75.42      75.24      75.99      75.80      75.80      75.54      76.22      76.21   97
 522 list-push-soa                   40.99      38.14      39.00      38.89      38.89      39.67      39.87      39.28      39.35      40.08      40.13   97
 523 list-pull-aos                   82.07      82.88      83.29      83.09      83.32      83.49      82.82      82.88      83.32      82.60      82.93   97
 524 list-pull-soa                   62.07      60.40      61.89      61.39      62.43      60.90      60.48      62.80      62.50      61.10      60.38   97
 525 list-pull-split-nt-1s-soa      125.81     120.60     121.96     122.34     122.86     123.53     123.64     123.67     125.94     124.09     123.69  128
 526 list-pull-split-nt-2s-soa      122.79     117.16     118.86     119.16     119.56     119.99     120.01     120.03     122.64     120.57     120.39  128
 527 list-aa-aos                    128.13     127.41     129.31     129.07     129.79     129.63     129.67     129.94     129.12     128.41     129.72  150
 528 list-aa-soa                    141.60     139.78     141.58     142.16     141.94     141.31     142.37     142.25     142.43     141.40     142.26  150
 529 list-aa-ria-soa                141.82     134.88     140.15     140.72     141.67     140.51     141.18     141.29     142.97     141.94     143.25  168
 530 list-aa-pv-soa                 164.79     140.95     159.24     161.78     162.40     163.04     164.69     164.38     165.11     165.75     166.09  168
 531 =========================   =========  =========  =========  =========  =========  =========  =========  =========  =========  =========  =========  =======
 532
 533
 534 Skylake, Intel Xeon Gold 6148
 535 -----------------------------
 536
 537 - Skylake architecture, AVX2, FMA, AVX512
 538 - 20 cores, 2.4 GHz
 539 - SMT enabled
 540
 541 memory bandwidth:
 542
 543 - copy-19                  89.7 GB/s
 544 - copy-19-nt-sl            92.4 GB/s
 545 - update-19                93.6 GB/s
 546
 547 geometry dimensions:  500x100x100
 548
 549
 550 =========================    =========  =========  =========  =========  =========  =========  =========  =========  =========  =========  =========  ===
 551 kernel                            pipe   blocks-2   blocks-4   blocks-6   blocks-8  blocks-10  blocks-15  blocks-16  blocks-20  blocks-25  blocks-32  RFM
 552 =========================    =========  =========  =========  =========  =========  =========  =========  =========  =========  =========  =========  ===
 553 blk-push-aos                    113.01      93.99     108.98     114.65     117.87     119.47     124.95     122.46     129.29     123.87     133.01  197
 554 blk-push-soa                    100.21      98.87     103.63     105.56     107.02     107.27     111.61     109.83     116.16     110.51     110.29  197
 555 blk-pull-aos                    118.45     102.54     114.12     117.82     122.69     124.31     130.58     127.85     135.72     129.65     139.94  197
 556 blk-pull-soa                     82.60      83.36      87.13      88.39      88.84      88.96      92.48      90.93      95.79      91.92      98.64  197
 557 aa-aos                          171.32     125.43     147.73     157.70     163.35     167.25     175.39     174.20     182.54     173.67     187.76  308
 558 aa-soa                          180.85     152.39     165.84     152.59     171.90     175.76     184.94     182.34     189.43     180.30     193.54  308
 559 aa-vec-soa                      208.03     181.51     195.86     203.41     209.08     212.34     224.05     219.49     234.31     225.92     245.22  308
 560 list-push-aos                   158.81     164.67     162.93     163.05     165.22     164.31     164.66     160.78     164.07     165.19     164.06  177
 561 list-push-soa                   134.60     110.44     110.17     132.01     132.95     133.46     134.37     134.33     135.12     134.91     137.87  177
 562 list-pull-aos                   169.61     170.03     170.89     170.90     171.20     171.60     172.09     171.95     169.48     172.08     171.02  177
 563 list-pull-soa                   120.50     116.73     118.62     118.00     120.99     118.15     117.17     121.41     120.83     120.00     118.74  177
 564 list-pull-split-nt-1s-soa       225.59     224.18     225.10     226.34     226.01     230.37     227.50     228.42     227.39     231.65     227.35  246
 565 list-pull-split-nt-2s-soa       219.20     214.63     217.61     218.13     219.07     221.01     219.88     220.09     220.62     221.68     220.58  246
 566 list-aa-aos                     241.39     239.27     239.53     242.56     242.46     243.00     242.91     242.46     241.24     242.96     241.52  275
 567 list-aa-soa                     273.73     268.49     268.48     271.79     275.29     274.56     277.18     272.67     274.21     275.24     278.21  275
 568 list-aa-ria-soa                 288.42     261.89     273.26     284.84     283.88     288.29     290.72     289.81     293.36     290.75     292.93  308
 569 list-aa-pv-soa                  303.35     267.21     289.18     294.96     294.36     298.16     300.45     301.71     302.37     302.88     304.46  308
 570 =========================    =========  =========  =========  =========  =========  =========  =========  =========  =========  =========  =========  ===
 571
 572 Licence
 573 =======
 574
 575 The Lattice Boltzmann Benchmark Kernels are licensed under GPLv3.
 576
 577
 578 Acknowledgements
 579 ================
 580
 581 This work was funded by BMBF, grant no. 01IH15003A (project SKAMPY).
 582
 583 This work was funded by KONWHIR project OMI4PAPS.
 584
 585
 586 Bibliography
 587 ============
 588
 589 .. [ginzburg-2008]
 590  I. Ginzburg, F. Verhaeghe, and D. d'Humières.
 591  Two-relaxation-time lattice Boltzmann scheme: About parametrization, velocity, pressure and mixed boundary conditions.
 592  Commun. Comput. Phys., 3(2):427-478, 2008.
 593
 594 .. [williams-2008]
 595  S. Williams, A. Waterman, and D. Patterson.
 596  Roofline: an insightful visual performance model for multicore architectures.
 597  Commun. ACM, 52(4):65-76, Apr 2009. doi:10.1145/1498765.1498785
 598
 599
 600 .. |datetime| date:: %Y-%m-%d %H:%M
 601
 602 Document was generated at |datetime|.
 603