.. sectnum::
.. contents::
+Introduction
+============
+
+The lattice Boltzmann (LBM) benchmark kernels are a collection of LBM kernel
+implementations.
+
+**AS SUCH THE LBM BENCHMARK KERNELS ARE NO FULLY EQUIPPED CFD SOLVER AND SOLELY
+SERVES THE PURPOSE OF STUDYING POSSIBLE PERFORMANCE OPTIMIZATIONS AND/OR
+EXPERIMENTS.**
+
+Currently all kernels utilize a D3Q19 discretization and the
+two-relaxation-time (TRT) collision operator [ginzburg-2008]_.
+All operations are carried out in double precision arithmetic.
+
Compilation
===========
The benchmark framework currently supports only Linux systems and the GCC and
Intel compilers. Every other configuration probably requires adjustment inside
-the code and the makefiles. Further some code might be platform or at least
+the code and the makefiles. Furthermore some code might be platform or at least
POSIX specific.
The benchmark can be build via ``make`` from the ``src`` subdirectory. This will
Binaries are located under the ``bin`` subdirectory and will have different names
depending on compiler and build configuration.
+Compilation can target debug or release builds. Combined with both build types
+verification can be enabled, which increases the runtime and hence is not
+suited for benchmarking.
+
+
Debug and Verification
----------------------
Please note that the generated binary will therefore
exhibit a poor performance.
+
+Release and Verification
+------------------------
+
+Verification with the debug builds can be extremely slow. Hence verification
+capabilities can be build with release builds: ::
+
+ make BENCHMARK=off
+
+
Benchmarking
------------
make
As default ``BENCHMARK=on`` and ``BUILD=release`` is set, where
-BUILD=release turns optimizations on and ``BENCHMARK=on`` disables
+``BUILD=release`` turns optimizations on and ``BENCHMARK=on`` disables
verfification, statistics, and VTK output.
-Release and Verification
-------------------------
-
-Verification with the debug builds can be extremely slow. Hence verification
-capabilities can be build with release builds: ::
-
- make BENCHMARK=off
+See Options Summary below for further description of options which can be
+applied, e.g. TARCH as well as the Benchmarking section.
Compilers
---------
Options Summary
---------------
-Options that can be specified when building the framework with make:
+Options that can be specified when building the suite with make:
============= ======================= ============ ==========================================================
name values default description
-------------- ----------------------- ------------ ----------------------------------------------------------
+============= ======================= ============ ==========================================================
BENCHMARK on, off on If enabled, disables VERIFICATION, STATISTICS, VTK_OUTPUT. If disabled enables the three former options.
-BUILD debug, release release No optimization, debug symbols, DEBUG defined.
+BUILD debug, release release debug: no optimization, debug symbols, DEBUG defined. release: optimizations enabled.
CONFIG linux-gcc, linux-intel linux-intel Select GCC or Intel compiler.
-ISA avx, sse avx Determines which ISA extension is used for macro definitions. This is *not* the architecture the compiler generates code for.
+ISA avx, sse avx Determines which ISA extension is used for macro definitions of the intrinsics. This is *not* the architecture the compiler generates code for.
OPENMP on, off on OpenMP, i.\,e.\. threading support.
STATISTICS on, off off View statistics, like density etc, during simulation.
TARCH -- -- Via TARCH the architecture the compiler generates code for can be overridden. The value depends on the chosen compiler.
$ bin/lbmbenchk-linux-intel-release -verfiy -dims 32x32x32
-Kernel specific parameters can be opatained via selecting the specific kernel
+Kernel specific parameters can be obtained via selecting the specific kernel
and passing ``-h`` as parameter: ::
$ bin/lbmbenchk-linux-intel-release -kernel kernel-name -- -h
- ``BUILD=release`` (default if not overriden) and
- the correct ISA for macros is used, selected via ``ISA`` and
- use ``TARCH`` to specify the architecture the compiler generates code for.
+
+Intel Compiler
+--------------
+
+For the Intel compiler one can specify depending on the target ISA extension:
+
+- AVX: ``TARCH=-xAVX``
+- AVX2 and FMA: ``TARCH=-xCORE-AVX2,-fma``
+- AVX512: ``TARCH=-xCORE-AVX512``
+- KNL: ``TARCH=-xMIC-AVX512``
+
+Compiling for an architecture supporting AVX (Sandy Bridge, Ivy Bridge): ::
+
+ make ISA=avx TARCH=-xAVX
+
+
+Compiling for an architecture supporting AVX2 (Haswell, Broadwell): ::
+
+ make ISA=avx TARCH=-xCORE-AVX2,-fma
+
+WARNING: ISA is here still set to ``avx`` as currently we have the FMA intrinsics not
+implemented. This might change in the future.
+
+
+Compiling for an architecture supporting AVX-512 (Skylake): ::
+
+ make ISA=avx TARCH=-xCORE-AVX512
+
+WARNING: ISA is here still set to ``avx`` as currently we have no implementation for the
+AVX512 intrinsics. This might change in the future.
+
+
+Pinning
+-------
During benchmarking pinning should be used via the ``-pin`` parameter. Running
-a benchmark with 10 threads an pin them to the first 10 cores works like ::
+a benchmark with 10 threads and pin them to the first 10 cores works like ::
$ bin/lbmbenchk-linux-intel-release ... -t 10 -pin $(seq -s , 0 9)
-Things the binary does nor check or controll:
+
+General Remarks
+---------------
+
+Things the binary does nor check or control:
- transparent huge pages: when allocating memory small 4 KiB pages might be
replaced with larger ones. This is in general a good thing, but if this is
used is already full memory might be allocated in a remote domain. Accesses
to remote domains typically have a higher latency and lower bandwidth.
-- System load: interference with other application, espcially on desktop
+- System load: interference with other application, especially on desktop
systems should be avoided.
- Padding: For SoA based kernels the number of (fluid) nodes is automatically
Geometries
==========
-TODO: supported geometries: channel, pipe, blocks
-
-
-Results
-=======
-
-TODO
-
+TODO: supported geometries: channel, pipe, blocks, fluid
+
+
+Performance Results
+===================
+
+The sections lists performance values measured on several machines for
+different kernels and geometries.
+The **RFM** column denotes the expected performance as predicted by the
+Roofline performance model [williams-2008]_.
+For performance prediction of each kernel a memory bandwidth benchmark is used
+which mimics the kernels memory access pattern and the kernel's loop balance
+(see [kernels]_ for details).
+
+Haswell, Intel Xeon E5-2695 v3
+------------------------------
+
+- Haswell architecture, AVX2, FMA
+- 14 cores, 2,3 GHz
+- 2 x 7 cores in cluster-on-die (CoD) mode enabled
+- SMT enabled
+
+memory bandwidth:
+
+- copy-19 47.3 GB/s
+- copy-19-nt-sl 47.1 GB/s
+- update-19 44.0 GB/s
+
+geometry dimensions: 500x100x100
+
+========================= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= =====
+kernel pipe blocks-2 blocks-4 blocks-6 blocks-8 blocks-10 blocks-15 blocks-16 blocks-20 blocks-25 blocks-32 RFM
+========================= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= =====
+blk-push-aos 58.82 49.85 57.34 59.90 61.37 62.17 65.30 64.00 67.54 64.46 69.69 104
+blk-push-soa 32.32 33.46 34.02 34.64 35.06 35.04 36.31 35.44 37.20 35.14 37.95 104
+blk-pull-aos 56.97 51.41 56.09 57.92 59.98 59.83 63.37 61.55 65.50 63.11 67.02 104
+blk-pull-soa 49.29 46.23 47.50 51.97 51.27 49.52 55.23 53.13 54.50 49.79 57.90 104
+aa-aos 91.35 66.14 76.80 84.76 83.63 91.36 93.46 92.62 93.91 92.25 92.93 145
+aa-soa 75.51 65.68 70.94 71.36 73.83 75.46 74.84 79.48 83.28 77.70 82.72 145
+aa-vec-soa 93.85 83.44 91.58 93.96 94.35 96.62 101.76 96.72 106.37 102.60 110.28 145
+list-push-aos 80.29 80.97 80.95 81.10 81.37 82.44 81.77 81.49 80.72 81.93 80.93 83
+list-push-soa 47.52 42.65 45.28 46.64 43.46 40.59 44.94 46.55 41.53 45.98 44.86 83
+list-pull-aos 85.30 82.97 86.43 83.42 86.33 83.70 86.43 83.77 83.10 85.89 84.44 83
+list-pull-soa 62.12 63.61 63.28 61.32 66.72 62.65 64.82 60.49 58.01 64.46 62.52 83
+list-pull-split-nt-1s-soa 121.35 113.77 115.29 113.54 117.00 116.46 114.78 114.54 110.83 112.67 117.85 125
+list-pull-split-nt-2s-soa 118.09 110.48 112.55 113.18 113.44 111.85 109.27 114.41 110.28 111.78 113.74 125
+list-aa-aos 121.28 118.63 119.00 118.50 121.99 119.11 118.83 121.47 121.62 126.18 120.12 129
+list-aa-soa 126.34 116.90 129.45 127.12 129.41 121.42 126.19 126.76 126.70 124.40 125.22 129
+list-aa-ria-soa 133.68 121.82 126.04 128.46 131.15 132.25 128.78 133.50 126.69 124.40 130.37 145
+list-aa-pv-soa 146.22 124.39 130.73 136.29 137.61 131.21 138.65 138.78 127.02 132.40 138.37 145
+========================= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= =====
+
+
+Broadwell, Intel Xeon E5-2630 v4
+--------------------------------
+
+- Broadwell architecture, AVX2, FMA
+- 10 cores, 2.2 GHz
+- SMT disabled
+
+memory bandwidth:
+
+- copy-19 48.0 GB/s
+- copy-nt-sl-19 48.2 GB/s
+- update-19 51.1 GB/s
+
+geometry dimensions: 500x100x100
+
+========================= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= =======
+kernel pipe blocks-2 blocks-4 blocks-6 blocks-8 blocks-10 blocks-15 blocks-16 blocks-20 blocks-25 blocks-32 RFM
+========================= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= =======
+blk-push-aos 55.75 47.62 54.57 57.10 58.49 59.00 61.72 60.56 64.05 61.10 66.03 105
+blk-push-soa 30.06 31.09 32.13 32.54 32.74 32.72 33.81 33.19 34.90 33.21 35.75 105
+blk-pull-aos 53.80 48.61 53.08 54.99 56.08 56.68 59.20 58.12 61.49 58.71 63.45 105
+blk-pull-soa 46.96 46.61 48.84 49.70 50.33 50.46 52.36 51.39 54.20 51.61 55.71 105
+aa-aos 91.40 66.99 78.47 83.38 86.62 88.62 92.98 91.54 97.08 94.93 98.90 168
+aa-soa 83.01 69.96 75.85 77.72 79.01 79.29 82.38 80.11 85.70 83.91 87.69 168
+aa-vec-soa 112.03 96.52 105.32 109.76 112.55 113.82 120.55 118.37 126.30 121.37 131.94 168
+list-push-aos 75.13 74.18 75.20 75.42 75.24 75.99 75.80 75.80 75.54 76.22 76.21 97
+list-push-soa 40.99 38.14 39.00 38.89 38.89 39.67 39.87 39.28 39.35 40.08 40.13 97
+list-pull-aos 82.07 82.88 83.29 83.09 83.32 83.49 82.82 82.88 83.32 82.60 82.93 97
+list-pull-soa 62.07 60.40 61.89 61.39 62.43 60.90 60.48 62.80 62.50 61.10 60.38 97
+list-pull-split-nt-1s-soa 125.81 120.60 121.96 122.34 122.86 123.53 123.64 123.67 125.94 124.09 123.69 128
+list-pull-split-nt-2s-soa 122.79 117.16 118.86 119.16 119.56 119.99 120.01 120.03 122.64 120.57 120.39 128
+list-aa-aos 128.13 127.41 129.31 129.07 129.79 129.63 129.67 129.94 129.12 128.41 129.72 150
+list-aa-soa 141.60 139.78 141.58 142.16 141.94 141.31 142.37 142.25 142.43 141.40 142.26 150
+list-aa-ria-soa 141.82 134.88 140.15 140.72 141.67 140.51 141.18 141.29 142.97 141.94 143.25 168
+list-aa-pv-soa 164.79 140.95 159.24 161.78 162.40 163.04 164.69 164.38 165.11 165.75 166.09 168
+========================= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= =======
+
+
+Skylake, Intel Xeon Gold 6148
+-----------------------------
+
+- Skylake architecture, AVX2, FMA, AVX512
+- 20 cores, 2.4 GHz
+- SMT enabled
+
+memory bandwidth:
+
+- copy-19 89.7 GB/s
+- copy-19-nt-sl 92.4 GB/s
+- update-19 93.6 GB/s
+
+geometry dimensions: 500x100x100
+
+
+========================= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ===
+kernel pipe blocks-2 blocks-4 blocks-6 blocks-8 blocks-10 blocks-15 blocks-16 blocks-20 blocks-25 blocks-32 RFM
+========================= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ===
+blk-push-aos 113.01 93.99 108.98 114.65 117.87 119.47 124.95 122.46 129.29 123.87 133.01 197
+blk-push-soa 100.21 98.87 103.63 105.56 107.02 107.27 111.61 109.83 116.16 110.51 110.29 197
+blk-pull-aos 118.45 102.54 114.12 117.82 122.69 124.31 130.58 127.85 135.72 129.65 139.94 197
+blk-pull-soa 82.60 83.36 87.13 88.39 88.84 88.96 92.48 90.93 95.79 91.92 98.64 197
+aa-aos 171.32 125.43 147.73 157.70 163.35 167.25 175.39 174.20 182.54 173.67 187.76 308
+aa-soa 180.85 152.39 165.84 152.59 171.90 175.76 184.94 182.34 189.43 180.30 193.54 308
+aa-vec-soa 208.03 181.51 195.86 203.41 209.08 212.34 224.05 219.49 234.31 225.92 245.22 308
+list-push-aos 158.81 164.67 162.93 163.05 165.22 164.31 164.66 160.78 164.07 165.19 164.06 177
+list-push-soa 134.60 110.44 110.17 132.01 132.95 133.46 134.37 134.33 135.12 134.91 137.87 177
+list-pull-aos 169.61 170.03 170.89 170.90 171.20 171.60 172.09 171.95 169.48 172.08 171.02 177
+list-pull-soa 120.50 116.73 118.62 118.00 120.99 118.15 117.17 121.41 120.83 120.00 118.74 177
+list-pull-split-nt-1s-soa 225.59 224.18 225.10 226.34 226.01 230.37 227.50 228.42 227.39 231.65 227.35 246
+list-pull-split-nt-2s-soa 219.20 214.63 217.61 218.13 219.07 221.01 219.88 220.09 220.62 221.68 220.58 246
+list-aa-aos 241.39 239.27 239.53 242.56 242.46 243.00 242.91 242.46 241.24 242.96 241.52 275
+list-aa-soa 273.73 268.49 268.48 271.79 275.29 274.56 277.18 272.67 274.21 275.24 278.21 275
+list-aa-ria-soa 288.42 261.89 273.26 284.84 283.88 288.29 290.72 289.81 293.36 290.75 292.93 308
+list-aa-pv-soa 303.35 267.21 289.18 294.96 294.36 298.16 300.45 301.71 302.37 302.88 304.46 308
+========================= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ===
Licence
=======
This work was funded by KONWHIR project OMI4PAPS.
+Bibliography
+============
+
+.. [ginzburg-2008]
+ I. Ginzburg, F. Verhaeghe, and D. d'Humières.
+ Two-relaxation-time lattice Boltzmann scheme: About parametrization, velocity, pressure and mixed boundary conditions.
+ Commun. Comput. Phys., 3(2):427-478, 2008.
+
+.. [williams-2008]
+ S. Williams, A. Waterman, and D. Patterson.
+ Roofline: an insightful visual performance model for multicore architectures.
+ Commun. ACM, 52(4):65-76, Apr 2009. doi:10.1145/1498765.1498785
+
.. |datetime| date:: %Y-%m-%d %H:%M