X-Git-Url: http://git.rrze.uni-erlangen.de/gitweb/?p=LbmBenchmarkKernelsPublic.git;a=blobdiff_plain;f=doc%2Fmain.rst;fp=doc%2Fmain.rst;h=0eaa5ede892a7ace3ef476479acc3dfbe58ebe3b;hp=528b339be1c37a3bd522bba6256905b16d916428;hb=0095f461c30075a883df0265a7b831061ee7ebee;hpb=e3f82424829ebb623343ce0092238f83b4a1b8c2 diff --git a/doc/main.rst b/doc/main.rst index 528b339..0eaa5ed 100644 --- a/doc/main.rst +++ b/doc/main.rst @@ -35,12 +35,26 @@ LBM Benchmark Kernels Documentation .. sectnum:: .. contents:: +Introduction +============ + +The lattice Boltzmann (LBM) benchmark kernels are a collection of LBM kernel +implementations. + +**AS SUCH THE LBM BENCHMARK KERNELS ARE NO FULLY EQUIPPED CFD SOLVER AND SOLELY +SERVES THE PURPOSE OF STUDYING POSSIBLE PERFORMANCE OPTIMIZATIONS AND/OR +EXPERIMENTS.** + +Currently all kernels utilize a D3Q19 discretization and the +two-relaxation-time (TRT) collision operator [ginzburg-2008]_. +All operations are carried out in double precision arithmetic. + Compilation =========== The benchmark framework currently supports only Linux systems and the GCC and Intel compilers. Every other configuration probably requires adjustment inside -the code and the makefiles. Further some code might be platform or at least +the code and the makefiles. Furthermore some code might be platform or at least POSIX specific. The benchmark can be build via ``make`` from the ``src`` subdirectory. This will @@ -49,6 +63,11 @@ generate one binary which hosts all implemented benchmark kernels. Binaries are located under the ``bin`` subdirectory and will have different names depending on compiler and build configuration. +Compilation can target debug or release builds. Combined with both build types +verification can be enabled, which increases the runtime and hence is not +suited for benchmarking. + + Debug and Verification ---------------------- @@ -69,6 +88,16 @@ Specifying ``BENCHMARK=off`` turns on verification Please note that the generated binary will therefore exhibit a poor performance. + +Release and Verification +------------------------ + +Verification with the debug builds can be extremely slow. Hence verification +capabilities can be build with release builds: :: + + make BENCHMARK=off + + Benchmarking ------------ @@ -77,16 +106,11 @@ To generate a binary for benchmarking run make with :: make As default ``BENCHMARK=on`` and ``BUILD=release`` is set, where -BUILD=release turns optimizations on and ``BENCHMARK=on`` disables +``BUILD=release`` turns optimizations on and ``BENCHMARK=on`` disables verfification, statistics, and VTK output. -Release and Verification ------------------------- - -Verification with the debug builds can be extremely slow. Hence verification -capabilities can be build with release builds: :: - - make BENCHMARK=off +See Options Summary below for further description of options which can be +applied, e.g. TARCH as well as the Benchmarking section. Compilers --------- @@ -116,15 +140,15 @@ all object and dependency files are deleted. Options Summary --------------- -Options that can be specified when building the framework with make: +Options that can be specified when building the suite with make: ============= ======================= ============ ========================================================== name values default description -------------- ----------------------- ------------ ---------------------------------------------------------- +============= ======================= ============ ========================================================== BENCHMARK on, off on If enabled, disables VERIFICATION, STATISTICS, VTK_OUTPUT. If disabled enables the three former options. -BUILD debug, release release No optimization, debug symbols, DEBUG defined. +BUILD debug, release release debug: no optimization, debug symbols, DEBUG defined. release: optimizations enabled. CONFIG linux-gcc, linux-intel linux-intel Select GCC or Intel compiler. -ISA avx, sse avx Determines which ISA extension is used for macro definitions. This is *not* the architecture the compiler generates code for. +ISA avx, sse avx Determines which ISA extension is used for macro definitions of the intrinsics. This is *not* the architecture the compiler generates code for. OPENMP on, off on OpenMP, i.\,e.\. threading support. STATISTICS on, off off View statistics, like density etc, during simulation. TARCH -- -- Via TARCH the architecture the compiler generates code for can be overridden. The value depends on the chosen compiler. @@ -175,7 +199,7 @@ iterations, etc, which can afterward be override, e.g.: :: $ bin/lbmbenchk-linux-intel-release -verfiy -dims 32x32x32 -Kernel specific parameters can be opatained via selecting the specific kernel +Kernel specific parameters can be obtained via selecting the specific kernel and passing ``-h`` as parameter: :: $ bin/lbmbenchk-linux-intel-release -kernel kernel-name -- -h @@ -299,13 +323,51 @@ created make sure the binary was compiled with: - ``BUILD=release`` (default if not overriden) and - the correct ISA for macros is used, selected via ``ISA`` and - use ``TARCH`` to specify the architecture the compiler generates code for. + +Intel Compiler +-------------- + +For the Intel compiler one can specify depending on the target ISA extension: + +- AVX: ``TARCH=-xAVX`` +- AVX2 and FMA: ``TARCH=-xCORE-AVX2,-fma`` +- AVX512: ``TARCH=-xCORE-AVX512`` +- KNL: ``TARCH=-xMIC-AVX512`` + +Compiling for an architecture supporting AVX (Sandy Bridge, Ivy Bridge): :: + + make ISA=avx TARCH=-xAVX + + +Compiling for an architecture supporting AVX2 (Haswell, Broadwell): :: + + make ISA=avx TARCH=-xCORE-AVX2,-fma + +WARNING: ISA is here still set to ``avx`` as currently we have the FMA intrinsics not +implemented. This might change in the future. + + +Compiling for an architecture supporting AVX-512 (Skylake): :: + + make ISA=avx TARCH=-xCORE-AVX512 + +WARNING: ISA is here still set to ``avx`` as currently we have no implementation for the +AVX512 intrinsics. This might change in the future. + + +Pinning +------- During benchmarking pinning should be used via the ``-pin`` parameter. Running -a benchmark with 10 threads an pin them to the first 10 cores works like :: +a benchmark with 10 threads and pin them to the first 10 cores works like :: $ bin/lbmbenchk-linux-intel-release ... -t 10 -pin $(seq -s , 0 9) -Things the binary does nor check or controll: + +General Remarks +--------------- + +Things the binary does nor check or control: - transparent huge pages: when allocating memory small 4 KiB pages might be replaced with larger ones. This is in general a good thing, but if this is @@ -326,7 +388,7 @@ Things the binary does nor check or controll: used is already full memory might be allocated in a remote domain. Accesses to remote domains typically have a higher latency and lower bandwidth. -- System load: interference with other application, espcially on desktop +- System load: interference with other application, especially on desktop systems should be avoided. - Padding: For SoA based kernels the number of (fluid) nodes is automatically @@ -378,14 +440,134 @@ like ``-pad 0+16`` would at a static padding of two nodes (one node = 8 b). Geometries ========== -TODO: supported geometries: channel, pipe, blocks - - -Results -======= - -TODO - +TODO: supported geometries: channel, pipe, blocks, fluid + + +Performance Results +=================== + +The sections lists performance values measured on several machines for +different kernels and geometries. +The **RFM** column denotes the expected performance as predicted by the +Roofline performance model [williams-2008]_. +For performance prediction of each kernel a memory bandwidth benchmark is used +which mimics the kernels memory access pattern and the kernel's loop balance +(see [kernels]_ for details). + +Haswell, Intel Xeon E5-2695 v3 +------------------------------ + +- Haswell architecture, AVX2, FMA +- 14 cores, 2,3 GHz +- 2 x 7 cores in cluster-on-die (CoD) mode enabled +- SMT enabled + +memory bandwidth: + +- copy-19 47.3 GB/s +- copy-19-nt-sl 47.1 GB/s +- update-19 44.0 GB/s + +geometry dimensions: 500x100x100 + +========================= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ===== +kernel pipe blocks-2 blocks-4 blocks-6 blocks-8 blocks-10 blocks-15 blocks-16 blocks-20 blocks-25 blocks-32 RFM +========================= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ===== +blk-push-aos 58.82 49.85 57.34 59.90 61.37 62.17 65.30 64.00 67.54 64.46 69.69 104 +blk-push-soa 32.32 33.46 34.02 34.64 35.06 35.04 36.31 35.44 37.20 35.14 37.95 104 +blk-pull-aos 56.97 51.41 56.09 57.92 59.98 59.83 63.37 61.55 65.50 63.11 67.02 104 +blk-pull-soa 49.29 46.23 47.50 51.97 51.27 49.52 55.23 53.13 54.50 49.79 57.90 104 +aa-aos 91.35 66.14 76.80 84.76 83.63 91.36 93.46 92.62 93.91 92.25 92.93 145 +aa-soa 75.51 65.68 70.94 71.36 73.83 75.46 74.84 79.48 83.28 77.70 82.72 145 +aa-vec-soa 93.85 83.44 91.58 93.96 94.35 96.62 101.76 96.72 106.37 102.60 110.28 145 +list-push-aos 80.29 80.97 80.95 81.10 81.37 82.44 81.77 81.49 80.72 81.93 80.93 83 +list-push-soa 47.52 42.65 45.28 46.64 43.46 40.59 44.94 46.55 41.53 45.98 44.86 83 +list-pull-aos 85.30 82.97 86.43 83.42 86.33 83.70 86.43 83.77 83.10 85.89 84.44 83 +list-pull-soa 62.12 63.61 63.28 61.32 66.72 62.65 64.82 60.49 58.01 64.46 62.52 83 +list-pull-split-nt-1s-soa 121.35 113.77 115.29 113.54 117.00 116.46 114.78 114.54 110.83 112.67 117.85 125 +list-pull-split-nt-2s-soa 118.09 110.48 112.55 113.18 113.44 111.85 109.27 114.41 110.28 111.78 113.74 125 +list-aa-aos 121.28 118.63 119.00 118.50 121.99 119.11 118.83 121.47 121.62 126.18 120.12 129 +list-aa-soa 126.34 116.90 129.45 127.12 129.41 121.42 126.19 126.76 126.70 124.40 125.22 129 +list-aa-ria-soa 133.68 121.82 126.04 128.46 131.15 132.25 128.78 133.50 126.69 124.40 130.37 145 +list-aa-pv-soa 146.22 124.39 130.73 136.29 137.61 131.21 138.65 138.78 127.02 132.40 138.37 145 +========================= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ===== + + +Broadwell, Intel Xeon E5-2630 v4 +-------------------------------- + +- Broadwell architecture, AVX2, FMA +- 10 cores, 2.2 GHz +- SMT disabled + +memory bandwidth: + +- copy-19 48.0 GB/s +- copy-nt-sl-19 48.2 GB/s +- update-19 51.1 GB/s + +geometry dimensions: 500x100x100 + +========================= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ======= +kernel pipe blocks-2 blocks-4 blocks-6 blocks-8 blocks-10 blocks-15 blocks-16 blocks-20 blocks-25 blocks-32 RFM +========================= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ======= +blk-push-aos 55.75 47.62 54.57 57.10 58.49 59.00 61.72 60.56 64.05 61.10 66.03 105 +blk-push-soa 30.06 31.09 32.13 32.54 32.74 32.72 33.81 33.19 34.90 33.21 35.75 105 +blk-pull-aos 53.80 48.61 53.08 54.99 56.08 56.68 59.20 58.12 61.49 58.71 63.45 105 +blk-pull-soa 46.96 46.61 48.84 49.70 50.33 50.46 52.36 51.39 54.20 51.61 55.71 105 +aa-aos 91.40 66.99 78.47 83.38 86.62 88.62 92.98 91.54 97.08 94.93 98.90 168 +aa-soa 83.01 69.96 75.85 77.72 79.01 79.29 82.38 80.11 85.70 83.91 87.69 168 +aa-vec-soa 112.03 96.52 105.32 109.76 112.55 113.82 120.55 118.37 126.30 121.37 131.94 168 +list-push-aos 75.13 74.18 75.20 75.42 75.24 75.99 75.80 75.80 75.54 76.22 76.21 97 +list-push-soa 40.99 38.14 39.00 38.89 38.89 39.67 39.87 39.28 39.35 40.08 40.13 97 +list-pull-aos 82.07 82.88 83.29 83.09 83.32 83.49 82.82 82.88 83.32 82.60 82.93 97 +list-pull-soa 62.07 60.40 61.89 61.39 62.43 60.90 60.48 62.80 62.50 61.10 60.38 97 +list-pull-split-nt-1s-soa 125.81 120.60 121.96 122.34 122.86 123.53 123.64 123.67 125.94 124.09 123.69 128 +list-pull-split-nt-2s-soa 122.79 117.16 118.86 119.16 119.56 119.99 120.01 120.03 122.64 120.57 120.39 128 +list-aa-aos 128.13 127.41 129.31 129.07 129.79 129.63 129.67 129.94 129.12 128.41 129.72 150 +list-aa-soa 141.60 139.78 141.58 142.16 141.94 141.31 142.37 142.25 142.43 141.40 142.26 150 +list-aa-ria-soa 141.82 134.88 140.15 140.72 141.67 140.51 141.18 141.29 142.97 141.94 143.25 168 +list-aa-pv-soa 164.79 140.95 159.24 161.78 162.40 163.04 164.69 164.38 165.11 165.75 166.09 168 +========================= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ======= + + +Skylake, Intel Xeon Gold 6148 +----------------------------- + +- Skylake architecture, AVX2, FMA, AVX512 +- 20 cores, 2.4 GHz +- SMT enabled + +memory bandwidth: + +- copy-19 89.7 GB/s +- copy-19-nt-sl 92.4 GB/s +- update-19 93.6 GB/s + +geometry dimensions: 500x100x100 + + +========================= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= === +kernel pipe blocks-2 blocks-4 blocks-6 blocks-8 blocks-10 blocks-15 blocks-16 blocks-20 blocks-25 blocks-32 RFM +========================= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= === +blk-push-aos 113.01 93.99 108.98 114.65 117.87 119.47 124.95 122.46 129.29 123.87 133.01 197 +blk-push-soa 100.21 98.87 103.63 105.56 107.02 107.27 111.61 109.83 116.16 110.51 110.29 197 +blk-pull-aos 118.45 102.54 114.12 117.82 122.69 124.31 130.58 127.85 135.72 129.65 139.94 197 +blk-pull-soa 82.60 83.36 87.13 88.39 88.84 88.96 92.48 90.93 95.79 91.92 98.64 197 +aa-aos 171.32 125.43 147.73 157.70 163.35 167.25 175.39 174.20 182.54 173.67 187.76 308 +aa-soa 180.85 152.39 165.84 152.59 171.90 175.76 184.94 182.34 189.43 180.30 193.54 308 +aa-vec-soa 208.03 181.51 195.86 203.41 209.08 212.34 224.05 219.49 234.31 225.92 245.22 308 +list-push-aos 158.81 164.67 162.93 163.05 165.22 164.31 164.66 160.78 164.07 165.19 164.06 177 +list-push-soa 134.60 110.44 110.17 132.01 132.95 133.46 134.37 134.33 135.12 134.91 137.87 177 +list-pull-aos 169.61 170.03 170.89 170.90 171.20 171.60 172.09 171.95 169.48 172.08 171.02 177 +list-pull-soa 120.50 116.73 118.62 118.00 120.99 118.15 117.17 121.41 120.83 120.00 118.74 177 +list-pull-split-nt-1s-soa 225.59 224.18 225.10 226.34 226.01 230.37 227.50 228.42 227.39 231.65 227.35 246 +list-pull-split-nt-2s-soa 219.20 214.63 217.61 218.13 219.07 221.01 219.88 220.09 220.62 221.68 220.58 246 +list-aa-aos 241.39 239.27 239.53 242.56 242.46 243.00 242.91 242.46 241.24 242.96 241.52 275 +list-aa-soa 273.73 268.49 268.48 271.79 275.29 274.56 277.18 272.67 274.21 275.24 278.21 275 +list-aa-ria-soa 288.42 261.89 273.26 284.84 283.88 288.29 290.72 289.81 293.36 290.75 292.93 308 +list-aa-pv-soa 303.35 267.21 289.18 294.96 294.36 298.16 300.45 301.71 302.37 302.88 304.46 308 +========================= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= === Licence ======= @@ -401,6 +583,19 @@ This work was funded by BMBF, grant no. 01IH15003A (project SKAMPY). This work was funded by KONWHIR project OMI4PAPS. +Bibliography +============ + +.. [ginzburg-2008] + I. Ginzburg, F. Verhaeghe, and D. d'Humières. + Two-relaxation-time lattice Boltzmann scheme: About parametrization, velocity, pressure and mixed boundary conditions. + Commun. Comput. Phys., 3(2):427-478, 2008. + +.. [williams-2008] + S. Williams, A. Waterman, and D. Patterson. + Roofline: an insightful visual performance model for multicore architectures. + Commun. ACM, 52(4):65-76, Apr 2009. doi:10.1145/1498765.1498785 + .. |datetime| date:: %Y-%m-%d %H:%M