From 0095f461c30075a883df0265a7b831061ee7ebee Mon Sep 17 00:00:00 2001
From: Markus Wittmann
Date: Tue, 21 Nov 2017 15:46:25 +0100
Subject: [PATCH] update README and doc

---
 README             |   68 +--
 doc/html/main.html | 1072 +++++++++++++++++++++++++++++++++++++++++---
 doc/main.rst       |  245 ++++++++--
 3 files changed, 1265 insertions(+), 120 deletions(-)

diff --git a/README b/README
index bf580c8..d548b38 100644
--- a/README
+++ b/README
@@ -1,34 +1,34 @@
-# --------------------------------------------------------------------------
-#
-# Copyright
-# Markus Wittmann, 2016-2017
-# RRZE, University of Erlangen-Nuremberg, Germany
-# markus.wittmann -at- fau.de or hpc -at- rrze.fau.de
-#
-# Viktor Haag, 2016
-# LSS, University of Erlangen-Nuremberg, Germany
-#
-# This file is part of the Lattice Boltzmann Benchmark Kernels (LbmBenchKernels).
-#
-# LbmBenchKernels is free software: you can redistribute it and/or modify
-# it under the terms of the GNU General Public License as published by
-# the Free Software Foundation, either version 3 of the License, or
-# (at your option) any later version.
-#
-# LbmBenchKernels is distributed in the hope that it will be useful,
-# but WITHOUT ANY WARRANTY; without even the implied warranty of
-# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
-# GNU General Public License for more details.
-#
-# You should have received a copy of the GNU General Public License
-# along with LbmBenchKernels. If not, see .
-#
-# --------------------------------------------------------------------------
-
-# See licence of src/BoostJoin.h in the top of src/BoostJoin.h, which stems from
-# the boost projects and is licenced under the Boost licence.
-
-See doc subdirectory for a rudimentary documentation.
-
-This work was funded by BMBF, grant no. 01IH15003A (project SKAMPY)
-This work was funded by KONWHIR project OMI4PAPS
+Lattice Boltzmann Benchmark Kernels
+===================================
+
+Simple benchmark suite for LBM kernels and performance experiments from the HPC
+group of Erlangen Regional Computing Center (RRZE/HPC) [1] and the Chair for
+System Simulation of FAU (LSS) [2].
+
+See doc/main.rst or doc/html/main.html subdirectories for rudimentary
+documentation.
+
+
+
+LICENCE
+-------
+LBM benchmark kernels are licensed under GPLv3.
+
+See licence of src/BoostJoin.h in the top of src/BoostJoin.h, which stems from
+the boost projects and is licenced under the Boost licence.
+
+
+
+ACKNOWLEDGEMENTS
+----------------
+
+This work was funded by:
+
+ - BMBF, grant no. 01IH15003A (project SKAMPY)
+ - KONWHIR project OMI4PAPS
+
+
+[1] https://hpc.fau.de/
+[2] https://www10.informatik.uni-erlangen.de/en/
+
+
diff --git a/doc/html/main.html b/doc/html/main.html
index 511f6b2..99f4cb8 100644
--- a/doc/html/main.html
+++ b/doc/html/main.html
@@ -420,42 +420,66 @@ tr:nth-child(odd) {
+
+

1   Introduction

+

The lattice Boltzmann (LBM) benchmark kernels are a collection of LBM kernel +implementations.

+

AS SUCH THE LBM BENCHMARK KERNELS ARE NOT A FULLY EQUIPPED CFD SOLVER AND SOLELY +SERVE THE PURPOSE OF STUDYING POSSIBLE PERFORMANCE OPTIMIZATIONS AND/OR +EXPERIMENTS.

+

Currently all kernels utilize a D3Q19 discretization and the +two-relaxation-time (TRT) collision operator [ginzburg-2008]. +All operations are carried out in double precision arithmetic.
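
To give a feeling for what D3Q19 in double precision means for memory, the following sketch estimates the size of one full set of particle distribution functions (PDFs). It assumes that every node of the bounding box stores 19 doubles and ignores padding, adjacency lists, and any second lattice copy a kernel may keep. The function name `lattice_bytes` is hypothetical and not part of the benchmark; the 500x100x100 dimensions are the ones used for the performance results.

```python
# Rough memory estimate for one D3Q19 lattice in double precision.
# Assumption: every node of the bounding box stores all 19 PDFs;
# padding, index lists and second lattice copies are ignored.

Q = 19                 # discrete velocities of D3Q19
BYTES_PER_DOUBLE = 8   # double precision


def lattice_bytes(nx: int, ny: int, nz: int) -> int:
    """Bytes needed to store the PDFs of an nx x ny x nz lattice once."""
    return nx * ny * nz * Q * BYTES_PER_DOUBLE


if __name__ == "__main__":
    size = lattice_bytes(500, 100, 100)
    print(f"{size / 1e6:.0f} MB per lattice copy")
```

Such a lattice is far larger than any cache of the machines listed in the results, so under this assumption the kernels are expected to stream their data from main memory.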

+
-

1   Compilation

+

2   Compilation

The benchmark framework currently supports only Linux systems and the GCC and Intel compilers. Every other configuration probably requires adjustment inside -the code and the makefiles. Further some code might be platform or at least +the code and the makefiles. Furthermore some code might be platform or at least POSIX specific.

The benchmark can be built via make from the src subdirectory. This will generate one binary which hosts all implemented benchmark kernels.

Binaries are located under the bin subdirectory and will have different names depending on compiler and build configuration.

+

Compilation can target debug or release builds. With either build type +verification can be enabled, which increases the runtime and hence is not +suited for benchmarking.

-

1.1   Debug and Verification

+

2.1   Debug and Verification

 make BUILD=debug BENCHMARK=off
 
@@ -470,32 +494,34 @@ binary will be found in the bin subdirectory a

Please note that the generated binary will therefore exhibit poor performance.

+
+

2.2   Release and Verification

+

Verification with the debug builds can be extremely slow. Hence verification +capabilities can be built with release builds:

+
+make BENCHMARK=off
+
+
-

1.2   Benchmarking

+

2.3   Benchmarking

To generate a binary for benchmarking run make with

 make
 

By default BENCHMARK=on and BUILD=release are set, where -BUILD=release turns optimizations on and BENCHMARK=on disables +BUILD=release turns optimizations on and BENCHMARK=on disables verification, statistics, and VTK output.

-
-
-

1.3   Release and Verification

-

Verification with the debug builds can be extremely slow. Hence verification -capabilities can be build with release builds:

-
-make BENCHMARK=off
-
+

See the Options Summary below and the Benchmarking section for a description of further options which can be +applied, e.g. TARCH.

-

1.4   Compilers

+

2.4   Compilers

Currently only the GCC and Intel compilers under Linux are supported. The configuration is selected via CONFIG=linux-gcc or CONFIG=linux-intel.

-

1.5   Cleaning

+

2.5   Cleaning

For each configuration and build (debug/release) a subdirectory under the src/obj directory is created where the dependency and object files are stored. @@ -510,21 +536,23 @@ make clean-all

all object and dependency files are deleted.

-

1.6   Options Summary

-

Options that can be specified when building the framework with make:

+

2.6   Options Summary

+

Options that can be specified when building the suite with make:

---+++ - - - - - + + + + + + + @@ -533,7 +561,7 @@ make clean-all - + @@ -543,7 +571,7 @@ make clean-all - + @@ -575,7 +603,7 @@ make clean-all
-

2   Invocation

+

3   Invocation

Running the binary prints, besides the GPL licence header, a line like the following:

 LBM Benchmark Kernels 0.1, compiled Jul  5 2017 21:59:22, type: verification
@@ -586,7 +614,7 @@ LBM Benchmark Kernels 0.1, compiled Jul  5 2017 21:59:22, type: benchmark
 

if verification was disabled during compilation.

-

2.1   Command Line Parameters

+

3.1   Command Line Parameters

Running the binary with -h lists all available parameters:

 Usage:
@@ -613,7 +641,7 @@ iterations, etc, which can afterward be override, e.g.:

 $ bin/lbmbenchk-linux-intel-release -verfiy -dims 32x32x32
 
-

Kernel specific parameters can be opatained via selecting the specific kernel +

Kernel specific parameters can be obtained via selecting the specific kernel and passing -h as parameter:

 $ bin/lbmbenchk-linux-intel-release -kernel kernel-name -- -h
@@ -651,7 +679,7 @@ Available kernels to benchmark:
 
-

2.2   Kernels

+

3.2   Kernels

The following list briefly describes the available kernels:

  • push-soa/push-aos/pull-soa/pull-aos: @@ -862,8 +890,8 @@ during each run.

-name       values                  default  description
+name       values                  default  description
BENCHMARK  on, off                 on
BUILD      debug, release          release
-No optimization, debug symbols, DEBUG defined.
+debug: no optimization, debug symbols, DEBUG defined. release: optimizations enabled.
CONFIG     linux-gcc, linux-intel
ISA        avx, sse                avx
-Determines which ISA extension is used for macro definitions. This is not the architecture the compiler generates code for.
+Determines which ISA extension is used for macro definitions of the intrinsics. This is not the architecture the compiler generates code for.
OPENMP     on, off
-
-

3   Benchmarking

+
+

4   Benchmarking

Correct benchmarking is a nontrivial task. Whenever benchmark results should be created make sure the binary was compiled with:

    @@ -872,12 +900,43 @@ created make sure the binary was compiled with:

  • the correct ISA for macros is used, selected via ISA and
  • use TARCH to specify the architecture the compiler generates code for.
+
+

4.1   Intel Compiler

+

For the Intel compiler, TARCH can be set depending on the target ISA extension:

+
    +
  • AVX: TARCH=-xAVX
  • +
  • AVX2 and FMA: TARCH=-xCORE-AVX2,-fma
  • +
  • AVX512: TARCH=-xCORE-AVX512
  • +
  • KNL: TARCH=-xMIC-AVX512
  • +
+

Compiling for an architecture supporting AVX (Sandy Bridge, Ivy Bridge):

+
+make ISA=avx TARCH=-xAVX
+
+

Compiling for an architecture supporting AVX2 (Haswell, Broadwell):

+
+make ISA=avx TARCH=-xCORE-AVX2,-fma
+
+

WARNING: ISA is still set to avx here, as the FMA intrinsics are currently not +implemented. This might change in the future.

+

Compiling for an architecture supporting AVX-512 (Skylake):

+
+make ISA=avx TARCH=-xCORE-AVX512
+
+

WARNING: ISA is here still set to avx as currently we have no implementation for the +AVX512 intrinsics. This might change in the future.
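
When several machines are benchmarked from one script, the ISA/TARCH pairs above can be kept in a small table. The following is only a sketch: the pairs are copied from the make invocations shown in this section, while `INTEL_TARGETS`, its keys, and `make_command` are hypothetical names that do not exist in the build system. KNL is left out because no ISA setting is given for it here.

```python
# Sketch: assemble the make command line for an Intel-compiler target.
# The ISA/TARCH values are taken from the documentation; the dictionary
# and the helper are hypothetical and only illustrate the mapping.

INTEL_TARGETS = {
    "avx":    ("avx", "-xAVX"),             # Sandy Bridge, Ivy Bridge
    "avx2":   ("avx", "-xCORE-AVX2,-fma"),  # Haswell, Broadwell
    "avx512": ("avx", "-xCORE-AVX512"),     # Skylake
}


def make_command(target: str) -> str:
    """Return the make invocation for one of the targets listed above."""
    isa, tarch = INTEL_TARGETS[target]
    return f"make ISA={isa} TARCH={tarch}"


if __name__ == "__main__":
    for name in INTEL_TARGETS:
        print(f"{name:7s} {make_command(name)}")
```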

+
+
+

4.2   Pinning

During benchmarking pinning should be used via the -pin parameter. Running -a benchmark with 10 threads an pin them to the first 10 cores works like

+a benchmark with 10 threads and pin them to the first 10 cores works like

 $ bin/lbmbenchk-linux-intel-release ... -t 10 -pin $(seq -s , 0 9)
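
The `$(seq -s , 0 9)` expression only expands to the comma-separated core list `0,1,2,3,4,5,6,7,8,9`. When benchmark runs are driven from a script, the same list can be generated as in the sketch below. `pin_list` is a hypothetical helper, and whether consecutive core IDs map to physical cores of one socket depends on the numbering of the machine at hand.

```python
# Sketch: build the argument for -pin as a comma-separated list of core IDs.
# Assumption: threads are pinned to consecutive core IDs.

def pin_list(first_core: int, n_threads: int) -> str:
    """Comma-separated core IDs for n_threads threads starting at first_core."""
    return ",".join(str(core) for core in range(first_core, first_core + n_threads))


if __name__ == "__main__":
    # Same list as $(seq -s , 0 9) in the command above.
    print(pin_list(0, 10))
```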
 
-

Things the binary does nor check or controll:

+
+
+

4.3   General Remarks

+

Things the binary does not check or control:

  • transparent huge pages: when allocating memory small 4 KiB pages might be replaced with larger ones. This is in general a good thing, but if this is @@ -895,7 +954,7 @@ means the memory will be placed at the NUMA domain the touching core is associated with. If a different policy is in place or the NUMA domain to be used is already full memory might be allocated in a remote domain. Accesses to remote domains typically have a higher latency and lower bandwidth.
  • -
  • System load: interference with other application, espcially on desktop +
  • System load: interference with other applications, especially on desktop systems, should be avoided.
  • Padding: For SoA based kernels the number of (fluid) nodes is automatically adjusted so that no cache or TLB thrashing should occur. The parameters are @@ -905,8 +964,9 @@ padding section.
  • function for different ISA extensions. Make sure the code you might think is executed is actually the code which is executed.
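
Since the binary does not check these settings, it can be useful to record them alongside each measurement. As one example, the sketch below reads the transparent huge page mode from the Linux sysfs file `/sys/kernel/mm/transparent_hugepage/enabled`, whose content marks the active mode with brackets, e.g. `always [madvise] never`. The helper name `active_thp_mode` is hypothetical, and the file is assumed to be absent on kernels built without THP support.

```python
# Sketch: report the active transparent huge page (THP) mode on Linux.
import re
from pathlib import Path

THP_FILE = Path("/sys/kernel/mm/transparent_hugepage/enabled")


def active_thp_mode(text: str) -> str:
    """Return the bracketed (active) mode from the sysfs file content."""
    match = re.search(r"\[(\w+)\]", text)
    if match is None:
        raise ValueError(f"no active mode marked in {text!r}")
    return match.group(1)


if __name__ == "__main__":
    if THP_FILE.exists():
        print("THP mode:", active_thp_mode(THP_FILE.read_text()))
    else:
        print("THP sysfs file not found on this system")
```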
+
-

3.1   Padding

+

4.4   Padding

With correct padding, cache and TLB thrashing can be avoided. For this purpose the number of (fluid) nodes used in the data layout is artificially increased.

Currently automatic padding is active for kernels which support it. It can be @@ -938,22 +998,912 @@ like -pad 0+16 would

-

4   Geometries

-

TODO: supported geometries: channel, pipe, blocks

+

5   Geometries

+

TODO: supported geometries: channel, pipe, blocks, fluid

+
+
+

6   Performance Results

+

This section lists performance values measured on several machines for +different kernels and geometries. +The RFM column denotes the expected performance as predicted by the +Roofline performance model [williams-2008]. +For performance prediction of each kernel a memory bandwidth benchmark is used +which mimics the kernel's memory access pattern and the kernel's loop balance +(see [kernels] for details).
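
For a memory-bound kernel the Roofline estimate reduces to bandwidth divided by loop balance. The sketch below relies on assumptions that are not stated in this document: the tables are taken to report million fluid lattice updates per second (MFLUP/s), two-grid push/pull kernels are taken to move 19 x 8 B x 3 = 456 B per update (load, store, write-allocate), and AA-pattern kernels 19 x 8 B x 2 = 304 B. With the Haswell bandwidths for copy-19 and update-19 these assumptions give 104 and 145, which agrees with the RFM column of the blk-* and aa-* kernels for that machine.

```python
# Roofline estimate for bandwidth-bound LBM kernels: P = bandwidth / loop balance.
# Assumptions (not stated in the documentation): results are in MFLUP/s and the
# loop balances below describe the memory traffic per lattice update.

Q = 19                 # D3Q19
BYTES_PER_DOUBLE = 8   # double precision

# load + store + write-allocate of the store target
BALANCE_TWO_GRID = Q * BYTES_PER_DOUBLE * 3   # 456 B per update
# load + in-place store, no write-allocate
BALANCE_AA = Q * BYTES_PER_DOUBLE * 2         # 304 B per update


def roofline_mflups(bandwidth_gb_s: float, balance_bytes: int) -> float:
    """Upper performance bound in MFLUP/s for a memory-bound kernel."""
    return bandwidth_gb_s * 1e9 / balance_bytes / 1e6


if __name__ == "__main__":
    # Haswell bandwidths from the documentation: copy-19 and update-19.
    print(round(roofline_mflups(47.3, BALANCE_TWO_GRID)))
    print(round(roofline_mflups(44.0, BALANCE_AA)))
```

The loop balances of the list-based kernels additionally depend on index traffic and are not covered by this sketch.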

+
+

6.1   Haswell, Intel Xeon E5-2695 v3

+
    +
  • Haswell architecture, AVX2, FMA
  • +
  • 14 cores, 2.3 GHz
  • +
  • 2 x 7 cores in cluster-on-die (CoD) mode enabled
  • +
  • SMT enabled
  • +
+

memory bandwidth:

+
    +
  • copy-19 47.3 GB/s
  • +
  • copy-19-nt-sl 47.1 GB/s
  • +
  • update-19 44.0 GB/s
  • +
+

geometry dimensions: 500x100x100

========================= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ===
kernel                    pipe      blocks-2  blocks-4  blocks-6  blocks-8  blocks-10 blocks-15 blocks-16 blocks-20 blocks-25 blocks-32 RFM
========================= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ===
blk-push-aos              58.82     49.85     57.34     59.90     61.37     62.17     65.30     64.00     67.54     64.46     69.69     104
blk-push-soa              32.32     33.46     34.02     34.64     35.06     35.04     36.31     35.44     37.20     35.14     37.95     104
blk-pull-aos              56.97     51.41     56.09     57.92     59.98     59.83     63.37     61.55     65.50     63.11     67.02     104
blk-pull-soa              49.29     46.23     47.50     51.97     51.27     49.52     55.23     53.13     54.50     49.79     57.90     104
aa-aos                    91.35     66.14     76.80     84.76     83.63     91.36     93.46     92.62     93.91     92.25     92.93     145
aa-soa                    75.51     65.68     70.94     71.36     73.83     75.46     74.84     79.48     83.28     77.70     82.72     145
aa-vec-soa                93.85     83.44     91.58     93.96     94.35     96.62     101.76    96.72     106.37    102.60    110.28    145
list-push-aos             80.29     80.97     80.95     81.10     81.37     82.44     81.77     81.49     80.72     81.93     80.93     83
list-push-soa             47.52     42.65     45.28     46.64     43.46     40.59     44.94     46.55     41.53     45.98     44.86     83
list-pull-aos             85.30     82.97     86.43     83.42     86.33     83.70     86.43     83.77     83.10     85.89     84.44     83
list-pull-soa             62.12     63.61     63.28     61.32     66.72     62.65     64.82     60.49     58.01     64.46     62.52     83
list-pull-split-nt-1s-soa 121.35    113.77    115.29    113.54    117.00    116.46    114.78    114.54    110.83    112.67    117.85    125
list-pull-split-nt-2s-soa 118.09    110.48    112.55    113.18    113.44    111.85    109.27    114.41    110.28    111.78    113.74    125
list-aa-aos               121.28    118.63    119.00    118.50    121.99    119.11    118.83    121.47    121.62    126.18    120.12    129
list-aa-soa               126.34    116.90    129.45    127.12    129.41    121.42    126.19    126.76    126.70    124.40    125.22    129
list-aa-ria-soa           133.68    121.82    126.04    128.46    131.15    132.25    128.78    133.50    126.69    124.40    130.37    145
list-aa-pv-soa            146.22    124.39    130.73    136.29    137.61    131.21    138.65    138.78    127.02    132.40    138.37    145
========================= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ===
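
To compare kernels across machines it can help to express each measurement as a fraction of its RFM value. The sketch below does this for three pipe-geometry entries of the Haswell table; the numbers are copied from that table, while `roofline_fraction` and `HASWELL_PIPE` are hypothetical names chosen for illustration.

```python
# Sketch: measured performance as a fraction of the Roofline prediction (RFM).
# Values are (measured, RFM) pairs for the pipe geometry on Haswell.

HASWELL_PIPE = {
    "blk-push-soa":   (32.32, 104),
    "aa-vec-soa":     (93.85, 145),
    "list-aa-pv-soa": (146.22, 145),
}


def roofline_fraction(measured: float, rfm: float) -> float:
    """Ratio of measured performance to the model prediction."""
    return measured / rfm


if __name__ == "__main__":
    for kernel, (measured, rfm) in HASWELL_PIPE.items():
        print(f"{kernel:16s} {100 * roofline_fraction(measured, rfm):5.1f} %")
```

Ratios slightly above 100 % can occur because the model input is itself a measured bandwidth; such values are better read as "at the model limit" than as a contradiction of the model.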
+
+
+

6.2   Broadwell, Intel Xeon E5-2630 v4

+
    +
  • Broadwell architecture, AVX2, FMA
  • +
  • 10 cores, 2.2 GHz
  • +
  • SMT disabled
  • +
+

memory bandwidth:

+
    +
  • copy-19 48.0 GB/s
  • +
  • copy-nt-sl-19 48.2 GB/s
  • +
  • update-19 51.1 GB/s
  • +
+

geometry dimensions: 500x100x100

========================= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ===
kernel                    pipe      blocks-2  blocks-4  blocks-6  blocks-8  blocks-10 blocks-15 blocks-16 blocks-20 blocks-25 blocks-32 RFM
========================= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ===
blk-push-aos              55.75     47.62     54.57     57.10     58.49     59.00     61.72     60.56     64.05     61.10     66.03     105
blk-push-soa              30.06     31.09     32.13     32.54     32.74     32.72     33.81     33.19     34.90     33.21     35.75     105
blk-pull-aos              53.80     48.61     53.08     54.99     56.08     56.68     59.20     58.12     61.49     58.71     63.45     105
blk-pull-soa              46.96     46.61     48.84     49.70     50.33     50.46     52.36     51.39     54.20     51.61     55.71     105
aa-aos                    91.40     66.99     78.47     83.38     86.62     88.62     92.98     91.54     97.08     94.93     98.90     168
aa-soa                    83.01     69.96     75.85     77.72     79.01     79.29     82.38     80.11     85.70     83.91     87.69     168
aa-vec-soa                112.03    96.52     105.32    109.76    112.55    113.82    120.55    118.37    126.30    121.37    131.94    168
list-push-aos             75.13     74.18     75.20     75.42     75.24     75.99     75.80     75.80     75.54     76.22     76.21     97
list-push-soa             40.99     38.14     39.00     38.89     38.89     39.67     39.87     39.28     39.35     40.08     40.13     97
list-pull-aos             82.07     82.88     83.29     83.09     83.32     83.49     82.82     82.88     83.32     82.60     82.93     97
list-pull-soa             62.07     60.40     61.89     61.39     62.43     60.90     60.48     62.80     62.50     61.10     60.38     97
list-pull-split-nt-1s-soa 125.81    120.60    121.96    122.34    122.86    123.53    123.64    123.67    125.94    124.09    123.69    128
list-pull-split-nt-2s-soa 122.79    117.16    118.86    119.16    119.56    119.99    120.01    120.03    122.64    120.57    120.39    128
list-aa-aos               128.13    127.41    129.31    129.07    129.79    129.63    129.67    129.94    129.12    128.41    129.72    150
list-aa-soa               141.60    139.78    141.58    142.16    141.94    141.31    142.37    142.25    142.43    141.40    142.26    150
list-aa-ria-soa           141.82    134.88    140.15    140.72    141.67    140.51    141.18    141.29    142.97    141.94    143.25    168
list-aa-pv-soa            164.79    140.95    159.24    161.78    162.40    163.04    164.69    164.38    165.11    165.75    166.09    168
========================= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ===
+
+
+

6.3   Skylake, Intel Xeon Gold 6148

+
    +
  • Skylake architecture, AVX2, FMA, AVX512
  • +
  • 20 cores, 2.4 GHz
  • +
  • SMT enabled
  • +
+

memory bandwidth:

+
    +
  • copy-19 89.7 GB/s
  • +
  • copy-19-nt-sl 92.4 GB/s
  • +
  • update-19 93.6 GB/s
  • +
+

geometry dimensions: 500x100x100

========================= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ===
kernel                    pipe      blocks-2  blocks-4  blocks-6  blocks-8  blocks-10 blocks-15 blocks-16 blocks-20 blocks-25 blocks-32 RFM
========================= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ===
blk-push-aos              113.01    93.99     108.98    114.65    117.87    119.47    124.95    122.46    129.29    123.87    133.01    197
blk-push-soa              100.21    98.87     103.63    105.56    107.02    107.27    111.61    109.83    116.16    110.51    110.29    197
blk-pull-aos              118.45    102.54    114.12    117.82    122.69    124.31    130.58    127.85    135.72    129.65    139.94    197
blk-pull-soa              82.60     83.36     87.13     88.39     88.84     88.96     92.48     90.93     95.79     91.92     98.64     197
aa-aos                    171.32    125.43    147.73    157.70    163.35    167.25    175.39    174.20    182.54    173.67    187.76    308
aa-soa                    180.85    152.39    165.84    152.59    171.90    175.76    184.94    182.34    189.43    180.30    193.54    308
aa-vec-soa                208.03    181.51    195.86    203.41    209.08    212.34    224.05    219.49    234.31    225.92    245.22    308
list-push-aos             158.81    164.67    162.93    163.05    165.22    164.31    164.66    160.78    164.07    165.19    164.06    177
list-push-soa             134.60    110.44    110.17    132.01    132.95    133.46    134.37    134.33    135.12    134.91    137.87    177
list-pull-aos             169.61    170.03    170.89    170.90    171.20    171.60    172.09    171.95    169.48    172.08    171.02    177
list-pull-soa             120.50    116.73    118.62    118.00    120.99    118.15    117.17    121.41    120.83    120.00    118.74    177
list-pull-split-nt-1s-soa 225.59    224.18    225.10    226.34    226.01    230.37    227.50    228.42    227.39    231.65    227.35    246
list-pull-split-nt-2s-soa 219.20    214.63    217.61    218.13    219.07    221.01    219.88    220.09    220.62    221.68    220.58    246
list-aa-aos               241.39    239.27    239.53    242.56    242.46    243.00    242.91    242.46    241.24    242.96    241.52    275
list-aa-soa               273.73    268.49    268.48    271.79    275.29    274.56    277.18    272.67    274.21    275.24    278.21    275
list-aa-ria-soa           288.42    261.89    273.26    284.84    283.88    288.29    290.72    289.81    293.36    290.75    292.93    308
list-aa-pv-soa            303.35    267.21    289.18    294.96    294.36    298.16    300.45    301.71    302.37    302.88    304.46    308
========================= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ===
-
-

6   Licence

+

7   Licence

The Lattice Boltzmann Benchmark Kernels are licensed under GPLv3.

-

7   Acknowledgements

+

8   Acknowledgements

This work was funded by BMBF, grant no. 01IH15003A (project SKAMPY).

This work was funded by KONWHIR project OMI4PAPS.

-

Document was generated at 2017-11-02 15:33.

+
+
+

9   Bibliography

+ + + + + +
[ginzburg-2008]I. Ginzburg, F. Verhaeghe, and D. d'Humières. +Two-relaxation-time lattice Boltzmann scheme: About parametrization, velocity, pressure and mixed boundary conditions. +Commun. Comput. Phys., 3(2):427-478, 2008.
+ + + + + +
[williams-2008]S. Williams, A. Waterman, and D. Patterson. +Roofline: an insightful visual performance model for multicore architectures. +Commun. ACM, 52(4):65-76, Apr 2009. doi:10.1145/1498765.1498785
+

Document was generated at 2017-11-21 15:43.

diff --git a/doc/main.rst b/doc/main.rst index 528b339..0eaa5ed 100644 --- a/doc/main.rst +++ b/doc/main.rst @@ -35,12 +35,26 @@ LBM Benchmark Kernels Documentation .. sectnum:: .. contents:: +Introduction +============ + +The lattice Boltzmann (LBM) benchmark kernels are a collection of LBM kernel +implementations. + +**AS SUCH THE LBM BENCHMARK KERNELS ARE NO FULLY EQUIPPED CFD SOLVER AND SOLELY +SERVES THE PURPOSE OF STUDYING POSSIBLE PERFORMANCE OPTIMIZATIONS AND/OR +EXPERIMENTS.** + +Currently all kernels utilize a D3Q19 discretization and the +two-relaxation-time (TRT) collision operator [ginzburg-2008]_. +All operations are carried out in double precision arithmetic. + Compilation =========== The benchmark framework currently supports only Linux systems and the GCC and Intel compilers. Every other configuration probably requires adjustment inside -the code and the makefiles. Further some code might be platform or at least +the code and the makefiles. Furthermore some code might be platform or at least POSIX specific. The benchmark can be build via ``make`` from the ``src`` subdirectory. This will @@ -49,6 +63,11 @@ generate one binary which hosts all implemented benchmark kernels. Binaries are located under the ``bin`` subdirectory and will have different names depending on compiler and build configuration. +Compilation can target debug or release builds. Combined with both build types +verification can be enabled, which increases the runtime and hence is not +suited for benchmarking. + + Debug and Verification ---------------------- @@ -69,6 +88,16 @@ Specifying ``BENCHMARK=off`` turns on verification Please note that the generated binary will therefore exhibit a poor performance. + +Release and Verification +------------------------ + +Verification with the debug builds can be extremely slow. 
Hence verification +capabilities can be build with release builds: :: + + make BENCHMARK=off + + Benchmarking ------------ @@ -77,16 +106,11 @@ To generate a binary for benchmarking run make with :: make As default ``BENCHMARK=on`` and ``BUILD=release`` is set, where -BUILD=release turns optimizations on and ``BENCHMARK=on`` disables +``BUILD=release`` turns optimizations on and ``BENCHMARK=on`` disables verfification, statistics, and VTK output. -Release and Verification ------------------------- - -Verification with the debug builds can be extremely slow. Hence verification -capabilities can be build with release builds: :: - - make BENCHMARK=off +See Options Summary below for further description of options which can be +applied, e.g. TARCH as well as the Benchmarking section. Compilers --------- @@ -116,15 +140,15 @@ all object and dependency files are deleted. Options Summary --------------- -Options that can be specified when building the framework with make: +Options that can be specified when building the suite with make: ============= ======================= ============ ========================================================== name values default description -------------- ----------------------- ------------ ---------------------------------------------------------- +============= ======================= ============ ========================================================== BENCHMARK on, off on If enabled, disables VERIFICATION, STATISTICS, VTK_OUTPUT. If disabled enables the three former options. -BUILD debug, release release No optimization, debug symbols, DEBUG defined. +BUILD debug, release release debug: no optimization, debug symbols, DEBUG defined. release: optimizations enabled. CONFIG linux-gcc, linux-intel linux-intel Select GCC or Intel compiler. -ISA avx, sse avx Determines which ISA extension is used for macro definitions. This is *not* the architecture the compiler generates code for. 
+ISA avx, sse avx Determines which ISA extension is used for macro definitions of the intrinsics. This is *not* the architecture the compiler generates code for. OPENMP on, off on OpenMP, i.\,e.\. threading support. STATISTICS on, off off View statistics, like density etc, during simulation. TARCH -- -- Via TARCH the architecture the compiler generates code for can be overridden. The value depends on the chosen compiler. @@ -175,7 +199,7 @@ iterations, etc, which can afterward be override, e.g.: :: $ bin/lbmbenchk-linux-intel-release -verfiy -dims 32x32x32 -Kernel specific parameters can be opatained via selecting the specific kernel +Kernel specific parameters can be obtained via selecting the specific kernel and passing ``-h`` as parameter: :: $ bin/lbmbenchk-linux-intel-release -kernel kernel-name -- -h @@ -299,13 +323,51 @@ created make sure the binary was compiled with: - ``BUILD=release`` (default if not overriden) and - the correct ISA for macros is used, selected via ``ISA`` and - use ``TARCH`` to specify the architecture the compiler generates code for. + +Intel Compiler +-------------- + +For the Intel compiler one can specify depending on the target ISA extension: + +- AVX: ``TARCH=-xAVX`` +- AVX2 and FMA: ``TARCH=-xCORE-AVX2,-fma`` +- AVX512: ``TARCH=-xCORE-AVX512`` +- KNL: ``TARCH=-xMIC-AVX512`` + +Compiling for an architecture supporting AVX (Sandy Bridge, Ivy Bridge): :: + + make ISA=avx TARCH=-xAVX + + +Compiling for an architecture supporting AVX2 (Haswell, Broadwell): :: + + make ISA=avx TARCH=-xCORE-AVX2,-fma + +WARNING: ISA is here still set to ``avx`` as currently we have the FMA intrinsics not +implemented. This might change in the future. + + +Compiling for an architecture supporting AVX-512 (Skylake): :: + + make ISA=avx TARCH=-xCORE-AVX512 + +WARNING: ISA is here still set to ``avx`` as currently we have no implementation for the +AVX512 intrinsics. This might change in the future. 
+ + +Pinning +------- During benchmarking pinning should be used via the ``-pin`` parameter. Running -a benchmark with 10 threads an pin them to the first 10 cores works like :: +a benchmark with 10 threads and pin them to the first 10 cores works like :: $ bin/lbmbenchk-linux-intel-release ... -t 10 -pin $(seq -s , 0 9) -Things the binary does nor check or controll: + +General Remarks +--------------- + +Things the binary does nor check or control: - transparent huge pages: when allocating memory small 4 KiB pages might be replaced with larger ones. This is in general a good thing, but if this is @@ -326,7 +388,7 @@ Things the binary does nor check or controll: used is already full memory might be allocated in a remote domain. Accesses to remote domains typically have a higher latency and lower bandwidth. -- System load: interference with other application, espcially on desktop +- System load: interference with other application, especially on desktop systems should be avoided. - Padding: For SoA based kernels the number of (fluid) nodes is automatically @@ -378,14 +440,134 @@ like ``-pad 0+16`` would at a static padding of two nodes (one node = 8 b). Geometries ========== -TODO: supported geometries: channel, pipe, blocks - - -Results -======= - -TODO - +TODO: supported geometries: channel, pipe, blocks, fluid + + +Performance Results +=================== + +The sections lists performance values measured on several machines for +different kernels and geometries. +The **RFM** column denotes the expected performance as predicted by the +Roofline performance model [williams-2008]_. +For performance prediction of each kernel a memory bandwidth benchmark is used +which mimics the kernels memory access pattern and the kernel's loop balance +(see [kernels]_ for details). 
+ +Haswell, Intel Xeon E5-2695 v3 +------------------------------ + +- Haswell architecture, AVX2, FMA +- 14 cores, 2,3 GHz +- 2 x 7 cores in cluster-on-die (CoD) mode enabled +- SMT enabled + +memory bandwidth: + +- copy-19 47.3 GB/s +- copy-19-nt-sl 47.1 GB/s +- update-19 44.0 GB/s + +geometry dimensions: 500x100x100 + +========================= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ===== +kernel pipe blocks-2 blocks-4 blocks-6 blocks-8 blocks-10 blocks-15 blocks-16 blocks-20 blocks-25 blocks-32 RFM +========================= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ===== +blk-push-aos 58.82 49.85 57.34 59.90 61.37 62.17 65.30 64.00 67.54 64.46 69.69 104 +blk-push-soa 32.32 33.46 34.02 34.64 35.06 35.04 36.31 35.44 37.20 35.14 37.95 104 +blk-pull-aos 56.97 51.41 56.09 57.92 59.98 59.83 63.37 61.55 65.50 63.11 67.02 104 +blk-pull-soa 49.29 46.23 47.50 51.97 51.27 49.52 55.23 53.13 54.50 49.79 57.90 104 +aa-aos 91.35 66.14 76.80 84.76 83.63 91.36 93.46 92.62 93.91 92.25 92.93 145 +aa-soa 75.51 65.68 70.94 71.36 73.83 75.46 74.84 79.48 83.28 77.70 82.72 145 +aa-vec-soa 93.85 83.44 91.58 93.96 94.35 96.62 101.76 96.72 106.37 102.60 110.28 145 +list-push-aos 80.29 80.97 80.95 81.10 81.37 82.44 81.77 81.49 80.72 81.93 80.93 83 +list-push-soa 47.52 42.65 45.28 46.64 43.46 40.59 44.94 46.55 41.53 45.98 44.86 83 +list-pull-aos 85.30 82.97 86.43 83.42 86.33 83.70 86.43 83.77 83.10 85.89 84.44 83 +list-pull-soa 62.12 63.61 63.28 61.32 66.72 62.65 64.82 60.49 58.01 64.46 62.52 83 +list-pull-split-nt-1s-soa 121.35 113.77 115.29 113.54 117.00 116.46 114.78 114.54 110.83 112.67 117.85 125 +list-pull-split-nt-2s-soa 118.09 110.48 112.55 113.18 113.44 111.85 109.27 114.41 110.28 111.78 113.74 125 +list-aa-aos 121.28 118.63 119.00 118.50 121.99 119.11 118.83 121.47 121.62 126.18 120.12 129 +list-aa-soa 126.34 116.90 129.45 127.12 129.41 
121.42 126.19 126.76 126.70 124.40 125.22 129 +list-aa-ria-soa 133.68 121.82 126.04 128.46 131.15 132.25 128.78 133.50 126.69 124.40 130.37 145 +list-aa-pv-soa 146.22 124.39 130.73 136.29 137.61 131.21 138.65 138.78 127.02 132.40 138.37 145 +========================= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ===== + + +Broadwell, Intel Xeon E5-2630 v4 +-------------------------------- + +- Broadwell architecture, AVX2, FMA +- 10 cores, 2.2 GHz +- SMT disabled + +memory bandwidth: + +- copy-19 48.0 GB/s +- copy-nt-sl-19 48.2 GB/s +- update-19 51.1 GB/s + +geometry dimensions: 500x100x100 + +========================= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ======= +kernel pipe blocks-2 blocks-4 blocks-6 blocks-8 blocks-10 blocks-15 blocks-16 blocks-20 blocks-25 blocks-32 RFM +========================= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ======= +blk-push-aos 55.75 47.62 54.57 57.10 58.49 59.00 61.72 60.56 64.05 61.10 66.03 105 +blk-push-soa 30.06 31.09 32.13 32.54 32.74 32.72 33.81 33.19 34.90 33.21 35.75 105 +blk-pull-aos 53.80 48.61 53.08 54.99 56.08 56.68 59.20 58.12 61.49 58.71 63.45 105 +blk-pull-soa 46.96 46.61 48.84 49.70 50.33 50.46 52.36 51.39 54.20 51.61 55.71 105 +aa-aos 91.40 66.99 78.47 83.38 86.62 88.62 92.98 91.54 97.08 94.93 98.90 168 +aa-soa 83.01 69.96 75.85 77.72 79.01 79.29 82.38 80.11 85.70 83.91 87.69 168 +aa-vec-soa 112.03 96.52 105.32 109.76 112.55 113.82 120.55 118.37 126.30 121.37 131.94 168 +list-push-aos 75.13 74.18 75.20 75.42 75.24 75.99 75.80 75.80 75.54 76.22 76.21 97 +list-push-soa 40.99 38.14 39.00 38.89 38.89 39.67 39.87 39.28 39.35 40.08 40.13 97 +list-pull-aos 82.07 82.88 83.29 83.09 83.32 83.49 82.82 82.88 83.32 82.60 82.93 97 +list-pull-soa 62.07 60.40 61.89 61.39 62.43 60.90 60.48 62.80 62.50 61.10 60.38 97 
+list-pull-split-nt-1s-soa 125.81 120.60 121.96 122.34 122.86 123.53 123.64 123.67 125.94 124.09 123.69 128 +list-pull-split-nt-2s-soa 122.79 117.16 118.86 119.16 119.56 119.99 120.01 120.03 122.64 120.57 120.39 128 +list-aa-aos 128.13 127.41 129.31 129.07 129.79 129.63 129.67 129.94 129.12 128.41 129.72 150 +list-aa-soa 141.60 139.78 141.58 142.16 141.94 141.31 142.37 142.25 142.43 141.40 142.26 150 +list-aa-ria-soa 141.82 134.88 140.15 140.72 141.67 140.51 141.18 141.29 142.97 141.94 143.25 168 +list-aa-pv-soa 164.79 140.95 159.24 161.78 162.40 163.04 164.69 164.38 165.11 165.75 166.09 168 +========================= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ======= + + +Skylake, Intel Xeon Gold 6148 +----------------------------- + +- Skylake architecture, AVX2, FMA, AVX512 +- 20 cores, 2.4 GHz +- SMT enabled + +memory bandwidth: + +- copy-19 89.7 GB/s +- copy-19-nt-sl 92.4 GB/s +- update-19 93.6 GB/s + +geometry dimensions: 500x100x100 + + +========================= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= === +kernel pipe blocks-2 blocks-4 blocks-6 blocks-8 blocks-10 blocks-15 blocks-16 blocks-20 blocks-25 blocks-32 RFM +========================= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= === +blk-push-aos 113.01 93.99 108.98 114.65 117.87 119.47 124.95 122.46 129.29 123.87 133.01 197 +blk-push-soa 100.21 98.87 103.63 105.56 107.02 107.27 111.61 109.83 116.16 110.51 110.29 197 +blk-pull-aos 118.45 102.54 114.12 117.82 122.69 124.31 130.58 127.85 135.72 129.65 139.94 197 +blk-pull-soa 82.60 83.36 87.13 88.39 88.84 88.96 92.48 90.93 95.79 91.92 98.64 197 +aa-aos 171.32 125.43 147.73 157.70 163.35 167.25 175.39 174.20 182.54 173.67 187.76 308 +aa-soa 180.85 152.39 165.84 152.59 171.90 175.76 184.94 182.34 189.43 180.30 193.54 308 +aa-vec-soa 208.03 181.51 195.86 
203.41 209.08 212.34 224.05 219.49 234.31 225.92 245.22 308 +list-push-aos 158.81 164.67 162.93 163.05 165.22 164.31 164.66 160.78 164.07 165.19 164.06 177 +list-push-soa 134.60 110.44 110.17 132.01 132.95 133.46 134.37 134.33 135.12 134.91 137.87 177 +list-pull-aos 169.61 170.03 170.89 170.90 171.20 171.60 172.09 171.95 169.48 172.08 171.02 177 +list-pull-soa 120.50 116.73 118.62 118.00 120.99 118.15 117.17 121.41 120.83 120.00 118.74 177 +list-pull-split-nt-1s-soa 225.59 224.18 225.10 226.34 226.01 230.37 227.50 228.42 227.39 231.65 227.35 246 +list-pull-split-nt-2s-soa 219.20 214.63 217.61 218.13 219.07 221.01 219.88 220.09 220.62 221.68 220.58 246 +list-aa-aos 241.39 239.27 239.53 242.56 242.46 243.00 242.91 242.46 241.24 242.96 241.52 275 +list-aa-soa 273.73 268.49 268.48 271.79 275.29 274.56 277.18 272.67 274.21 275.24 278.21 275 +list-aa-ria-soa 288.42 261.89 273.26 284.84 283.88 288.29 290.72 289.81 293.36 290.75 292.93 308 +list-aa-pv-soa 303.35 267.21 289.18 294.96 294.36 298.16 300.45 301.71 302.37 302.88 304.46 308 +========================= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= === Licence ======= @@ -401,6 +583,19 @@ This work was funded by BMBF, grant no. 01IH15003A (project SKAMPY). This work was funded by KONWHIR project OMI4PAPS. +Bibliography +============ + +.. [ginzburg-2008] + I. Ginzburg, F. Verhaeghe, and D. d'Humières. + Two-relaxation-time lattice Boltzmann scheme: About parametrization, velocity, pressure and mixed boundary conditions. + Commun. Comput. Phys., 3(2):427-478, 2008. + +.. [williams-2008] + S. Williams, A. Waterman, and D. Patterson. + Roofline: an insightful visual performance model for multicore architectures. + Commun. ACM, 52(4):65-76, Apr 2009. doi:10.1145/1498765.1498785 + .. |datetime| date:: %Y-%m-%d %H:%M -- 2.25.1