X-Git-Url: http://git.rrze.uni-erlangen.de/gitweb/?p=LbmBenchmarkKernelsPublic.git;a=blobdiff_plain;f=doc%2Fhtml%2Fmain.html;fp=doc%2Fhtml%2Fmain.html;h=99f4cb847eb50e75fbbfdbbf68c260645dd031ef;hp=511f6b2dcf0bfd7cd2be208fc20e3d3f83c391e3;hb=0095f461c30075a883df0265a7b831061ee7ebee;hpb=e3f82424829ebb623343ce0092238f83b4a1b8c2 diff --git a/doc/html/main.html b/doc/html/main.html index 511f6b2..99f4cb8 100644 --- a/doc/html/main.html +++ b/doc/html/main.html @@ -420,42 +420,66 @@ tr:nth-child(odd) {

Contents

+
+

1   Introduction

+

The lattice Boltzmann (LBM) benchmark kernels are a collection of LBM kernel +implementations.

+

AS SUCH THE LBM BENCHMARK KERNELS ARE NOT A FULLY EQUIPPED CFD SOLVER AND SOLELY +SERVE THE PURPOSE OF STUDYING POSSIBLE PERFORMANCE OPTIMIZATIONS AND/OR +EXPERIMENTS.

+

Currently all kernels utilize a D3Q19 discretization and the +two-relaxation-time (TRT) collision operator [ginzburg-2008]. +All operations are carried out in double precision arithmetic.

+
-

1   Compilation

+

2   Compilation

The benchmark framework currently supports only Linux systems and the GCC and Intel compilers. Every other configuration probably requires adjustment inside -the code and the makefiles. Further some code might be platform or at least +the code and the makefiles. Furthermore some code might be platform or at least POSIX specific.

The benchmark can be built via make from the src subdirectory. This will generate one binary which hosts all implemented benchmark kernels.

Binaries are located under the bin subdirectory and will have different names depending on compiler and build configuration.

+

Compilation can target debug or release builds. With either build type +verification can be enabled; it increases the runtime and hence is not +suited for benchmarking.

-

1.1   Debug and Verification

+

2.1   Debug and Verification

 make BUILD=debug BENCHMARK=off
 
@@ -470,32 +494,34 @@ binary will be found in the bin subdirectory a

Please note that the generated binary will therefore exhibit poor performance.

+
+

2.2   Release and Verification

+

Verification with the debug builds can be extremely slow. Hence verification +capabilities can be built with release builds:

+
+make BENCHMARK=off
+
+
-

1.2   Benchmarking

+

2.3   Benchmarking

To generate a binary for benchmarking run make with

 make
 

By default BENCHMARK=on and BUILD=release are set, where -BUILD=release turns optimizations on and BENCHMARK=on disables +BUILD=release turns optimizations on and BENCHMARK=on disables verification, statistics, and VTK output.

-
-
-

1.3   Release and Verification

-

Verification with the debug builds can be extremely slow. Hence verification -capabilities can be build with release builds:

-
-make BENCHMARK=off
-
+

See the Options Summary below, as well as the Benchmarking section, for further description of options which can be +applied, e.g. TARCH.

-

1.4   Compilers

+

2.4   Compilers

Currently only the GCC and Intel compilers under Linux are supported. The configuration can be chosen via CONFIG=linux-gcc or CONFIG=linux-intel.

-

1.5   Cleaning

+

2.5   Cleaning

For each configuration and build (debug/release) a subdirectory under the src/obj directory is created where the dependency and object files are stored. @@ -510,21 +536,23 @@ make clean-all

all object and dependency files are deleted.

-

1.6   Options Summary

-

Options that can be specified when building the framework with make:

+

2.6   Options Summary

+

Options that can be specified when building the suite with make:

---+++ - - - - - + + + + + + + @@ -533,7 +561,7 @@ make clean-all - + @@ -543,7 +571,7 @@ make clean-all - + @@ -575,7 +603,7 @@ make clean-all
-

2   Invocation

+

3   Invocation

Running the binary prints, along with the GPL licence header, a line like the following:

 LBM Benchmark Kernels 0.1, compiled Jul  5 2017 21:59:22, type: verification
@@ -586,7 +614,7 @@ LBM Benchmark Kernels 0.1, compiled Jul  5 2017 21:59:22, type: benchmark
 

if verification was disabled during compilation.

-

2.1   Command Line Parameters

+

3.1   Command Line Parameters

Running the binary with -h lists all available parameters:

 Usage:
@@ -613,7 +641,7 @@ iterations, etc, which can afterward be override, e.g.:

 $ bin/lbmbenchk-linux-intel-release -verify -dims 32x32x32
 
-

Kernel specific parameters can be opatained via selecting the specific kernel +

Kernel specific parameters can be obtained via selecting the specific kernel and passing -h as parameter:

 $ bin/lbmbenchk-linux-intel-release -kernel kernel-name -- -h
@@ -651,7 +679,7 @@ Available kernels to benchmark:
 
-

2.2   Kernels

+

3.2   Kernels

The following list briefly describes the available kernels:

  • push-soa/push-aos/pull-soa/pull-aos: @@ -862,8 +890,8 @@ during each run.

name       values                  default  description
BENCHMARK  on, off                 on
BUILD      debug, release          release  -No optimization, debug symbols, DEBUG defined.
                                            +debug: no optimization, debug symbols, DEBUG defined. release: optimizations enabled.
CONFIG     linux-gcc, linux-intel
ISA        avx, sse                avx      -Determines which ISA extension is used for macro definitions. This is not the architecture the compiler generates code for.
                                            +Determines which ISA extension is used for macro definitions of the intrinsics. This is not the architecture the compiler generates code for.
OPENMP     on, off
-
-

3   Benchmarking

+
+

4   Benchmarking

Correct benchmarking is a nontrivial task. Whenever benchmark results should be created make sure the binary was compiled with:

+
+

4.1   Intel Compiler

+

For the Intel compiler one can specify depending on the target ISA extension:

+
    +
  • AVX: TARCH=-xAVX
  • +
  • AVX2 and FMA: TARCH=-xCORE-AVX2,-fma
  • +
  • AVX512: TARCH=-xCORE-AVX512
  • +
  • KNL: TARCH=-xMIC-AVX512
  • +
+

Compiling for an architecture supporting AVX (Sandy Bridge, Ivy Bridge):

+
+make ISA=avx TARCH=-xAVX
+
+

Compiling for an architecture supporting AVX2 (Haswell, Broadwell):

+
+make ISA=avx TARCH=-xCORE-AVX2,-fma
+
+

WARNING: ISA is still set to avx here, as the FMA intrinsics are currently not +implemented. This might change in the future.

+

Compiling for an architecture supporting AVX-512 (Skylake):

+
+make ISA=avx TARCH=-xCORE-AVX512
+
+

WARNING: ISA is still set to avx here, as there is currently no implementation for the +AVX512 intrinsics. This might change in the future.
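The mapping from target ISA extension to TARCH listed above can be captured in a small shell helper. This is only a sketch: the function name tarch_for and the short keys avx2, avx512, and knl are made up for illustration and are not part of the build system. Only the TARCH values themselves are taken from the list above.

```shell
# Hypothetical helper (not part of the benchmark suite): print the Intel
# compiler TARCH value for a given target ISA extension.
tarch_for() {
    case "$1" in
        avx)    echo "-xAVX" ;;
        avx2)   echo "-xCORE-AVX2,-fma" ;;
        avx512) echo "-xCORE-AVX512" ;;
        knl)    echo "-xMIC-AVX512" ;;
        *)      echo "unknown ISA extension: $1" >&2; return 1 ;;
    esac
}

# Print the resulting make invocation. ISA stays avx, see the warnings above.
echo "make ISA=avx TARCH=$(tarch_for avx2)"
```

The helper only prints the command line, so it can be inspected before actually running make.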

+
+
+

4.2   Pinning

During benchmarking pinning should be used via the -pin parameter. Running -a benchmark with 10 threads an pin them to the first 10 cores works like

+a benchmark with 10 threads and pinning them to the first 10 cores works like

 $ bin/lbmbenchk-linux-intel-release ... -t 10 -pin $(seq -s , 0 9)
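When the thread count varies between runs, the core list can be derived from it. The following is a minimal sketch. The helper name pin_list and its optional second argument (the first core ID) are hypothetical, and it assumes that the cores to be used are numbered contiguously, which depends on the machine.

```shell
# Hypothetical helper: print a comma-separated list of N consecutive core IDs,
# starting at FIRST (default 0), suitable as the argument of -pin.
pin_list() {
    n=$1
    first=${2:-0}
    i=0
    list=""
    while [ "$i" -lt "$n" ]; do
        list="${list:+$list,}$(( first + i ))"
        i=$(( i + 1 ))
    done
    echo "$list"
}

threads=10
# Only prints the parameters; the remaining arguments of the binary are omitted.
echo "-t $threads -pin $(pin_list "$threads")"
```

For the default start core this produces the same list as the seq call above.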
 
-

Things the binary does nor check or controll:

+
+
+

4.3   General Remarks

+

Things the binary does not check or control:

  • transparent huge pages: when allocating memory small 4 KiB pages might be replaced with larger ones. This is in general a good thing, but if this is @@ -895,7 +954,7 @@ means the memory will be placed at the NUMA domain the touching core is associated with. If a different policy is in place or the NUMA domain to be used is already full memory might be allocated in a remote domain. Accesses to remote domains typically have a higher latency and lower bandwidth.
  • -
  • System load: interference with other application, espcially on desktop +
  • System load: interference with other applications, especially on desktop systems should be avoided.
  • Padding: For SoA based kernels the number of (fluid) nodes is automatically adjusted so that no cache or TLB thrashing should occur. The parameters are @@ -905,8 +964,9 @@ padding section.
  • function for different ISA extensions. Make sure the code you might think is executed is actually the code which is executed.
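Since the binary does not check these settings, it can help to record them before a run. The sketch below reads the transparent huge page setting and the load average from the usual Linux locations under /sys and /proc. The helper name thp_active is made up for illustration, and the files may be missing on some systems, in which case nothing is printed.

```shell
# Extract the active setting (the bracketed word) from a line such as
# "always [madvise] never", which is the format of the THP sysfs file.
thp_active() {
    sed -n 's/.*\[\([a-z]*\)\].*/\1/p'
}

thp_file=/sys/kernel/mm/transparent_hugepage/enabled
if [ -r "$thp_file" ]; then
    echo "transparent huge pages: $(thp_active < "$thp_file")"
fi

# The first three fields of /proc/loadavg are the 1, 5, and 15 minute averages.
if [ -r /proc/loadavg ]; then
    echo "load average: $(cut -d ' ' -f 1-3 /proc/loadavg)"
fi
```

Storing this output next to the benchmark results makes it easier to explain outliers later.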
+
-

3.1   Padding

+

4.4   Padding

With correct padding, cache and TLB thrashing can be avoided. To this end, the number of (fluid) nodes used in the data layout is artificially increased.

Currently automatic padding is active for kernels which support it. It can be @@ -938,22 +998,912 @@ like -pad 0+16 would

-

4   Geometries

-

TODO: supported geometries: channel, pipe, blocks

+

5   Geometries

+

TODO: supported geometries: channel, pipe, blocks, fluid

+
+
+

6   Performance Results

+

This section lists performance values measured on several machines for +different kernels and geometries. +The RFM column denotes the expected performance as predicted by the +Roofline performance model [williams-2008]. +For performance prediction of each kernel a memory bandwidth benchmark is used +which mimics the kernel's memory access pattern and the kernel's loop balance +(see [kernels] for details).
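As a rough illustration of such a prediction, the Roofline estimate for a memory-bound kernel is the measured bandwidth divided by the bytes moved per fluid node update. The sketch below rests on assumptions that are not stated in this document: that the table values are million fluid node updates per second, that kernels with separate source and destination lattices load 19 values, store 19 values, and incur 19 write-allocate transfers per update, and that in-place (AA pattern) kernels only load and store 19 values. The function name roofline is made up for illustration.

```shell
# Roofline estimate: performance = memory bandwidth / bytes per node update.
# Arguments: bandwidth in GB/s, bytes transferred per fluid node update.
# Result: million fluid node updates per second, rounded to an integer.
roofline() {
    awk -v bw="$1" -v bytes="$2" 'BEGIN { printf "%.0f\n", bw * 1000 / bytes }'
}

# D3Q19 in double precision: 19 values of 8 bytes each per node.
# Assumption: two-lattice kernels load 19, store 19, and write-allocate 19.
two_lattice=$(( 3 * 19 * 8 ))   # 456 bytes
# Assumption: in-place (AA pattern) kernels only load 19 and store 19.
in_place=$(( 2 * 19 * 8 ))      # 304 bytes

roofline 47.3 "$two_lattice"    # copy-19 bandwidth of the Haswell system
roofline 44.0 "$in_place"       # update-19 bandwidth of the Haswell system
```

With the Haswell bandwidths listed below this prints 104 and 145, which agrees with the RFM entries of the blk-* and aa-* kernels for that machine. The list-based kernels have lower RFM values, presumably because of additional index traffic that this sketch does not model.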

+
+

6.1   Haswell, Intel Xeon E5-2695 v3

+
    +
  • Haswell architecture, AVX2, FMA
  • +
  • 14 cores, 2.3 GHz
  • +
  • 2 x 7 cores, cluster-on-die (CoD) mode enabled
  • +
  • SMT enabled
  • +
+

memory bandwidth:

+
    +
  • copy-19 47.3 GB/s
  • +
  • copy-19-nt-sl 47.1 GB/s
  • +
  • update-19 44.0 GB/s
  • +
+

geometry dimensions: 500x100x100

+ +++++++++++++++ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
kernel                         pipe  blocks-2  blocks-4  blocks-6  blocks-8 blocks-10 blocks-15 blocks-16 blocks-20 blocks-25 blocks-32       RFM
blk-push-aos                  58.82     49.85     57.34     59.90     61.37     62.17     65.30     64.00     67.54     64.46     69.69       104
blk-push-soa                  32.32     33.46     34.02     34.64     35.06     35.04     36.31     35.44     37.20     35.14     37.95       104
blk-pull-aos                  56.97     51.41     56.09     57.92     59.98     59.83     63.37     61.55     65.50     63.11     67.02       104
blk-pull-soa                  49.29     46.23     47.50     51.97     51.27     49.52     55.23     53.13     54.50     49.79     57.90       104
aa-aos                        91.35     66.14     76.80     84.76     83.63     91.36     93.46     92.62     93.91     92.25     92.93       145
aa-soa                        75.51     65.68     70.94     71.36     73.83     75.46     74.84     79.48     83.28     77.70     82.72       145
aa-vec-soa                    93.85     83.44     91.58     93.96     94.35     96.62    101.76     96.72    106.37    102.60    110.28       145
list-push-aos                 80.29     80.97     80.95     81.10     81.37     82.44     81.77     81.49     80.72     81.93     80.93        83
list-push-soa                 47.52     42.65     45.28     46.64     43.46     40.59     44.94     46.55     41.53     45.98     44.86        83
list-pull-aos                 85.30     82.97     86.43     83.42     86.33     83.70     86.43     83.77     83.10     85.89     84.44        83
list-pull-soa                 62.12     63.61     63.28     61.32     66.72     62.65     64.82     60.49     58.01     64.46     62.52        83
list-pull-split-nt-1s-soa    121.35    113.77    115.29    113.54    117.00    116.46    114.78    114.54    110.83    112.67    117.85       125
list-pull-split-nt-2s-soa    118.09    110.48    112.55    113.18    113.44    111.85    109.27    114.41    110.28    111.78    113.74       125
list-aa-aos                  121.28    118.63    119.00    118.50    121.99    119.11    118.83    121.47    121.62    126.18    120.12       129
list-aa-soa                  126.34    116.90    129.45    127.12    129.41    121.42    126.19    126.76    126.70    124.40    125.22       129
list-aa-ria-soa              133.68    121.82    126.04    128.46    131.15    132.25    128.78    133.50    126.69    124.40    130.37       145
list-aa-pv-soa               146.22    124.39    130.73    136.29    137.61    131.21    138.65    138.78    127.02    132.40    138.37       145
+
+
+

6.2   Broadwell, Intel Xeon E5-2630 v4

+
    +
  • Broadwell architecture, AVX2, FMA
  • +
  • 10 cores, 2.2 GHz
  • +
  • SMT disabled
  • +
+

memory bandwidth:

+
    +
  • copy-19 48.0 GB/s
  • +
  • copy-19-nt-sl 48.2 GB/s
  • +
  • update-19 51.1 GB/s
  • +
+

geometry dimensions: 500x100x100

+ +++++++++++++++ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
kernel                         pipe  blocks-2  blocks-4  blocks-6  blocks-8 blocks-10 blocks-15 blocks-16 blocks-20 blocks-25 blocks-32       RFM
blk-push-aos                  55.75     47.62     54.57     57.10     58.49     59.00     61.72     60.56     64.05     61.10     66.03       105
blk-push-soa                  30.06     31.09     32.13     32.54     32.74     32.72     33.81     33.19     34.90     33.21     35.75       105
blk-pull-aos                  53.80     48.61     53.08     54.99     56.08     56.68     59.20     58.12     61.49     58.71     63.45       105
blk-pull-soa                  46.96     46.61     48.84     49.70     50.33     50.46     52.36     51.39     54.20     51.61     55.71       105
aa-aos                        91.40     66.99     78.47     83.38     86.62     88.62     92.98     91.54     97.08     94.93     98.90       168
aa-soa                        83.01     69.96     75.85     77.72     79.01     79.29     82.38     80.11     85.70     83.91     87.69       168
aa-vec-soa                   112.03     96.52    105.32    109.76    112.55    113.82    120.55    118.37    126.30    121.37    131.94       168
list-push-aos                 75.13     74.18     75.20     75.42     75.24     75.99     75.80     75.80     75.54     76.22     76.21        97
list-push-soa                 40.99     38.14     39.00     38.89     38.89     39.67     39.87     39.28     39.35     40.08     40.13        97
list-pull-aos                 82.07     82.88     83.29     83.09     83.32     83.49     82.82     82.88     83.32     82.60     82.93        97
list-pull-soa                 62.07     60.40     61.89     61.39     62.43     60.90     60.48     62.80     62.50     61.10     60.38        97
list-pull-split-nt-1s-soa    125.81    120.60    121.96    122.34    122.86    123.53    123.64    123.67    125.94    124.09    123.69       128
list-pull-split-nt-2s-soa    122.79    117.16    118.86    119.16    119.56    119.99    120.01    120.03    122.64    120.57    120.39       128
list-aa-aos                  128.13    127.41    129.31    129.07    129.79    129.63    129.67    129.94    129.12    128.41    129.72       150
list-aa-soa                  141.60    139.78    141.58    142.16    141.94    141.31    142.37    142.25    142.43    141.40    142.26       150
list-aa-ria-soa              141.82    134.88    140.15    140.72    141.67    140.51    141.18    141.29    142.97    141.94    143.25       168
list-aa-pv-soa               164.79    140.95    159.24    161.78    162.40    163.04    164.69    164.38    165.11    165.75    166.09       168
+
+
+

6.3   Skylake, Intel Xeon Gold 6148

+
    +
  • Skylake architecture, AVX2, FMA, AVX512
  • +
  • 20 cores, 2.4 GHz
  • +
  • SMT enabled
  • +
+

memory bandwidth:

+
    +
  • copy-19 89.7 GB/s
  • +
  • copy-19-nt-sl 92.4 GB/s
  • +
  • update-19 93.6 GB/s
  • +
+

geometry dimensions: 500x100x100

+ +++++++++++++++ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
kernel                         pipe  blocks-2  blocks-4  blocks-6  blocks-8 blocks-10 blocks-15 blocks-16 blocks-20 blocks-25 blocks-32       RFM
blk-push-aos                 113.01     93.99    108.98    114.65    117.87    119.47    124.95    122.46    129.29    123.87    133.01       197
blk-push-soa                 100.21     98.87    103.63    105.56    107.02    107.27    111.61    109.83    116.16    110.51    110.29       197
blk-pull-aos                 118.45    102.54    114.12    117.82    122.69    124.31    130.58    127.85    135.72    129.65    139.94       197
blk-pull-soa                  82.60     83.36     87.13     88.39     88.84     88.96     92.48     90.93     95.79     91.92     98.64       197
aa-aos                       171.32    125.43    147.73    157.70    163.35    167.25    175.39    174.20    182.54    173.67    187.76       308
aa-soa                       180.85    152.39    165.84    152.59    171.90    175.76    184.94    182.34    189.43    180.30    193.54       308
aa-vec-soa                   208.03    181.51    195.86    203.41    209.08    212.34    224.05    219.49    234.31    225.92    245.22       308
list-push-aos                158.81    164.67    162.93    163.05    165.22    164.31    164.66    160.78    164.07    165.19    164.06       177
list-push-soa                134.60    110.44    110.17    132.01    132.95    133.46    134.37    134.33    135.12    134.91    137.87       177
list-pull-aos                169.61    170.03    170.89    170.90    171.20    171.60    172.09    171.95    169.48    172.08    171.02       177
list-pull-soa                120.50    116.73    118.62    118.00    120.99    118.15    117.17    121.41    120.83    120.00    118.74       177
list-pull-split-nt-1s-soa    225.59    224.18    225.10    226.34    226.01    230.37    227.50    228.42    227.39    231.65    227.35       246
list-pull-split-nt-2s-soa    219.20    214.63    217.61    218.13    219.07    221.01    219.88    220.09    220.62    221.68    220.58       246
list-aa-aos                  241.39    239.27    239.53    242.56    242.46    243.00    242.91    242.46    241.24    242.96    241.52       275
list-aa-soa                  273.73    268.49    268.48    271.79    275.29    274.56    277.18    272.67    274.21    275.24    278.21       275
list-aa-ria-soa              288.42    261.89    273.26    284.84    283.88    288.29    290.72    289.81    293.36    290.75    292.93       308
list-aa-pv-soa               303.35    267.21    289.18    294.96    294.36    298.16    300.45    301.71    302.37    302.88    304.46       308
-
-

6   Licence

+

7   Licence

The Lattice Boltzmann Benchmark Kernels are licensed under GPLv3.

-

7   Acknowledgements

+

8   Acknowledgements

This work was funded by BMBF, grant no. 01IH15003A (project SKAMPY).

This work was funded by KONWHIR project OMI4PAPS.

-

Document was generated at 2017-11-02 15:33.

+
+
+

9   Bibliography

+ + + + + +
[ginzburg-2008] I. Ginzburg, F. Verhaeghe, and D. d'Humières. +Two-relaxation-time lattice Boltzmann scheme: About parametrization, velocity, pressure and mixed boundary conditions. +Commun. Comput. Phys., 3(2):427-478, 2008.
+ + + + + +
[williams-2008] S. Williams, A. Waterman, and D. Patterson. +Roofline: an insightful visual performance model for multicore architectures. +Commun. ACM, 52(4):65-76, Apr 2009. doi:10.1145/1498765.1498785
+

Document was generated at 2017-11-21 15:43.