+ + +LBM Benchmark Kernels Documentation
+Contents
+ +The lattice Boltzmann (LBM) benchmark kernels are a collection of LBM kernel +implementations.
+AS SUCH, THE LBM BENCHMARK KERNELS ARE NOT A FULLY EQUIPPED CFD SOLVER AND SOLELY SERVE THE PURPOSE OF STUDYING POSSIBLE PERFORMANCE OPTIMIZATIONS AND/OR EXPERIMENTS.
+Currently all kernels utilize a D3Q19 discretization and the +two-relaxation-time (TRT) collision operator [ginzburg-2008]. +All operations are carried out in double or single precision arithmetic.
+The benchmark framework currently supports only Linux systems and the GCC and Intel compilers. Any other configuration probably requires adjustments inside the code and the makefiles. Furthermore, some code might be platform specific, or at least POSIX specific.
+The benchmark can be built via make from the src subdirectory. This generates one binary which hosts all implemented benchmark kernels.
+Binaries are located under the bin subdirectory and will have different names +depending on compiler and build configuration.
+Compilation can target debug or release builds. With either build type, verification can additionally be enabled, which increases the runtime and hence makes the binary unsuited for benchmarking.
++make BUILD=debug BENCHMARK=off ++
Running make with BUILD=debug builds the debug version of the benchmark kernels, where no optimizations are performed, line numbers and debug symbols are included, and DEBUG is defined. The resulting binary can be found in the bin subdirectory and is named lbmbenchk-linux-<compiler>-debug.
+Specifying BENCHMARK=off turns on verification (VERIFICATION=on), statistics (STATISTICS=on), and VTK output (VTK_OUTPUT=on).
+Please note that the generated binary will therefore exhibit poor performance.
+Verification with the debug builds can be extremely slow. Hence verification capabilities can also be built into release builds:
++make BENCHMARK=off ++
To generate a binary for benchmarking, run make with
++make ++
By default, BENCHMARK=on and BUILD=release are set, where BUILD=release turns optimizations on and BENCHMARK=on disables verification, statistics, and VTK output.
+See the Options Summary below for a further description of the options that can be applied, e.g. TARCH, as well as the Benchmarking section.
+Currently only the GCC and Intel compilers under Linux are supported. The configuration can be chosen via CONFIG=linux-gcc or CONFIG=linux-intel.
+By default, double precision data types are used for storing PDFs and floating point constants. This is also the default for the intrinsic kernels. With the PRECISION=sp variable this can be changed to single precision.
++make PRECISION=sp # build for single precision kernels + +make PRECISION=dp # build for double precision kernels (default) ++
For each configuration and build (debug/release) a subdirectory under the +src/obj directory is created where the dependency and object files are +stored. +With
++make CONFIG=... BUILD=... clean ++
a specific combination is selected and cleaned, whereas with
++make clean-all ++
all object and dependency files are deleted.
+Options that can be specified when building the suite with make:
+name | +values | +default | +description | +
---|---|---|---|
BENCHMARK | +on, off | +on | +If enabled, disables VERIFICATION, STATISTICS, and VTK_OUTPUT. If disabled, these three options are enabled. | +
BUILD | +debug, release | +release | +debug: no optimization, debug symbols, DEBUG defined. release: optimizations enabled. | +
CONFIG | +linux-gcc, linux-intel | +linux-intel | +Select GCC or Intel compiler. | +
ISA | +avx, sse | +avx | +Determines which ISA extension is used for macro definitions of the intrinsics. This is not the architecture the compiler generates code for. | +
OPENMP | +on, off | +on | +OpenMP, i.e. threading support. | +
PRECISION | +dp, sp | +dp | +Floating point precision used for data types, arithmetic, and intrinsics. | +
STATISTICS | +on, off | +off | +View statistics, like density etc., during the simulation. | +
TARCH | +-- | +-- | +Via TARCH the architecture the compiler generates code for can be overridden. The value depends on the chosen compiler. | +
VERIFICATION | +on, off | +off | +Turn verification on/off. | +
VTK_OUTPUT | +on, off | +off | +Enable/Disable VTK file output. | +
Running the binary will print, among the GPL license header, a line like the following:
++LBM Benchmark Kernels 0.1, compiled Jul 5 2017 21:59:22, type: verification ++
if verification was enabled during compilation, or
++LBM Benchmark Kernels 0.1, compiled Jul 5 2017 21:59:22, type: benchmark ++
if verification was disabled during compilation.
+Running the binary with -h lists all available parameters:
++Usage: +./lbmbenchk -list +./lbmbenchk + [-dims XxYxZ] [-geometry box|channel|pipe|blocks[-<block size>]] [-iterations <iterations>] [-lattice-dump-ascii] + [-rho-in <density>] [-rho-out <density>] [-omega <omega>] [-kernel <kernel>] + [-periodic-x] + [-t <number of threads>] + [-pin core{,core}*] + [-verify] + -- <kernel specific parameters> + +-list List available kernels. + +-dims XxYxZ Specify geometry dimensions. + +-geometry blocks-<block size> + Geometry with blocks of size <block size> regularly laid out. ++
If an option is specified multiple times, the last one overrides the previous ones. This also holds true for -verify, which sets geometry dimensions, iterations, etc., which can afterwards be overridden, e.g.:
++$ bin/lbmbenchk-linux-intel-release-dp -verify -dims 32x32x32 ++
Kernel specific parameters can be obtained via selecting the specific kernel +and passing -h as parameter:
++$ bin/lbmbenchk-linux-intel-release-dp -kernel kernel-name -- -h +... +Kernel parameters: +[-blk <n>] [-blk-[xyz] <n>] ++
A list of all available kernels can be obtained via -list:
++$ ../bin/lbmbenchk-linux-gcc-debug-dp -list +Lattice Boltzmann Benchmark Kernels (LbmBenchKernels) Copyright (C) 2016, 2017 LSS, RRZE +This program comes with ABSOLUTELY NO WARRANTY; for details see LICENSE. +This is free software, and you are welcome to redistribute it under certain conditions. + +LBM Benchmark Kernels 0.1, compiled Jul 5 2017 21:59:22, type: verification +Available kernels to benchmark: + list-aa-pv-soa + list-aa-ria-soa + list-aa-soa + list-aa-aos + list-pull-split-nt-1s-soa + list-pull-split-nt-2s-soa + list-push-soa + list-push-aos + list-pull-soa + list-pull-aos + push-soa + push-aos + pull-soa + pull-aos + blk-push-soa + blk-push-aos + blk-pull-soa + blk-pull-aos ++
+The following list briefly describes the available kernels:
+Note that all array of structures (aos) kernels might require blocking (depending on the domain size) to reach the performance of their structure of arrays (soa) counterparts.
+The following table summarizes the properties of the kernels. Here D means direct addressing, i.e. a full array, and I means indirect addressing, i.e. a 1D vector with an adjacency list; x means supported, whereas -- means unsupported. The loop balance B_l is computed for the D3Q19 model with double precision floating point for the PDFs (8 byte) and 4 byte integers for the index (adjacency list). As list-aa-ria-soa and list-aa-pv-soa support run-length coding, their effective loop balance depends on the geometry. The effective loop balance is printed during each run.
+kernel name | +prop. step | +data layout | +addr. | +parallel | +blocking | +B_l [B/FLUP] | +
---|---|---|---|---|---|---|
push-soa | +OS | +SoA | +D | +x | +-- | +456 | +
push-aos | +OS | +AoS | +D | +x | +-- | +456 | +
pull-soa | +OS | +SoA | +D | +x | +-- | +456 | +
pull-aos | +OS | +AoS | +D | +x | +-- | +456 | +
blk-push-soa | +OS | +SoA | +D | +x | +x | +456 | +
blk-push-aos | +OS | +AoS | +D | +x | +x | +456 | +
blk-pull-soa | +OS | +SoA | +D | +x | +x | +456 | +
blk-pull-aos | +OS | +AoS | +D | +x | +x | +456 | +
aa-soa | +AA | +SoA | +D | +x | +x | +304 | +
aa-aos | +AA | +AoS | +D | +x | +x | +304 | +
aa-vec-soa | +AA | +SoA | +D | +x | +x | +304 | +
aa-vec-sl-soa | +AA | +SoA | +D | +x | +x | +304 | +
list-push-soa | +OS | +SoA | +I | +x | +x | +528 | +
list-push-aos | +OS | +AoS | +I | +x | +x | +528 | +
list-pull-soa | +OS | +SoA | +I | +x | +x | +528 | +
list-pull-aos | +OS | +AoS | +I | +x | +x | +528 | +
list-pull-split-nt-1s | +OS | +SoA | +I | +x | +x | +376 | +
list-pull-split-nt-2s | +OS | +SoA | +I | +x | +x | +376 | +
list-aa-soa | +AA | +SoA | +I | +x | +x | +340 | +
list-aa-aos | +AA | +AoS | +I | +x | +x | +340 | +
list-aa-ria-soa | +AA | +SoA | +I | +x | +x | +304-342 | +
list-aa-pv-soa | +AA | +SoA | +I | +x | +x | +304-342 | +
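The B_l column can be reproduced from the parameters stated above (8-byte PDFs, 4-byte indices, Q = 19). The transfer counting below (loads, stores, and write-allocate transfers per fluid-lattice update) is one plausible reading that matches the table's numbers, not a breakdown given in the text:

```shell
Q=19        # lattice directions in the D3Q19 model
PDF=8       # bytes per PDF value (double precision)
IDX=4       # bytes per adjacency-list index

# One-step (OS) kernels with two lattices: Q loads + Q stores + Q
# write-allocate transfers per fluid-lattice update.
echo "OS, direct:   $(( 3 * Q * PDF )) B/FLUP"

# AA-pattern kernels update one lattice in place: Q loads + Q stores,
# no write-allocate transfers.
echo "AA, direct:   $(( 2 * Q * PDF )) B/FLUP"

# Indirect OS kernels additionally load 18 adjacency-list entries
# (the center direction needs no index).
echo "OS, indirect: $(( 3 * Q * PDF + (Q - 1) * IDX )) B/FLUP"

# Indirect AA kernels touch the adjacency list only in every second sweep.
echo "AA, indirect: $(( 2 * Q * PDF + (Q - 1) * IDX / 2 )) B/FLUP"
```

This yields 456, 304, 528, and 340 B/FLUP, matching the table rows for the direct OS, direct AA, list OS, and list AA kernels, respectively.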
+Correct benchmarking is a nontrivial task. Whenever benchmark results are to be created, make sure the binary was compiled with:
+For the Intel compiler one can specify depending on the target ISA extension:
+Compiling for an architecture supporting AVX (Sandy Bridge, Ivy Bridge):
++make ISA=avx TARCH=-xAVX ++
Compiling for an architecture supporting AVX2 (Haswell, Broadwell):
++make ISA=avx TARCH=-xCORE-AVX2,-fma ++
+WARNING: ISA is still set to avx here, as we currently have not implemented the FMA intrinsics. This might change in the future.
+Compiling for an architecture supporting AVX-512 (Skylake):
++make ISA=avx TARCH=-xCORE-AVX512 ++
+WARNING: ISA is still set to avx here, as we currently have no implementation of the AVX-512 intrinsics. This might change in the future.
+During benchmarking, pinning should be used via the -pin parameter. Running a benchmark with 10 threads and pinning them to the first 10 cores works as follows:
++$ bin/lbmbenchk-linux-intel-release-dp ... -t 10 -pin $(seq -s , 0 9) ++
Things the binary does not check or control:
+With correct padding, cache and TLB thrashing can be avoided. For this purpose the number of (fluid) nodes used in the data layout is artificially increased.
+Currently automatic padding is active for kernels which support it. It can be controlled via the kernel parameter -pad (i.e. a parameter after the --). Supported values are auto (default), no (to disable padding), or a manual padding.
+Automatic padding tries to avoid cache and TLB thrashing and pads for a 32 +entry (huge pages) TLB with 8 sets and a 512 set (L2) cache. This reflects the +parameters of current Intel based processors.
+Manual padding is done via a padding string of the format mod_1+offset_1(,mod_n+offset_n), which specifies numbers of bytes. SoA data layouts can exhibit TLB thrashing. Therefore we want to distribute the 19 pages with one lattice (36 with two lattices) we are concurrently accessing over as many sets in the TLB as possible. This is controlled by the distance between the accessed pages, which is the number of (fluid) nodes in between them and can be adjusted by adding further (fluid) nodes. We want the distance d (in bytes) between two accessed pages to satisfy, e.g., d % (PAGE_SIZE * TLB_SETS) = PAGE_SIZE. This would distribute the pages evenly over the sets. Here PAGE_SIZE * TLB_SETS would be our mod_1 and PAGE_SIZE (after the =) our offset_1. Measurements show that higher performance is achieved with only a quarter or half of a page size as offset, which is what automatic padding does. On top of this padding, further paddings can be added; they are simply appended to the padding string, separated by commas.
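As a sketch of this rule, the following loop finds the smallest node count at or above a given start value whose page distance satisfies d % (PAGE_SIZE * TLB_SETS) == PAGE_SIZE. The constants (4 KiB pages, 8 TLB sets, 8-byte nodes) and the start value are illustrative assumptions, not necessarily the values the automatic padding uses:

```shell
PAGE_SIZE=4096   # bytes per page (assumption: small pages)
TLB_SETS=8       # number of TLB sets (from the text)
NODE_SIZE=8      # bytes per node and direction (double precision)

MOD=$(( PAGE_SIZE * TLB_SETS ))

# Grow the node count until the distance between accessed pages lands on
# the desired offset within the mod.
N=4000
while [ $(( N * NODE_SIZE % MOD )) -ne "$PAGE_SIZE" ]; do
    N=$(( N + 1 ))
done
echo "padded node count: $N"   # 4608 for these assumptions
```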
+A zero modulus in the padding string has a special meaning: here the corresponding offset is just added to the number of nodes. A padding string like -pad 0+16 would add a static padding of two nodes (one node = 8 byte).
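For illustration, the -pad parameter might be passed as follows. The binary and kernel names are taken from earlier in this document; the concrete padding values are merely examples of the syntax:

```shell
# Disable automatic padding; kernel parameters follow the "--" separator:
bin/lbmbenchk-linux-intel-release-dp -kernel push-soa -- -pad no

# Manual padding: mod 32768 B with offset 4096 B, plus a static padding
# of two nodes (zero modulus, 16 B offset):
bin/lbmbenchk-linux-intel-release-dp -kernel push-soa -- -pad 32768+4096,0+16
```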
+TODO: supported geometries: channel, pipe, blocks, fluid
+This section lists performance values measured on several machines for different kernels and geometries, with double precision floating point data/arithmetic. The RFM column denotes the expected performance as predicted by the roofline performance model [williams-2008]. For the performance prediction of each kernel a memory bandwidth benchmark is used, which mimics the kernel's memory access pattern and takes the kernel's loop balance into account (see [kernels] for details).
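The roofline prediction itself is simply the attainable memory bandwidth divided by the loop balance. A sketch with an assumed bandwidth figure (the actual RFM values use the measured benchmark described above, not this number):

```shell
BW_GBS=40    # assumed saturated memory bandwidth in GB/s (illustrative)
BL=456       # loop balance in B/FLUP, e.g. push-soa from the kernel table

# Performance [MFLUP/s] = bandwidth [B/s] / loop balance [B/FLUP] / 1e6
awk -v bw="$BW_GBS" -v bl="$BL" \
    'BEGIN { printf "predicted: %.1f MFLUP/s\n", bw * 1e9 / (bl * 1e6) }'
# predicted: 87.7 MFLUP/s
```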
+Ivy Bridge, Intel Xeon E5-2660 v2
+Haswell, Intel Xeon E5-2695 v3
+Broadwell, Intel Xeon E5-2630 v4
+Skylake, Intel Xeon Gold 6148
+NOTE: currently we only use AVX2 intrinsics.
+Zen, AMD EPYC 7451
+Zen, AMD Ryzen 7 1700X
Ivy Bridge, Intel Xeon E5-2660 v2, Double Precision (plot)
Ivy Bridge, Intel Xeon E5-2660 v2, Single Precision (plot)
Haswell, Intel Xeon E5-2695 v3, Double Precision (plot)
Haswell, Intel Xeon E5-2695 v3, Single Precision (plot)
Broadwell, Intel Xeon E5-2630 v4, Double Precision (plot)
Broadwell, Intel Xeon E5-2630 v4, Single Precision (plot)
Skylake, Intel Xeon Gold 6148, Double Precision, NOTE: currently we only use AVX2 intrinsics. (plot)
Skylake, Intel Xeon Gold 6148, Single Precision, NOTE: currently we only use AVX2 intrinsics. (plot)
Zen, AMD Ryzen 7 1700X, Double Precision (plot)
Zen, AMD Ryzen 7 1700X, Single Precision (plot)
Zen, AMD EPYC 7451, Double Precision (plot)
Zen, AMD EPYC 7451, Single Precision (plot)
The Lattice Boltzmann Benchmark Kernels are licensed under GPLv3.
+This work was funded by BMBF, grant no. 01IH15003A (project SKAMPY).
+This work was funded by KONWHIR project OMI4PAPS.
+[ginzburg-2008] | I. Ginzburg, F. Verhaeghe, and D. d'Humières. +Two-relaxation-time lattice Boltzmann scheme: About parametrization, velocity, pressure and mixed boundary conditions. +Commun. Comput. Phys., 3(2):427-478, 2008. |
[williams-2008] | S. Williams, A. Waterman, and D. Patterson. +Roofline: an insightful visual performance model for multicore architectures. +Commun. ACM, 52(4):65-76, Apr 2009. doi:10.1145/1498765.1498785 |
Document was generated at 2018-01-09 11:54.