Commit | Line | Data |
---|---|---|
0fde6e45 MW |
1 | |
2 | | Copyright | |
3 | | Markus Wittmann, 2016-2018 | |
4 | | RRZE, University of Erlangen-Nuremberg, Germany | |
5 | | markus.wittmann -at- fau.de or hpc -at- rrze.fau.de | |
6 | | | |
7 | | Viktor Haag, 2016 | |
8 | | LSS, University of Erlangen-Nuremberg, Germany | |
9 | | | |
8cafd9ea MW |
10 | | Michael Hussnaetter, 2017-2018 |
11 | | University of Erlangen-Nuremberg, Germany | |
12 | | michael.hussnaetter -at- fau.de | |
13 | | | |
0fde6e45 MW |
14 | | This file is part of the Lattice Boltzmann Benchmark Kernels (LbmBenchKernels). |
15 | | | |
16 | | LbmBenchKernels is free software: you can redistribute it and/or modify | |
17 | | it under the terms of the GNU General Public License as published by | |
18 | | the Free Software Foundation, either version 3 of the License, or | |
19 | | (at your option) any later version. | |
20 | | | |
21 | | LbmBenchKernels is distributed in the hope that it will be useful, | |
22 | | but WITHOUT ANY WARRANTY; without even the implied warranty of | |
23 | | MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the | |
24 | | GNU General Public License for more details. | |
25 | | | |
26 | | You should have received a copy of the GNU General Public License | |
27 | | along with LbmBenchKernels. If not, see <http://www.gnu.org/licenses/>. | |
10988083 MW |
28 | |
29 | .. title:: LBM Benchmark Kernels Documentation | |
30 | ||
31 | ||
0fde6e45 | 32 | **LBM Benchmark Kernels Documentation** |
10988083 MW |
33 | |
34 | .. sectnum:: | |
35 | .. contents:: | |
36 | ||
0095f461 MW |
37 | Introduction |
38 | ============ | |
39 | ||
40 | The lattice Boltzmann (LBM) benchmark kernels are a collection of LBM kernel | |
41 | implementations. | |
42 | ||
43 | **AS SUCH THE LBM BENCHMARK KERNELS ARE NOT A FULLY EQUIPPED CFD SOLVER AND SOLELY | |
44 | SERVE THE PURPOSE OF STUDYING POSSIBLE PERFORMANCE OPTIMIZATIONS AND/OR | |
45 | EXPERIMENTS.** | |
46 | ||
47 | Currently all kernels utilize a D3Q19 discretization and the | |
48 | two-relaxation-time (TRT) collision operator [ginzburg-2008]_. | |
0fde6e45 | 49 | All operations are carried out in double or single precision arithmetic. |
0095f461 | 50 | |
10988083 MW |
51 | Compilation |
52 | =========== | |
53 | ||
54 | The benchmark framework currently supports only Linux systems and the GCC and | |
55 | Intel compilers. Every other configuration probably requires adjustment inside | |
0095f461 | 56 | the code and the makefiles. Furthermore some code might be platform or at least |
10988083 MW |
57 | POSIX specific. |
58 | ||
59 | The benchmark can be built via ``make`` from the ``src`` subdirectory. This will | |
60 | generate one binary which hosts all implemented benchmark kernels. | |
61 | ||
62 | Binaries are located under the ``bin`` subdirectory and will have different names | |
63 | depending on compiler and build configuration. | |
64 | ||
0095f461 MW |
65 | Compilation can target debug or release builds. Combined with both build types |
66 | verification can be enabled, which increases the runtime and hence is not | |
67 | suited for benchmarking. | |
68 | ||
69 | ||
10988083 MW |
70 | Debug and Verification |
71 | ---------------------- | |
72 | ||
73 | :: | |
74 | ||
e3f82424 | 75 | make BUILD=debug BENCHMARK=off |
10988083 | 76 | |
e3f82424 | 77 | Running ``make`` with ``BUILD=debug`` builds the debug version of |
10988083 MW |
78 | the benchmark kernels: no optimizations are performed, line numbers and |
79 | debug symbols are included, and ``DEBUG`` is defined. The resulting | |
80 | binary will be found in the ``bin`` subdirectory and named | |
81 | ``lbmbenchk-linux-<compiler>-debug``. | |
82 | ||
e3f82424 MW |
83 | Specifying ``BENCHMARK=off`` turns on verification |
84 | (``VERIFICATION=on``), statistics (``STATISTICS=on``), and VTK output | |
10988083 MW |
85 | (``VTK_OUTPUT=on``). |
86 | ||
87 | Please note that the generated binary will therefore | |
88 | exhibit poor performance. | |
89 | ||
0095f461 MW |
90 | |
91 | Release and Verification | |
92 | ------------------------ | |
93 | ||
94 | Verification with the debug builds can be extremely slow. Hence verification | |
95 | capabilities can be built with release builds: :: | |
96 | ||
97 | make BENCHMARK=off | |
98 | ||
99 | ||
10988083 MW |
100 | Benchmarking |
101 | ------------ | |
102 | ||
103 | To generate a binary for benchmarking run make with :: | |
104 | ||
e3f82424 | 105 | make |
10988083 | 106 | |
e3f82424 | 107 | By default ``BENCHMARK=on`` and ``BUILD=release`` are set, where |
0095f461 | 108 | ``BUILD=release`` turns optimizations on and ``BENCHMARK=on`` disables |
10988083 MW |
109 | verification, statistics, and VTK output. |
110 | ||
0095f461 MW |
111 | See Options Summary below for further description of options which can be |
112 | applied, e.g. TARCH as well as the Benchmarking section. | |
10988083 MW |
113 | |
114 | Compilers | |
115 | --------- | |
116 | ||
117 | Currently only the GCC and Intel compilers under Linux are supported. The | |
118 | configuration can be chosen via ``CONFIG=linux-gcc`` or | |
119 | ``CONFIG=linux-intel``. | |
120 | ||
e3f82424 | 121 | |
0fde6e45 MW |
122 | Floating Point Precision |
123 | ------------------------ | |
124 | ||
125 | By default double precision data types are used for storing PDFs and floating | |
126 | point constants. Furthermore, this is the default for the intrinsics kernels. | |
127 | With the ``PRECISION=sp`` variable this can be changed to single precision. :: | |
128 | ||
129 | make PRECISION=sp # build for single precision kernels | |
130 | ||
131 | make PRECISION=dp # build for double precision kernels (default) | |
132 | ||
133 | ||
e3f82424 MW |
134 | Cleaning |
135 | -------- | |
136 | ||
137 | For each configuration and build (debug/release) a subdirectory under the | |
138 | ``src/obj`` directory is created where the dependency and object files are | |
139 | stored. | |
140 | With :: | |
141 | ||
142 | make CONFIG=... BUILD=... clean | |
143 | ||
144 | a specific combination is selected and cleaned, whereas with :: | |
145 | ||
146 | make clean-all | |
147 | ||
148 | all object and dependency files are deleted. | |
149 | ||
150 | ||
10988083 MW |
151 | Options Summary |
152 | --------------- | |
153 | ||
0095f461 | 154 | Options that can be specified when building the suite with make: |
10988083 MW |
155 | |
156 | ============= ======================= ============ ========================================================== | |
157 | name values default description | |
0095f461 | 158 | ============= ======================= ============ ========================================================== |
e3f82424 | 159 | BENCHMARK on, off on If enabled, disables VERIFICATION, STATISTICS, VTK_OUTPUT. If disabled enables the three former options. |
0095f461 | 160 | BUILD debug, release release debug: no optimization, debug symbols, DEBUG defined. release: optimizations enabled. |
10988083 | 161 | CONFIG linux-gcc, linux-intel linux-intel Select GCC or Intel compiler. |
8cafd9ea MW |
162 | ISA avx512, avx, sse avx Determines which ISA extension is used for macro definitions of the intrinsics. This is *not* the architecture the compiler generates code for. |
163 | OPENMP on, off on OpenMP, i.e. threading support. | |
0fde6e45 | 164 | PRECISION dp, sp dp Floating point precision used for data type, arithmetic, and intrinsics. |
10988083 | 165 | STATISTICS on, off off View statistics, like density etc, during simulation. |
e3f82424 | 166 | TARCH -- -- Via TARCH the architecture the compiler generates code for can be overridden. The value depends on the chosen compiler. |
10988083 MW |
167 | VERIFICATION on, off off Turn verification on/off. |
168 | VTK_OUTPUT on, off off Enable/Disable VTK file output. | |
169 | ============= ======================= ============ ========================================================== | |
170 | ||
8cafd9ea MW |
171 | **Suboptions for ``ISA=avx512``** |
172 | ||
173 | ============================== ======== ======== ====================== | |
174 | name values default description | |
175 | ============================== ======== ======== ====================== | |
176 | ADJ_LIST_MEM_TYPE HBM - Determines memory location of adjacency list array, DRAM or HBM. | |
177 | PDF_MEM_TYPE HBM - Determines memory location of PDF array, DRAM or HBM. | |
178 | SOFTWARE_PREFETCH_LOOKAHEAD_L1 int >= 0 0 Software prefetch lookahead of elements into L1 cache, value is multiplied by vector size (``VSIZE``). | |
179 | SOFTWARE_PREFETCH_LOOKAHEAD_L2 int >= 0 0 Software prefetch lookahead of elements into L2 cache, value is multiplied by vector size (``VSIZE``). | |
180 | ============================== ======== ======== ====================== | |
181 | ||
182 | Please note that these options require AVX-512 PF support of the target processor. | |
183 | ||
10988083 MW |
184 | Invocation |
185 | ========== | |
186 | ||
e3f82424 | 187 | Running the binary will print, along with the GPL licence header, a line like the following: :: |
10988083 MW |
188 | |
189 | LBM Benchmark Kernels 0.1, compiled Jul 5 2017 21:59:22, type: verification | |
190 | ||
e3f82424 | 191 | if verification was enabled during compilation or :: |
10988083 MW |
192 | |
193 | LBM Benchmark Kernels 0.1, compiled Jul 5 2017 21:59:22, type: benchmark | |
194 | ||
195 | if verification was disabled during compilation. | |
196 | ||
197 | Command Line Parameters | |
198 | ----------------------- | |
199 | ||
200 | Running the binary with ``-h`` lists all available parameters: :: | |
201 | ||
202 | Usage: | |
203 | ./lbmbenchk -list | |
204 | ./lbmbenchk | |
8cafd9ea | 205 | [-dims XxYxZ] [-geometry box|channel|pipe|blocks[-<block size>]] [-iterations <iterations>] [-lattice-dump-ascii] |
10988083 MW |
206 | [-rho-in <density>] [-rho-out <density] [-omega <omega>] [-kernel <kernel>] |
207 | [-periodic-x] | |
208 | [-t <number of threads>] | |
209 | [-pin core{,core}*] | |
210 | [-verify] | |
211 | -- <kernel specific parameters> | |
212 | ||
213 | -list List available kernels. | |
214 | ||
215 | -dims XxYxZ Specify geometry dimensions. | |
216 | ||
217 | -geometry blocks-<block size> | |
218 | Geometry with blocks of size <block size> regularly laid out. | |
219 | ||
220 | ||
221 | If an option is specified multiple times the last one overrides previous ones. | |
222 | This also holds true for ``-verify`` which sets geometry dimensions, | |
223 | iterations, etc., which can afterwards be overridden, e.g.: :: | |
224 | ||
0fde6e45 | 225 | $ bin/lbmbenchk-linux-intel-release-dp -verify -dims 32x32x32 |
10988083 | 226 | |
0095f461 | 227 | Kernel specific parameters can be obtained by selecting the specific kernel |
10988083 MW |
228 | and passing ``-h`` as parameter: :: |
229 | ||
0fde6e45 | 230 | $ bin/lbmbenchk-linux-intel-release-dp -kernel kernel-name -- -h |
10988083 MW |
231 | ... |
232 | Kernel parameters: | |
233 | [-blk <n>] [-blk-[xyz] <n>] | |
234 | ||
235 | ||
236 | A list of all available kernels can be obtained via ``-list``: :: | |
237 | ||
0fde6e45 | 238 | $ ../bin/lbmbenchk-linux-gcc-debug-dp -list |
10988083 MW |
239 | Lattice Boltzmann Benchmark Kernels (LbmBenchKernels) Copyright (C) 2016, 2017 LSS, RRZE |
240 | This program comes with ABSOLUTELY NO WARRANTY; for details see LICENSE. | |
241 | This is free software, and you are welcome to redistribute it under certain conditions. | |
242 | ||
243 | LBM Benchmark Kernels 0.1, compiled Jul 5 2017 21:59:22, type: verification | |
244 | Available kernels to benchmark: | |
245 | list-aa-pv-soa | |
246 | list-aa-ria-soa | |
247 | list-aa-soa | |
248 | list-aa-aos | |
249 | list-pull-split-nt-1s-soa | |
250 | list-pull-split-nt-2s-soa | |
251 | list-push-soa | |
252 | list-push-aos | |
253 | list-pull-soa | |
254 | list-pull-aos | |
255 | push-soa | |
256 | push-aos | |
257 | pull-soa | |
258 | pull-aos | |
259 | blk-push-soa | |
260 | blk-push-aos | |
261 | blk-pull-soa | |
262 | blk-pull-aos | |
263 | ||
e3f82424 MW |
264 | Kernels |
265 | ------- | |
266 | ||
267 | The following list briefly describes the available kernels: | |
268 | ||
0fde6e45 | 269 | - **push-soa/push-aos/pull-soa/pull-aos**: |
e3f82424 MW |
270 | Unoptimized kernels (but stream/collide are already fused) using two grids as |
271 | source and destination. Implement push/pull semantics as well as structure of | |
272 | arrays (soa) or array of structures (aos) layout. | |
273 | ||
0fde6e45 | 274 | - **blk-push-soa/blk-push-aos/blk-pull-soa/blk-pull-aos**: |
e3f82424 MW |
275 | The same as the unoptimized kernels without the blk prefix, except that they support |
276 | spatial blocking, i.e. loop blocking of the three loops used to iterate over | |
277 | the lattice. Here manual work sharing for OpenMP is used. | |
278 | ||
0fde6e45 MW |
279 | - **aa-aos/aa-soa**: |
280 | Straightforward implementation of AA pattern on full array with blocking support. | |
281 | Manual work sharing for OpenMP is used. Domain is partitioned only along the x dimension. | |
282 | ||
283 | - **aa-vec-soa/aa-vec-sl-soa**: | |
284 | Optimized AA kernel with intrinsics on full array. aa-vec-sl-soa uses only | |
285 | one loop for iterating over the lattice instead of three nested ones. | |
286 | ||
287 | - **list-push-soa/list-push-aos/list-pull-soa/list-pull-aos**: | |
e3f82424 MW |
288 | The same as the unoptimized kernels without the list prefix, but for indirect addressing. |
289 | Here only a 1D vector is used to store the fluid nodes, omitting the | |
290 | obstacles. An adjacency list is used to recover the neighborhood associations. | |
291 | ||
0fde6e45 | 292 | - **list-pull-split-nt-1s-soa/list-pull-split-nt-2s-soa**: |
e3f82424 MW |
293 | Optimized variant of list-pull-soa. Chunks of the lattice are processed at |
294 | once. Postcollision values are written back via nontemporal stores in 18 (1s) | |
295 | or 9 (2s) loops. | |
296 | ||
0fde6e45 | 297 | - **list-aa-aos/list-aa-soa**: |
e3f82424 MW |
298 | Unoptimized implementation of the AA pattern for the 1D vector with adjacency |
299 | list. Both array of structures (aos) and structure of arrays (soa) | |
300 | data layouts are supported. | |
301 | ||
0fde6e45 | 302 | - **list-aa-ria-soa**: |
e3f82424 MW |
303 | Implementation of AA pattern with intrinsics for the 1D vector with adjacency |
304 | list. Furthermore it contains a vectorized even time step and run length | |
305 | coding to reduce the loop balance of the odd time step. | |
306 | ||
0fde6e45 | 307 | - **list-aa-pv-soa**: |
e3f82424 MW |
308 | All optimizations of list-aa-ria-soa. Additionally with partial vectorization |
309 | of the odd time step. | |
310 | ||
311 | ||
312 | Note that all array of structures (aos) kernels might require blocking | |
313 | (depending on the domain size) to reach the performance of their structure of | |
314 | arrays (soa) counterparts. | |
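
The difference between the two data layouts can be illustrated with a minimal Python sketch. The function names and the index order are assumptions made for illustration only; the index computation in the benchmark sources may differ.

```python
# Illustrative only: linear index of PDF direction `d` at node `n` for a
# D3Q19 lattice with `n_nodes` nodes. All names here are hypothetical.
N_DIRS = 19  # D3Q19: 19 discrete velocities per node


def index_aos(n, d, n_nodes):
    """Array of structures: the 19 PDFs of one node are contiguous."""
    return n * N_DIRS + d


def index_soa(n, d, n_nodes):
    """Structure of arrays: all nodes of one direction are contiguous."""
    return d * n_nodes + n


n_nodes = 1000
# Distance (in elements) between the same direction of two neighboring nodes.
stride_aos = index_aos(1, 0, n_nodes) - index_aos(0, 0, n_nodes)
stride_soa = index_soa(1, 0, n_nodes) - index_soa(0, 0, n_nodes)
print("AoS stride:", stride_aos)
print("SoA stride:", stride_soa)
```

With AoS the same direction of consecutive nodes is 19 elements apart, which is one reason why blocking can help these kernels; with SoA it is a unit stride.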
315 | ||
316 | The following table summarizes the properties of the kernels. Here **D** means | |
317 | direct addressing, i.e. full array, **I** means indirect addressing, i.e. 1D | |
318 | vector with adjacency list, **x** means supported, whereas **--** means unsupported. | |
0fde6e45 | 319 | The loop balance B_l is computed for D3Q19 model with **double precision** floating |
e3f82424 MW |
320 | point for PDFs (8 byte) and 4 byte integers for the index (adjacency list). |
321 | As list-aa-ria-soa and list-aa-pv-soa support run length coding their effective | |
322 | loop balance depends on the geometry. The effective loop balance is printed | |
323 | during each run. | |
324 | ||
325 | ||
326 | ====================== =========== =========== ===== ======== ======== ============ | |
327 | kernel name prop. step data layout addr. parallel blocking B_l [B/FLUP] | |
328 | ====================== =========== =========== ===== ======== ======== ============ | |
329 | push-soa OS SoA D x -- 456 | |
330 | push-aos OS AoS D x -- 456 | |
331 | pull-soa OS SoA D x -- 456 | |
332 | pull-aos OS AoS D x -- 456 | |
333 | blk-push-soa OS SoA D x x 456 | |
334 | blk-push-aos OS AoS D x x 456 | |
335 | blk-pull-soa OS SoA D x x 456 | |
336 | blk-pull-aos OS AoS D x x 456 | |
0fde6e45 MW |
337 | aa-soa AA SoA D x x 304 |
338 | aa-aos AA AoS D x x 304 | |
339 | aa-vec-soa AA SoA D x x 304 | |
340 | aa-vec-sl-soa AA SoA D x x 304 | |
e3f82424 MW |
341 | list-push-soa OS SoA I x x 528 |
342 | list-push-aos OS AoS I x x 528 | |
343 | list-pull-soa OS SoA I x x 528 | |
344 | list-pull-aos OS AoS I x x 528 | |
345 | list-pull-split-nt-1s OS SoA I x x 376 | |
346 | list-pull-split-nt-2s OS SoA I x x 376 | |
347 | list-aa-soa AA SoA I x x 340 | |
348 | list-aa-aos AA AoS I x x 340 | |
349 | list-aa-ria-soa AA SoA I x x 304-342 | |
350 | list-aa-pv-soa AA SoA I x x 304-342 | |
351 | ====================== =========== =========== ===== ======== ======== ============ | |
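
The fixed loop balance values in the table can be reproduced with simple byte accounting. The following Python sketch is an interpretation, not taken from the benchmark sources: it assumes 19 PDF loads and 19 PDF stores per fluid lattice update, an additional write-allocate transfer for regular stores in the OS kernels, 18 adjacency list indices per update for indirect addressing, and that the list-based AA kernels need the adjacency list only in every second time step.

```python
# Assumed byte accounting per fluid lattice update (FLUP) for D3Q19.
Q = 19         # PDFs per node
PDF_BYTES = 8  # double precision
IDX_BYTES = 4  # one adjacency list index


def loop_balance(write_allocate, index_loads):
    """Bytes transferred per FLUP under the assumptions stated above."""
    loads = Q * PDF_BYTES
    stores = Q * PDF_BYTES
    allocate = Q * PDF_BYTES if write_allocate else 0
    return loads + stores + allocate + index_loads * IDX_BYTES


balances = {
    "push/pull, direct (OS)": loop_balance(True, 0),
    "aa, direct": loop_balance(False, 0),
    "list-push/list-pull (OS)": loop_balance(True, 18),
    "list-pull-split-nt (nontemporal stores)": loop_balance(False, 18),
    # 18 indices in one of two time steps, i.e. 9 on average (assumption).
    "list-aa": loop_balance(False, 9),
}

for name, b_l in balances.items():
    print(f"{name}: {b_l} B/FLUP")
```

The geometry dependent range of list-aa-ria-soa and list-aa-pv-soa is not covered by this sketch, as it depends on the run length coding of the actual geometry.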
10988083 MW |
352 | |
353 | Benchmarking | |
354 | ============ | |
355 | ||
356 | Correct benchmarking is a nontrivial task. Whenever benchmark results should be | |
357 | created make sure the binary was compiled with: | |
358 | ||
e3f82424 MW |
359 | - ``BENCHMARK=on`` (default if not overridden) and |
360 | - ``BUILD=release`` (default if not overridden) and | |
8cafd9ea | 361 | - the correct ISA for macros (i.e. intrinsics) is used, selected via ``ISA`` and |
10988083 | 362 | - use ``TARCH`` to specify the architecture the compiler generates code for. |
0095f461 MW |
363 | |
364 | Intel Compiler | |
365 | -------------- | |
366 | ||
367 | For the Intel compiler one can specify depending on the target ISA extension: | |
368 | ||
8cafd9ea | 369 | - SSE: ``TARCH=-xSSE4.2`` |
0095f461 MW |
370 | - AVX: ``TARCH=-xAVX`` |
371 | - AVX2 and FMA: ``TARCH=-xCORE-AVX2,-fma`` | |
372 | - AVX512: ``TARCH=-xCORE-AVX512`` | |
373 | - KNL: ``TARCH=-xMIC-AVX512`` | |
374 | ||
375 | Compiling for an architecture supporting AVX (Sandy Bridge, Ivy Bridge): :: | |
376 | ||
377 | make ISA=avx TARCH=-xAVX | |
378 | ||
379 | ||
380 | Compiling for an architecture supporting AVX2 (Haswell, Broadwell): :: | |
381 | ||
382 | make ISA=avx TARCH=-xCORE-AVX2,-fma | |
383 | ||
384 | WARNING: ISA is here still set to ``avx`` as the FMA intrinsics are currently not | |
385 | implemented. This might change in the future. | |
386 | ||
387 | ||
8cafd9ea MW |
388 | .. TODO: add isa=avx512 and add docu for knl |
389 | ||
390 | .. TODO: no prefetching if AVX-512 PF is not supported | |
391 | ||
0095f461 MW |
392 | Compiling for an architecture supporting AVX-512 (Skylake): :: |
393 | ||
8cafd9ea MW |
394 | make ISA=avx512 TARCH=-xCORE-AVX512 |
395 | ||
396 | Please note that for the AVX512 gather kernels software prefetching for the | |
397 | gather instructions is disabled by default. | |
398 | To enable it set ``SOFTWARE_PREFETCH_LOOKAHEAD_L1`` and/or | |
399 | ``SOFTWARE_PREFETCH_LOOKAHEAD_L2`` to a value greater than ``0`` during | |
400 | compilation. Note that this requires AVX-512 PF support from the target | |
401 | processor. | |
402 | ||
403 | Compiling for MIC architecture KNL supporting AVX-512 and AVX-512 PF:: | |
404 | ||
405 | make ISA=avx512 TARCH=-xMIC-AVX512 | |
406 | ||
407 | or optionally with software prefetch enabled:: | |
408 | ||
409 | make ISA=avx512 TARCH=-xMIC-AVX512 SOFTWARE_PREFETCH_LOOKAHEAD_L1=<value> SOFTWARE_PREFETCH_LOOKAHEAD_L2=<value> | |
410 | ||
0095f461 | 411 | |
0095f461 MW |
412 | |
413 | ||
414 | Pinning | |
415 | ------- | |
10988083 MW |
416 | |
417 | During benchmarking pinning should be used via the ``-pin`` parameter. Running | |
0095f461 | 418 | a benchmark with 10 threads pinned to the first 10 cores works like :: |
10988083 | 419 | |
0fde6e45 | 420 | $ bin/lbmbenchk-linux-intel-release-dp ... -t 10 -pin $(seq -s , 0 9) |
10988083 | 421 | |
0095f461 MW |
422 | |
423 | General Remarks | |
424 | --------------- | |
425 | ||
426 | Things the binary does not check or control: | |
10988083 MW |
427 | |
428 | - transparent huge pages: when allocating memory small 4 KiB pages might be | |
429 | replaced with larger ones. This is in general a good thing, but whether it | |
e3f82424 MW |
430 | actually happens depends on the system settings (check e.g. the status of |
431 | ``/sys/kernel/mm/transparent_hugepage/enabled``). | |
432 | Currently ``madvise(MADV_HUGEPAGE)`` is used for allocations which are aligned to | |
433 | a 4 KiB page, which should be the case for the lattices. | |
434 | This should result in huge pages unless THP is disabled on the machine. | |
435 | (NOTE: madvise() is used if ``HAVE_HUGE_PAGES`` is defined, which is currently | |
436 | hard-coded in ``Memory.c``). | |
10988083 MW |
437 | |
438 | - CPU/core frequency: For reproducible results the frequency of all cores | |
439 | should be fixed. | |
440 | ||
441 | - NUMA placement policy: The benchmark assumes a first touch policy, which | |
442 | means the memory will be placed at the NUMA domain the touching core is | |
443 | associated with. If a different policy is in place or the NUMA domain to be | |
444 | used is already full memory might be allocated in a remote domain. Accesses | |
445 | to remote domains typically have a higher latency and lower bandwidth. | |
446 | ||
0095f461 | 447 | - System load: interference with other applications, especially on desktop |
10988083 MW |
448 | systems should be avoided. |
449 | ||
e3f82424 MW |
450 | - Padding: For SoA based kernels the number of (fluid) nodes is automatically |
451 | adjusted so that no cache or TLB thrashing should occur. The parameters are | |
452 | optimized for current Intel based systems. For more details look into the | |
453 | padding section. | |
10988083 MW |
454 | |
455 | - CPU dispatcher function: the compiler might add different versions of a | |
456 | function for different ISA extensions. Make sure the code you think is | |
457 | executed is actually the code that is executed. | |
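
The transparent huge page setting mentioned in the list above can be inspected before a benchmark run. The following Python sketch is not part of the benchmark; the helper names are hypothetical. It relies on the Linux convention that the active mode in ``/sys/kernel/mm/transparent_hugepage/enabled`` is the word enclosed in square brackets, e.g. ``always [madvise] never``.

```python
from pathlib import Path

THP_ENABLED = Path("/sys/kernel/mm/transparent_hugepage/enabled")


def parse_thp_mode(text):
    """Return the active THP mode, i.e. the word in square brackets."""
    for word in text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    return None


def thp_mode(path=THP_ENABLED):
    """Read the THP mode of the running system, or None if unavailable."""
    try:
        return parse_thp_mode(path.read_text())
    except OSError:
        # File not present or not readable: THP unsupported or not Linux.
        return None


if __name__ == "__main__":
    print("THP mode:", thp_mode())
```

A result of ``never`` would mean that the ``madvise(MADV_HUGEPAGE)`` hint has no effect on the machine.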
458 | ||
e3f82424 MW |
459 | Padding |
460 | ------- | |
461 | ||
462 | With correct padding cache and TLB thrashing can be avoided. Therefore the | |
463 | number of (fluid) nodes used in the data layout is artificially increased. | |
464 | ||
465 | Currently automatic padding is active for kernels which support it. It can be | |
466 | controlled via the kernel parameter (i.e. parameter after the ``--``) | |
467 | ``-pad``. Supported values are ``auto`` (default), ``no`` (to disable padding), | |
468 | or a manual padding. | |
469 | ||
470 | Automatic padding tries to avoid cache and TLB thrashing and pads for a 32 | |
471 | entry (huge pages) TLB with 8 sets and a 512 set (L2) cache. This reflects the | |
472 | parameters of current Intel based processors. | |
473 | ||
474 | Manual padding is done via a padding string and has the format | |
475 | ``mod_1+offset_1(,mod_n+offset_n)``, which specifies numbers of bytes. | |
476 | SoA data layouts can exhibit TLB thrashing. Therefore we want to distribute the | |
477 | 19 pages with one lattice (36 with two lattices) we are concurrently accessing | |
478 | over as many sets in the TLB as possible. | |
479 | This is controlled by the distance between the accessed pages, which is the | |
480 | number of (fluid) nodes in between them and can be adjusted by adding further | |
481 | (fluid) nodes. | |
482 | We want the distance d (in bytes) between two accessed pages to be e.g. | |
483 | **d % (PAGE_SIZE * TLB_SETS) = PAGE_SIZE**. | |
484 | This would distribute the pages evenly over the sets. Hereby **PAGE_SIZE * TLB_SETS** | |
485 | would be our ``mod_1`` and **PAGE_SIZE** (after the =) our ``offset_1``. | |
486 | Measurements show that with only a quarter or half of a page size as offset | |
487 | higher performance is achieved, which is done by automatic padding. | |
488 | On top of this padding more paddings can be added. They are just added to the | |
489 | padding string and are separated by commas. | |
490 | ||
491 | A zero modulus in the padding string has a special meaning. Here the | |
492 | corresponding offset is just added to the number of nodes. A padding string | |
493 | like ``-pad 0+16`` would add a static padding of two nodes (one node = 8 bytes). | |
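
How a padding string could be evaluated is sketched below in Python. This is an interpretation of the description above, not the benchmark's implementation: the function names are hypothetical, the terms are assumed to be applied in the order given, and moduli and offsets are assumed to be multiples of the node size. The page size and number of TLB sets in the example are placeholder values.

```python
NODE_BYTES = 8  # one node = 8 bytes, as stated above


def parse_padding(pad):
    """Parse 'mod_1+offset_1,mod_2+offset_2,...' into (mod, offset) pairs."""
    pairs = []
    for term in pad.split(","):
        mod, offset = term.split("+")
        pairs.append((int(mod), int(offset)))
    return pairs


def padded_nodes(n_nodes, pad):
    """Return the padded number of nodes for a padding string (sketch)."""
    size = n_nodes * NODE_BYTES
    for mod, offset in parse_padding(pad):
        if mod % NODE_BYTES or offset % NODE_BYTES:
            raise ValueError("mod and offset must be multiples of the node size")
        if mod == 0:
            # Zero modulus: the offset is simply added.
            size += offset
        else:
            # Grow the size until size % mod == offset holds.
            size += (offset - size % mod) % mod
    return size // NODE_BYTES


# Static padding of two nodes.
print(padded_nodes(100, "0+16"))

# Hypothetical values: 4 KiB pages and 8 TLB sets.
PAGE_SIZE, TLB_SETS = 4096, 8
n = padded_nodes(500 * 100 * 100, f"{PAGE_SIZE * TLB_SETS}+{PAGE_SIZE}")
print(n, (n * NODE_BYTES) % (PAGE_SIZE * TLB_SETS))
```

Note that with several terms a later term can change the remainder established by an earlier one; how the benchmark resolves this is not described here.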
494 | ||
495 | ||
496 | Geometries | |
497 | ========== | |
498 | ||
0095f461 MW |
499 | TODO: supported geometries: channel, pipe, blocks, fluid |
500 | ||
501 | ||
502 | Performance Results | |
503 | =================== | |
504 | ||
505 | This section lists performance values measured on several machines for | |
0fde6e45 | 506 | different kernels and geometries and **double precision** floating point data/arithmetic. |
0095f461 MW |
507 | The **RFM** column denotes the expected performance as predicted by the |
508 | Roofline performance model [williams-2008]_. | |
509 | For performance prediction of each kernel a memory bandwidth benchmark is used | |
510 | which mimics the kernel's memory access pattern and the kernel's loop balance | |
511 | (see [kernels]_ for details). | |
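
For a memory bound kernel the Roofline prediction reduces to dividing the attainable memory bandwidth by the loop balance. The following Python sketch shows this arithmetic; the function name is hypothetical, 1 GB is taken as 10^9 bytes, and pairing the copy-19 bandwidth with an OS kernel of 456 B/FLUP is an assumption made for the example.

```python
def roofline_mflups(bandwidth_gb_s, loop_balance_b_per_flup):
    """Bandwidth limited performance in MFLUP/s (1 GB = 1e9 bytes assumed)."""
    flups_per_second = bandwidth_gb_s * 1e9 / loop_balance_b_per_flup
    return flups_per_second / 1e6


# Example: 32.7 GB/s (copy-19 on the Ivy Bridge machine) and 456 B/FLUP.
prediction = roofline_mflups(32.7, 456)
print(f"{prediction:.1f} MFLUP/s")
```

Under these assumptions the prediction is roughly 71.7 MFLUP/s for the full socket.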
512 | ||
0fde6e45 MW |
513 | Machine Specifications |
514 | ---------------------- | |
515 | ||
516 | **Ivy Bridge, Intel Xeon E5-2660 v2** | |
517 | ||
518 | - Ivy Bridge architecture, AVX | |
519 | - 10 cores, 2.2 GHz | |
520 | - SMT enabled | |
521 | - memory bandwidth: | |
522 | ||
523 | - copy-19 32.7 GB/s | |
524 | - copy-19-nt-sl 35.6 GB/s | |
525 | - update-19 37.4 GB/s | |
526 | ||
527 | **Haswell, Intel Xeon E5-2695 v3** | |
0095f461 MW |
528 | |
529 | - Haswell architecture, AVX2, FMA | |
0fde6e45 | 530 | - 14 cores, 2.3 GHz |
0095f461 MW |
531 | - 2 x 7 cores in cluster-on-die (CoD) mode enabled |
532 | - SMT enabled | |
0fde6e45 | 533 | - memory bandwidth: |
0095f461 | 534 | |
0fde6e45 MW |
535 | - copy-19 47.3 GB/s |
536 | - copy-19-nt-sl 47.1 GB/s | |
537 | - update-19 44.0 GB/s | |
538 | ||
539 | ||
540 | **Broadwell, Intel Xeon E5-2630 v4** | |
0095f461 MW |
541 | |
542 | - Broadwell architecture, AVX2, FMA | |
543 | - 10 cores, 2.2 GHz | |
544 | - SMT disabled | |
0fde6e45 MW |
545 | - memory bandwidth: |
546 | ||
547 | - copy-19 48.0 GB/s | |
548 | - copy-19-nt-sl 48.2 GB/s | |
549 | - update-19 51.1 GB/s | |
550 | ||
551 | **Skylake, Intel Xeon Gold 6148** | |
0095f461 | 552 | |
0fde6e45 | 553 | - Skylake server architecture, AVX2, AVX512, 2 FMA units |
0095f461 MW |
554 | - 20 cores, 2.4 GHz |
555 | - SMT enabled | |
0fde6e45 MW |
556 | - memory bandwidth: |
557 | ||
558 | - copy-19 89.7 GB/s | |
559 | - copy-19-nt-sl 92.4 GB/s | |
560 | - update-19 93.6 GB/s | |
561 | ||
562 | **Zen, AMD EPYC 7451** | |
563 | ||
564 | - Zen architecture, AVX2, FMA | |
565 | - 24 cores, 2.3 GHz | |
566 | - SMT enabled | |
567 | - memory bandwidth: | |
568 | ||
569 | - copy-19 111.9 GB/s | |
570 | - copy-19-nt-sl 111.7 GB/s | |
571 | - update-19 109.2 GB/s | |
572 | ||
573 | **Zen, AMD Ryzen 7 1700X** | |
574 | ||
575 | - Zen architecture, AVX2, FMA | |
576 | - 8 cores, 3.4 GHz | |
577 | - SMT enabled | |
578 | - memory bandwidth: | |
579 | ||
580 | - copy-19 27.2 GB/s | |
581 | - copy-19-nt-sl 27.1 GB/s | |
582 | - update-19 26.1 GB/s | |
583 | ||
584 | Single Socket Results | |
585 | --------------------- | |
586 | ||
587 | - Geometry dimensions are for all measurements 500x100x100 nodes. | |
588 | - Note the **different scaling on the y axis** of the plots! | |
589 | ||
590 | .. |perf_emmy_dp| image:: images/benchmark-emmy-dp.png | |
591 | :scale: 50 % | |
592 | .. |perf_emmy_sp| image:: images/benchmark-emmy-sp.png | |
593 | :scale: 50 % | |
594 | .. |perf_hasep1_dp| image:: images/benchmark-hasep1-dp.png | |
595 | :scale: 50 % | |
596 | .. |perf_hasep1_sp| image:: images/benchmark-hasep1-sp.png | |
597 | :scale: 50 % | |
598 | .. |perf_meggie_dp| image:: images/benchmark-meggie-dp.png | |
599 | :scale: 50 % | |
600 | .. |perf_meggie_sp| image:: images/benchmark-meggie-sp.png | |
601 | :scale: 50 % | |
602 | .. |perf_skylakesp2_dp| image:: images/benchmark-skylakesp2-dp.png | |
603 | :scale: 50 % | |
604 | .. |perf_skylakesp2_sp| image:: images/benchmark-skylakesp2-sp.png | |
605 | :scale: 50 % | |
606 | .. |perf_summitridge1_dp| image:: images/benchmark-summitridge1-dp.png | |
607 | :scale: 50 % | |
608 | .. |perf_summitridge1_sp| image:: images/benchmark-summitridge1-sp.png | |
609 | :scale: 50 % | |
610 | .. |perf_naples1_dp| image:: images/benchmark-naples1-dp.png | |
611 | :scale: 50 % | |
612 | .. |perf_naples1_sp| image:: images/benchmark-naples1-sp.png | |
613 | :scale: 50 % | |
614 | ||
615 | .. list-table:: | |
616 | ||
617 | * - Ivy Bridge, Intel Xeon E5-2660 v2, Double Precision | |
618 | * - |perf_emmy_dp| | |
619 | * - Ivy Bridge, Intel Xeon E5-2660 v2, Single Precision | |
620 | * - |perf_emmy_sp| | |
621 | * - Haswell, Intel Xeon E5-2695 v3, Double Precision | |
622 | * - |perf_hasep1_dp| | |
623 | * - Haswell, Intel Xeon E5-2695 v3, Single Precision | |
624 | * - |perf_hasep1_sp| | |
625 | * - Broadwell, Intel Xeon E5-2630 v4, Double Precision | |
626 | * - |perf_meggie_dp| | |
627 | * - Broadwell, Intel Xeon E5-2630 v4, Single Precision | |
628 | * - |perf_meggie_sp| | |
9e0051cb | 629 | * - Skylake, Intel Xeon Gold 6148, Double Precision |
0fde6e45 | 630 | * - |perf_skylakesp2_dp| |
9e0051cb | 631 | * - Skylake, Intel Xeon Gold 6148, Single Precision |
0fde6e45 MW |
632 | * - |perf_skylakesp2_sp| |
633 | * - Zen, AMD Ryzen 7 1700X, Double Precision | |
634 | * - |perf_summitridge1_dp| | |
635 | * - Zen, AMD Ryzen 7 1700X, Single Precision | |
636 | * - |perf_summitridge1_sp| | |
637 | * - Zen, AMD EPYC 7451, Double Precision | |
638 | * - |perf_naples1_dp| | |
639 | * - Zen, AMD EPYC 7451, Single Precision | |
640 | * - |perf_naples1_sp| | |
0095f461 | 641 | |
e3f82424 MW |
642 | |
643 | Licence | |
644 | ======= | |
645 | ||
646 | The Lattice Boltzmann Benchmark Kernels are licensed under GPLv3. | |
647 | ||
4e91c4b6 MW |
648 | |
649 | Acknowledgements | |
650 | ================ | |
651 | ||
8b9da565 MW |
652 | If you use the benchmark kernels you can cite us: |
653 | ||
654 | M. Wittmann, V. Haag, T. Zeiser, H. Köstler, and G. Wellein: Lattice Boltzmann | |
655 | Benchmark Kernels as a Testbed for Performance Analysis, (2018), Computers & | |
656 | Fluids, Special Issue DSFD2017. doi:10.1016/j.compfluid.2018.03.030. | |
657 | ||
658 | Bibtex entry:: | |
659 | ||
660 | @article{wittmann-2018, | |
661 | author = {M. Wittmann and V. Haag and T. Zeiser and H. K\"ostler and G. Wellein}, | |
662 | title = {Lattice {B}oltzmann benchmark kernels as a testbed for performance analysis}, | |
663 | journal = {Computers \& Fluids}, | |
664 | year = {2018}, | |
665 | issn = {0045-7930}, | |
666 | doi = {10.1016/j.compfluid.2018.03.030}, | |
667 | } | |
668 | ||
4e91c4b6 MW |
669 | This work was funded by BMBF, grant no. 01IH15003A (project SKAMPY). |
670 | ||
671 | This work was funded by KONWHIR project OMI4PAPS. | |
672 | ||
673 | ||
0095f461 MW |
674 | Bibliography |
675 | ============ | |
676 | ||
677 | .. [ginzburg-2008] | |
678 | I. Ginzburg, F. Verhaeghe, and D. d'Humières. | |
679 | Two-relaxation-time lattice Boltzmann scheme: About parametrization, velocity, pressure and mixed boundary conditions. | |
680 | Commun. Comput. Phys., 3(2):427-478, 2008. | |
681 | ||
682 | .. [williams-2008] | |
683 | S. Williams, A. Waterman, and D. Patterson. | |
684 | Roofline: an insightful visual performance model for multicore architectures. | |
685 | Commun. ACM, 52(4):65-76, Apr 2009. doi:10.1145/1498765.1498785 | |
686 | ||
4e91c4b6 | 687 | |
10988083 MW |
688 | .. |datetime| date:: %Y-%m-%d %H:%M |
689 | ||
690 | Document was generated at |datetime|. | |
691 |