Commit | Line | Data |
---|---|---|
10988083 MW |
1 | .. # -------------------------------------------------------------------------- |
2 | # | |
3 | # Copyright | |
4 | # Markus Wittmann, 2016-2017 | |
5 | # RRZE, University of Erlangen-Nuremberg, Germany | |
6 | # markus.wittmann -at- fau.de or hpc -at- rrze.fau.de | |
7 | # | |
8 | # Viktor Haag, 2016 | |
9 | # LSS, University of Erlangen-Nuremberg, Germany | |
10 | # | |
11 | # This file is part of the Lattice Boltzmann Benchmark Kernels (LbmBenchKernels). | |
12 | # | |
13 | # LbmBenchKernels is free software: you can redistribute it and/or modify | |
14 | # it under the terms of the GNU General Public License as published by | |
15 | # the Free Software Foundation, either version 3 of the License, or | |
16 | # (at your option) any later version. | |
17 | # | |
18 | # LbmBenchKernels is distributed in the hope that it will be useful, | |
19 | # but WITHOUT ANY WARRANTY; without even the implied warranty of | |
20 | # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the | |
21 | # GNU General Public License for more details. | |
22 | # | |
23 | # You should have received a copy of the GNU General Public License | |
24 | # along with LbmBenchKernels. If not, see <http://www.gnu.org/licenses/>. | |
25 | # | |
26 | # -------------------------------------------------------------------------- | |
27 | ||
28 | .. title:: LBM Benchmark Kernels Documentation | |
29 | ||
30 | ||
31 | =================================== | |
32 | LBM Benchmark Kernels Documentation | |
33 | =================================== | |
34 | ||
35 | .. sectnum:: | |
36 | .. contents:: | |
37 | ||
0095f461 MW |
38 | Introduction |
39 | ============ | |
40 | ||
41 | The lattice Boltzmann (LBM) benchmark kernels are a collection of LBM kernel | |
42 | implementations. | |
43 | ||
44 | **AS SUCH THE LBM BENCHMARK KERNELS ARE NOT A FULLY EQUIPPED CFD SOLVER AND SOLELY | |
45 | SERVE THE PURPOSE OF STUDYING POSSIBLE PERFORMANCE OPTIMIZATIONS AND/OR | |
46 | EXPERIMENTS.** | |
47 | ||
48 | Currently all kernels utilize a D3Q19 discretization and the | |
49 | two-relaxation-time (TRT) collision operator [ginzburg-2008]_. | |
50 | All operations are carried out in double precision arithmetic. | |
51 | ||
10988083 MW |
52 | Compilation |
53 | =========== | |
54 | ||
55 | The benchmark framework currently supports only Linux systems and the GCC and | |
56 | Intel compilers. Every other configuration probably requires adjustment inside | |
0095f461 | 57 | the code and the makefiles. Furthermore some code might be platform or at least |
10988083 MW |
58 | POSIX specific. |
59 | ||
60 | The benchmark can be built via ``make`` from the ``src`` subdirectory. This will | |
61 | generate one binary which hosts all implemented benchmark kernels. | |
62 | ||
63 | Binaries are located under the ``bin`` subdirectory and will have different names | |
64 | depending on compiler and build configuration. | |
65 | ||
0095f461 MW |
66 | Compilation can target debug or release builds. Combined with both build types |
67 | verification can be enabled, which increases the runtime and hence is not | |
68 | suited for benchmarking. | |
69 | ||
70 | ||
10988083 MW |
71 | Debug and Verification |
72 | ---------------------- | |
73 | ||
74 | :: | |
75 | ||
e3f82424 | 76 | make BUILD=debug BENCHMARK=off |
10988083 | 77 | |
e3f82424 | 78 | Running ``make`` with ``BUILD=debug`` builds the debug version of |
10988083 MW |
79 | the benchmark kernels: no optimizations are performed, line numbers and |
80 | debug symbols are included, and ``DEBUG`` is defined. The resulting | |
81 | binary will be found in the ``bin`` subdirectory and named | |
82 | ``lbmbenchk-linux-<compiler>-debug``. | |
83 | ||
e3f82424 MW |
84 | Specifying ``BENCHMARK=off`` turns on verification |
85 | (``VERIFICATION=on``), statistics (``STATISTICS=on``), and VTK output | |
10988083 MW |
86 | (``VTK_OUTPUT=on``). |
87 | ||
88 | Please note that the generated binary will therefore | |
89 | exhibit poor performance. | |
90 | ||
0095f461 MW |
91 | |
92 | Release and Verification | |
93 | ------------------------ | |
94 | ||
95 | Verification with the debug builds can be extremely slow. Hence verification | |
96 | capabilities can also be built into release builds: :: | |
97 | ||
98 | make BENCHMARK=off | |
99 | ||
100 | ||
10988083 MW |
101 | Benchmarking |
102 | ------------ | |
103 | ||
104 | To generate a binary for benchmarking run make with :: | |
105 | ||
e3f82424 | 106 | make |
10988083 | 107 | |
e3f82424 | 108 | By default ``BENCHMARK=on`` and ``BUILD=release`` are set, where |
0095f461 | 109 | ``BUILD=release`` turns optimizations on and ``BENCHMARK=on`` disables |
10988083 MW |
110 | verification, statistics, and VTK output. |
111 | ||
0095f461 MW |
112 | See Options Summary below for further description of options which can be |
113 | applied, e.g. TARCH as well as the Benchmarking section. | |
10988083 MW |
114 | |
115 | Compilers | |
116 | --------- | |
117 | ||
118 | Currently only the GCC and Intel compilers under Linux are supported. The | |
119 | configuration is chosen via ``CONFIG=linux-gcc`` or | |
120 | ``CONFIG=linux-intel``. | |
121 | ||
e3f82424 MW |
122 | |
123 | Cleaning | |
124 | -------- | |
125 | ||
126 | For each configuration and build (debug/release) a subdirectory under the | |
127 | ``src/obj`` directory is created where the dependency and object files are | |
128 | stored. | |
129 | With :: | |
130 | ||
131 | make CONFIG=... BUILD=... clean | |
132 | ||
133 | a specific combination is selected and cleaned, whereas with :: | |
134 | ||
135 | make clean-all | |
136 | ||
137 | all object and dependency files are deleted. | |
138 | ||
139 | ||
10988083 MW |
140 | Options Summary |
141 | --------------- | |
142 | ||
0095f461 | 143 | Options that can be specified when building the suite with make: |
10988083 MW |
144 | |
145 | ============= ======================= ============ ========================================================== | |
146 | name values default description | |
0095f461 | 147 | ============= ======================= ============ ========================================================== |
e3f82424 | 148 | BENCHMARK on, off on If enabled, disables VERIFICATION, STATISTICS, VTK_OUTPUT. If disabled enables the three former options. |
0095f461 | 149 | BUILD debug, release release debug: no optimization, debug symbols, DEBUG defined. release: optimizations enabled. |
10988083 | 150 | CONFIG linux-gcc, linux-intel linux-intel Select GCC or Intel compiler. |
0095f461 | 151 | ISA avx, sse avx Determines which ISA extension is used for macro definitions of the intrinsics. This is *not* the architecture the compiler generates code for. |
10988083 MW |
152 | OPENMP on, off on OpenMP, i.e. threading support. |
153 | STATISTICS on, off off View statistics, like density etc, during simulation. | |
e3f82424 | 154 | TARCH -- -- Via TARCH the architecture the compiler generates code for can be overridden. The value depends on the chosen compiler. |
10988083 MW |
155 | VERIFICATION on, off off Turn verification on/off. |
156 | VTK_OUTPUT on, off off Enable/Disable VTK file output. | |
157 | ============= ======================= ============ ========================================================== | |
158 | ||
159 | Invocation | |
160 | ========== | |
161 | ||
e3f82424 | 162 | Running the binary will print, along with the GPL licence header, a line like the following: :: |
10988083 MW |
163 | |
164 | LBM Benchmark Kernels 0.1, compiled Jul 5 2017 21:59:22, type: verification | |
165 | ||
e3f82424 | 166 | if verification was enabled during compilation or :: |
10988083 MW |
167 | |
168 | LBM Benchmark Kernels 0.1, compiled Jul 5 2017 21:59:22, type: benchmark | |
169 | ||
170 | if verification was disabled during compilation. | |
171 | ||
172 | Command Line Parameters | |
173 | ----------------------- | |
174 | ||
175 | Running the binary with ``-h`` lists all available parameters: :: | |
176 | ||
177 | Usage: | |
178 | ./lbmbenchk -list | |
179 | ./lbmbenchk | |
180 | [-dims XxYyZ] [-geometry box|channel|pipe|blocks[-<block size>]] [-iterations <iterations>] [-lattice-dump-ascii] | |
181 | [-rho-in <density>] [-rho-out <density] [-omega <omega>] [-kernel <kernel>] | |
182 | [-periodic-x] | |
183 | [-t <number of threads>] | |
184 | [-pin core{,core}*] | |
185 | [-verify] | |
186 | -- <kernel specific parameters> | |
187 | ||
188 | -list List available kernels. | |
189 | ||
190 | -dims XxYxZ Specify geometry dimensions. | |
191 | ||
192 | -geometry blocks-<block size> | |
193 | Geometetry with blocks of size <block size> regularily layout out. | |
194 | ||
195 | ||
196 | If an option is specified multiple times the last one overrides previous ones. | |
197 | This also holds for ``-verify``, which sets geometry dimensions, | |
198 | iterations, etc.; these can be overridden afterwards, e.g.: :: | |
199 | ||
200 | $ bin/lbmbenchk-linux-intel-release -verify -dims 32x32x32 | |
201 | ||
0095f461 | 202 | Kernel specific parameters can be obtained via selecting the specific kernel |
10988083 MW |
203 | and passing ``-h`` as parameter: :: |
204 | ||
e3f82424 | 205 | $ bin/lbmbenchk-linux-intel-release -kernel kernel-name -- -h |
10988083 MW |
206 | ... |
207 | Kernel parameters: | |
208 | [-blk <n>] [-blk-[xyz] <n>] | |
209 | ||
210 | ||
211 | A list of all available kernels can be obtained via ``-list``: :: | |
212 | ||
213 | $ ../bin/lbmbenchk-linux-gcc-debug -list | |
214 | Lattice Boltzmann Benchmark Kernels (LbmBenchKernels) Copyright (C) 2016, 2017 LSS, RRZE | |
215 | This program comes with ABSOLUTELY NO WARRANTY; for details see LICENSE. | |
216 | This is free software, and you are welcome to redistribute it under certain conditions. | |
217 | ||
218 | LBM Benchmark Kernels 0.1, compiled Jul 5 2017 21:59:22, type: verification | |
219 | Available kernels to benchmark: | |
220 | list-aa-pv-soa | |
221 | list-aa-ria-soa | |
222 | list-aa-soa | |
223 | list-aa-aos | |
224 | list-pull-split-nt-1s-soa | |
225 | list-pull-split-nt-2s-soa | |
226 | list-push-soa | |
227 | list-push-aos | |
228 | list-pull-soa | |
229 | list-pull-aos | |
230 | push-soa | |
231 | push-aos | |
232 | pull-soa | |
233 | pull-aos | |
234 | blk-push-soa | |
235 | blk-push-aos | |
236 | blk-pull-soa | |
237 | blk-pull-aos | |
238 | ||
e3f82424 MW |
239 | Kernels |
240 | ------- | |
241 | ||
242 | The following list briefly describes the available kernels: | |
243 | ||
244 | - push-soa/push-aos/pull-soa/pull-aos: | |
245 | Unoptimized kernels (but stream/collide are already fused) using two grids as | |
246 | source and destination. Implement push/pull semantics as well as structure of | |
247 | arrays (soa) or array of structures (aos) layout. | |
248 | ||
249 | - blk-push-soa/blk-push-aos/blk-pull-soa/blk-pull-aos: | |
250 | The same as the unoptimized kernels without the blk prefix, except that they support | |
251 | spatial blocking, i.e. loop blocking of the three loops used to iterate over | |
252 | the lattice. Here manual work sharing for OpenMP is used. | |
253 | ||
254 | - list-push-soa/list-push-aos/list-pull-soa/list-pull-aos: | |
255 | The same as the unoptimized kernels without the list prefix, but for indirect addressing. | |
256 | Here only a 1D vector is used to store the fluid nodes, omitting the | |
257 | obstacles. An adjacency list is used to recover the neighborhood associations. | |
258 | ||
259 | - list-pull-split-nt-1s-soa/list-pull-split-nt-2s-soa: | |
260 | Optimized variant of list-pull-soa. Chunks of the lattice are processed at | |
261 | once. Postcollision values are written back via nontemporal stores in 18 (1s) | |
262 | or 9 (2s) loops. | |
263 | ||
264 | - list-aa-aos/list-aa-soa: | |
265 | Unoptimized implementation of the AA pattern for the 1D vector with adjacency | |
266 | list. Both array of structures (aos) and structure of arrays (soa) | |
267 | data layouts are supported. | |
268 | ||
269 | - list-aa-ria-soa: | |
270 | Implementation of AA pattern with intrinsics for the 1D vector with adjacency | |
271 | list. Furthermore it contains a vectorized even time step and run length | |
272 | coding to reduce the loop balance of the odd time step. | |
273 | ||
274 | - list-aa-pv-soa: | |
275 | All optimizations of list-aa-ria-soa, plus partial vectorization | |
276 | of the odd time step. | |
277 | ||
278 | ||
279 | Note that all array of structures (aos) kernels might require blocking | |
280 | (depending on the domain size) to reach the performance of their structure of | |
281 | arrays (soa) counterparts. | |
282 | ||
283 | The following table summarizes the properties of the kernels. Here **D** means | |
284 | direct addressing, i.e. full array, **I** means indirect addressing, i.e. 1D | |
285 | vector with adjacency list, **x** means supported, whereas **--** means unsupported. | |
286 | The loop balance B_l is computed for the D3Q19 model with double precision floating | |
287 | point for PDFs (8 byte) and 4 byte integers for the index (adjacency list). | |
288 | As list-aa-ria-soa and list-aa-pv-soa support run length coding their effective | |
289 | loop balance depends on the geometry. The effective loop balance is printed | |
290 | during each run. | |
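
As a rough cross-check, the fixed loop balances in the table below can be
reproduced with a few lines of Python. This is only a sketch: the
decomposition into PDF transfers (load, store, and write-allocate) and index
loads is an assumption made here to match the listed numbers, and the names
``loop_balance`` and ``balances`` are hypothetical and not part of the
benchmark code.

.. code-block:: python

   PDFS = 19              # D3Q19: 19 PDFs per fluid node
   PDF_BYTES = 8          # double precision PDF
   IDX_BYTES = 4          # one adjacency list entry
   NEIGHBORS = PDFS - 1   # assumption: the rest PDF needs no neighbor index


   def loop_balance(pdf_transfers, index_loads=0):
       """Bytes per fluid lattice node update (B/FLUP).

       pdf_transfers: assumed memory transfers per PDF (load, store,
                      and write-allocate each count as one).
       index_loads:   assumed adjacency list entries read per update.
       """
       return PDFS * PDF_BYTES * pdf_transfers + index_loads * IDX_BYTES


   balances = {
       # load + write-allocate + store
       "push/pull (direct)": loop_balance(3),
       # same traffic plus 18 index loads
       "list-push/list-pull": loop_balance(3, NEIGHBORS),
       # nontemporal stores, assumed to avoid the write-allocate
       "list-pull-split-nt": loop_balance(2, NEIGHBORS),
       # in-place update; index loads assumed only in every second step
       "list-aa": loop_balance(2, NEIGHBORS // 2),
   }

   for name, value in balances.items():
       print(f"{name:22s} {value} B/FLUP")

Under these assumptions the sketch yields 456, 528, 376, and 340 B/FLUP. The
geometry-dependent range of the run length coded kernels is not modeled here.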
291 | ||
292 | ||
293 | ====================== =========== =========== ===== ======== ======== ============ | |
294 | kernel name prop. step data layout addr. parallel blocking B_l [B/FLUP] | |
295 | ====================== =========== =========== ===== ======== ======== ============ | |
296 | push-soa OS SoA D x -- 456 | |
297 | push-aos OS AoS D x -- 456 | |
298 | pull-soa OS SoA D x -- 456 | |
299 | pull-aos OS AoS D x -- 456 | |
300 | blk-push-soa OS SoA D x x 456 | |
301 | blk-push-aos OS AoS D x x 456 | |
302 | blk-pull-soa OS SoA D x x 456 | |
303 | blk-pull-aos OS AoS D x x 456 | |
304 | list-push-soa OS SoA I x x 528 | |
305 | list-push-aos OS AoS I x x 528 | |
306 | list-pull-soa OS SoA I x x 528 | |
307 | list-pull-aos OS AoS I x x 528 | |
308 | list-pull-split-nt-1s OS SoA I x x 376 | |
309 | list-pull-split-nt-2s OS SoA I x x 376 | |
310 | list-aa-soa AA SoA I x x 340 | |
311 | list-aa-aos AA AoS I x x 340 | |
312 | list-aa-ria-soa AA SoA I x x 304-342 | |
313 | list-aa-pv-soa AA SoA I x x 304-342 | |
314 | ====================== =========== =========== ===== ======== ======== ============ | |
10988083 MW |
315 | |
316 | Benchmarking | |
317 | ============ | |
318 | ||
319 | Correct benchmarking is a nontrivial task. Whenever benchmark results should be | |
320 | created make sure the binary was compiled with: | |
321 | ||
e3f82424 MW |
322 | - ``BENCHMARK=on`` (default if not overridden) and |
323 | - ``BUILD=release`` (default if not overridden) and | |
10988083 MW |
324 | - the correct ISA for the intrinsics macros, selected via ``ISA``, and |
325 | - ``TARCH`` set to the architecture the compiler should generate code for. | |
0095f461 MW |
326 | |
327 | Intel Compiler | |
328 | -------------- | |
329 | ||
330 | For the Intel compiler one can specify depending on the target ISA extension: | |
331 | ||
332 | - AVX: ``TARCH=-xAVX`` | |
333 | - AVX2 and FMA: ``TARCH=-xCORE-AVX2,-fma`` | |
334 | - AVX512: ``TARCH=-xCORE-AVX512`` | |
335 | - KNL: ``TARCH=-xMIC-AVX512`` | |
336 | ||
337 | Compiling for an architecture supporting AVX (Sandy Bridge, Ivy Bridge): :: | |
338 | ||
339 | make ISA=avx TARCH=-xAVX | |
340 | ||
341 | ||
342 | Compiling for an architecture supporting AVX2 (Haswell, Broadwell): :: | |
343 | ||
344 | make ISA=avx TARCH=-xCORE-AVX2,-fma | |
345 | ||
346 | WARNING: ISA is here still set to ``avx`` as the FMA intrinsics are currently not | |
347 | implemented. This might change in the future. | |
348 | ||
349 | ||
350 | Compiling for an architecture supporting AVX-512 (Skylake): :: | |
351 | ||
352 | make ISA=avx TARCH=-xCORE-AVX512 | |
353 | ||
354 | WARNING: ISA is here still set to ``avx`` as currently we have no implementation for the | |
355 | AVX512 intrinsics. This might change in the future. | |
356 | ||
357 | ||
358 | Pinning | |
359 | ------- | |
10988083 MW |
360 | |
361 | During benchmarking pinning should be used via the ``-pin`` parameter. Running | |
0095f461 | 362 | a benchmark with 10 threads pinned to the first 10 cores works like :: |
10988083 MW |
363 | |
364 | $ bin/lbmbenchk-linux-intel-release ... -t 10 -pin $(seq -s , 0 9) | |
365 | ||
0095f461 MW |
366 | |
367 | General Remarks | |
368 | --------------- | |
369 | ||
370 | Things the binary does not check or control: | |
10988083 MW |
371 | |
372 | - transparent huge pages: when allocating memory small 4 KiB pages might be | |
373 | replaced with larger ones. This is in general a good thing, but whether it | |
e3f82424 MW |
374 | actually happens depends on the system settings (check e.g. the status of |
375 | ``/sys/kernel/mm/transparent_hugepage/enabled``). | |
376 | Currently ``madvise(MADV_HUGEPAGE)`` is used for allocations which are aligned to | |
377 | a 4 KiB page, which should be the case for the lattices. | |
378 | This should result in huge pages unless THP is disabled on the machine. | |
379 | (NOTE: madvise() is used if ``HAVE_HUGE_PAGES`` is defined, which is currently | |
380 | hard-coded in ``Memory.c``). | |
10988083 MW |
381 | |
382 | - CPU/core frequency: For reproducible results the frequency of all cores | |
383 | should be fixed. | |
384 | ||
385 | - NUMA placement policy: The benchmark assumes a first touch policy, which | |
386 | means the memory will be placed at the NUMA domain the touching core is | |
387 | associated with. If a different policy is in place or the NUMA domain to be | |
388 | used is already full, memory might be allocated in a remote domain. Accesses | |
389 | to remote domains typically have a higher latency and lower bandwidth. | |
390 | ||
0095f461 | 391 | - System load: interference with other applications, especially on desktop |
10988083 MW |
392 | systems should be avoided. |
393 | ||
e3f82424 MW |
394 | - Padding: For SoA based kernels the number of (fluid) nodes is automatically |
395 | adjusted so that no cache or TLB thrashing should occur. The parameters are | |
396 | optimized for current Intel based systems. For more details look into the | |
397 | padding section. | |
10988083 MW |
398 | |
399 | - CPU dispatcher function: the compiler might add different versions of a | |
400 | function for different ISA extensions. Make sure the code you might think is | |
401 | executed is actually the code which is executed. | |
402 | ||
e3f82424 MW |
403 | Padding |
404 | ------- | |
405 | ||
406 | With correct padding, cache and TLB thrashing can be avoided. To this end the | |
407 | number of (fluid) nodes used in the data layout is artificially increased. | |
408 | ||
409 | Currently automatic padding is active for kernels which support it. It can be | |
410 | controlled via the kernel parameter (i.e. parameter after the ``--``) | |
411 | ``-pad``. Supported values are ``auto`` (default), ``no`` (to disable padding), | |
412 | or a manual padding. | |
413 | ||
414 | Automatic padding tries to avoid cache and TLB thrashing and pads for a 32 | |
415 | entry (huge pages) TLB with 8 sets and a 512 set (L2) cache. This reflects the | |
416 | parameters of current Intel based processors. | |
417 | ||
418 | Manual padding is done via a padding string and has the format | |
419 | ``mod_1+offset_1(,mod_n+offset_n)``, which specifies numbers of bytes. | |
420 | SoA data layouts can exhibit TLB thrashing. Therefore we want to distribute the | |
421 | 19 pages with one lattice (36 with two lattices) we are concurrently accessing | |
422 | over as many sets in the TLB as possible. | |
423 | This is controlled by the distance between the accessed pages, which is the | |
424 | number of (fluid) nodes in between them and can be adjusted by adding further | |
425 | (fluid) nodes. | |
426 | We want the distance d (in bytes) between two accessed pages to be e.g. | |
427 | **d % (PAGE_SIZE * TLB_SETS) = PAGE_SIZE**. | |
428 | This would distribute the pages evenly over the sets. Hereby **PAGE_SIZE * TLB_SETS** | |
429 | would be our ``mod_1`` and **PAGE_SIZE** (after the =) our ``offset_1``. | |
430 | Measurements show that using only a quarter or half of a page size as offset | |
431 | yields higher performance; this is what automatic padding does. | |
432 | On top of this padding more paddings can be added. They are just added to the | |
433 | padding string and are separated by commas. | |
434 | ||
435 | A zero modulus in the padding string has a special meaning. Here the | |
436 | corresponding offset is just added to the number of nodes. A padding string | |
437 | like ``-pad 0+16`` would add a static padding of two nodes (one node = 8 bytes). | |
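
The effect of a padding string can be illustrated with a small Python sketch.
It is not the benchmark's implementation: the helpers ``parse_padding`` and
``padded_nodes`` are hypothetical, the sketch assumes that each
``mod+offset`` pair is applied in order to the byte size of one PDF array, and
the 2 MiB huge page size in the usage example is an assumption as well.

.. code-block:: python

   NODE_BYTES = 8  # one node = one double precision value per PDF array


   def parse_padding(padding):
       """Split 'mod_1+offset_1,mod_2+offset_2,...' into (mod, offset) pairs."""
       pairs = []
       for part in padding.split(","):
           mod, offset = part.split("+")
           pairs.append((int(mod), int(offset)))
       return pairs


   def padded_nodes(n_nodes, padding):
       """Return the node count after applying the padding string.

       Assumes mod and offset are multiples of NODE_BYTES and offset < mod.
       """
       size = n_nodes * NODE_BYTES
       for mod, offset in parse_padding(padding):
           if mod == 0:
               # zero modulus: the offset is added unconditionally
               size += offset
           else:
               # grow size by the smallest amount so that size % mod == offset
               size += (offset - size) % mod
       return size // NODE_BYTES


   PAGE_SIZE = 2 * 1024 * 1024   # assumed huge page size
   TLB_SETS = 8                  # sets targeted by automatic padding

   n = padded_nodes(500 * 100 * 100, f"{PAGE_SIZE * TLB_SETS}+{PAGE_SIZE}")
   print(n, (n * NODE_BYTES) % (PAGE_SIZE * TLB_SETS) == PAGE_SIZE)

With this interpretation ``padded_nodes(100, "0+16")`` returns 102, matching
the static padding of two nodes described above.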
438 | ||
439 | ||
440 | Geometries | |
441 | ========== | |
442 | ||
0095f461 MW |
443 | TODO: supported geometries: channel, pipe, blocks, fluid |
444 | ||
445 | ||
446 | Performance Results | |
447 | =================== | |
448 | ||
449 | This section lists performance values measured on several machines for | |
450 | different kernels and geometries. | |
451 | The **RFM** column denotes the expected performance as predicted by the | |
452 | Roofline performance model [williams-2008]_. | |
453 | For performance prediction of each kernel a memory bandwidth benchmark is used | |
454 | which mimics the kernel's memory access pattern and the kernel's loop balance | |
455 | (see Kernels_ for details). | |
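
The prediction itself is a single division of measured bandwidth by loop
balance. The following Python sketch reproduces some of the **RFM** values of
the Haswell system from the bandwidths listed in its subsection. Which
bandwidth benchmark belongs to which kernel is inferred here from the numbers
and is an assumption, as are the MFLUP/s unit and the helper name
``roofline_mflups``.

.. code-block:: python

   def roofline_mflups(bandwidth_gb_s, loop_balance):
       """Memory-bound performance limit in MFLUP/s.

       bandwidth_gb_s: measured memory bandwidth in GB/s (10^9 B/s)
       loop_balance:   loop balance B_l in B/FLUP
       """
       return bandwidth_gb_s * 1e9 / loop_balance / 1e6


   # (kernel group, assumed bandwidth benchmark result in GB/s, B_l)
   haswell = [
       ("blk-push/blk-pull", 47.3, 456),   # copy-19
       ("list-push/list-pull", 44.0, 528), # update-19
       ("list-pull-split-nt", 47.1, 376),  # copy-19-nt-sl
       ("list-aa", 44.0, 340),             # update-19
   ]

   predictions = {
       name: round(roofline_mflups(bw, b_l)) for name, bw, b_l in haswell
   }
   print(predictions)

Under these assumptions the rounded results are 104, 83, 125, and 129, which
agree with the **RFM** column of the Haswell table.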
456 | ||
457 | Haswell, Intel Xeon E5-2695 v3 | |
458 | ------------------------------ | |
459 | ||
460 | - Haswell architecture, AVX2, FMA | |
461 | - 14 cores, 2.3 GHz | |
462 | - 2 x 7 cores in cluster-on-die (CoD) mode enabled | |
463 | - SMT enabled | |
464 | ||
465 | memory bandwidth: | |
466 | ||
467 | - copy-19 47.3 GB/s | |
468 | - copy-19-nt-sl 47.1 GB/s | |
469 | - update-19 44.0 GB/s | |
470 | ||
471 | geometry dimensions: 500x100x100 | |
472 | ||
473 | ========================= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ===== | |
474 | kernel pipe blocks-2 blocks-4 blocks-6 blocks-8 blocks-10 blocks-15 blocks-16 blocks-20 blocks-25 blocks-32 RFM | |
475 | ========================= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ===== | |
476 | blk-push-aos 58.82 49.85 57.34 59.90 61.37 62.17 65.30 64.00 67.54 64.46 69.69 104 | |
477 | blk-push-soa 32.32 33.46 34.02 34.64 35.06 35.04 36.31 35.44 37.20 35.14 37.95 104 | |
478 | blk-pull-aos 56.97 51.41 56.09 57.92 59.98 59.83 63.37 61.55 65.50 63.11 67.02 104 | |
479 | blk-pull-soa 49.29 46.23 47.50 51.97 51.27 49.52 55.23 53.13 54.50 49.79 57.90 104 | |
480 | aa-aos 91.35 66.14 76.80 84.76 83.63 91.36 93.46 92.62 93.91 92.25 92.93 145 | |
481 | aa-soa 75.51 65.68 70.94 71.36 73.83 75.46 74.84 79.48 83.28 77.70 82.72 145 | |
482 | aa-vec-soa 93.85 83.44 91.58 93.96 94.35 96.62 101.76 96.72 106.37 102.60 110.28 145 | |
483 | list-push-aos 80.29 80.97 80.95 81.10 81.37 82.44 81.77 81.49 80.72 81.93 80.93 83 | |
484 | list-push-soa 47.52 42.65 45.28 46.64 43.46 40.59 44.94 46.55 41.53 45.98 44.86 83 | |
485 | list-pull-aos 85.30 82.97 86.43 83.42 86.33 83.70 86.43 83.77 83.10 85.89 84.44 83 | |
486 | list-pull-soa 62.12 63.61 63.28 61.32 66.72 62.65 64.82 60.49 58.01 64.46 62.52 83 | |
487 | list-pull-split-nt-1s-soa 121.35 113.77 115.29 113.54 117.00 116.46 114.78 114.54 110.83 112.67 117.85 125 | |
488 | list-pull-split-nt-2s-soa 118.09 110.48 112.55 113.18 113.44 111.85 109.27 114.41 110.28 111.78 113.74 125 | |
489 | list-aa-aos 121.28 118.63 119.00 118.50 121.99 119.11 118.83 121.47 121.62 126.18 120.12 129 | |
490 | list-aa-soa 126.34 116.90 129.45 127.12 129.41 121.42 126.19 126.76 126.70 124.40 125.22 129 | |
491 | list-aa-ria-soa 133.68 121.82 126.04 128.46 131.15 132.25 128.78 133.50 126.69 124.40 130.37 145 | |
492 | list-aa-pv-soa 146.22 124.39 130.73 136.29 137.61 131.21 138.65 138.78 127.02 132.40 138.37 145 | |
493 | ========================= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ===== | |
494 | ||
495 | ||
496 | Broadwell, Intel Xeon E5-2630 v4 | |
497 | -------------------------------- | |
498 | ||
499 | - Broadwell architecture, AVX2, FMA | |
500 | - 10 cores, 2.2 GHz | |
501 | - SMT disabled | |
502 | ||
503 | memory bandwidth: | |
504 | ||
505 | - copy-19 48.0 GB/s | |
506 | - copy-19-nt-sl 48.2 GB/s | |
507 | - update-19 51.1 GB/s | |
508 | ||
509 | geometry dimensions: 500x100x100 | |
510 | ||
511 | ========================= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ======= | |
512 | kernel pipe blocks-2 blocks-4 blocks-6 blocks-8 blocks-10 blocks-15 blocks-16 blocks-20 blocks-25 blocks-32 RFM | |
513 | ========================= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ======= | |
514 | blk-push-aos 55.75 47.62 54.57 57.10 58.49 59.00 61.72 60.56 64.05 61.10 66.03 105 | |
515 | blk-push-soa 30.06 31.09 32.13 32.54 32.74 32.72 33.81 33.19 34.90 33.21 35.75 105 | |
516 | blk-pull-aos 53.80 48.61 53.08 54.99 56.08 56.68 59.20 58.12 61.49 58.71 63.45 105 | |
517 | blk-pull-soa 46.96 46.61 48.84 49.70 50.33 50.46 52.36 51.39 54.20 51.61 55.71 105 | |
518 | aa-aos 91.40 66.99 78.47 83.38 86.62 88.62 92.98 91.54 97.08 94.93 98.90 168 | |
519 | aa-soa 83.01 69.96 75.85 77.72 79.01 79.29 82.38 80.11 85.70 83.91 87.69 168 | |
520 | aa-vec-soa 112.03 96.52 105.32 109.76 112.55 113.82 120.55 118.37 126.30 121.37 131.94 168 | |
521 | list-push-aos 75.13 74.18 75.20 75.42 75.24 75.99 75.80 75.80 75.54 76.22 76.21 97 | |
522 | list-push-soa 40.99 38.14 39.00 38.89 38.89 39.67 39.87 39.28 39.35 40.08 40.13 97 | |
523 | list-pull-aos 82.07 82.88 83.29 83.09 83.32 83.49 82.82 82.88 83.32 82.60 82.93 97 | |
524 | list-pull-soa 62.07 60.40 61.89 61.39 62.43 60.90 60.48 62.80 62.50 61.10 60.38 97 | |
525 | list-pull-split-nt-1s-soa 125.81 120.60 121.96 122.34 122.86 123.53 123.64 123.67 125.94 124.09 123.69 128 | |
526 | list-pull-split-nt-2s-soa 122.79 117.16 118.86 119.16 119.56 119.99 120.01 120.03 122.64 120.57 120.39 128 | |
527 | list-aa-aos 128.13 127.41 129.31 129.07 129.79 129.63 129.67 129.94 129.12 128.41 129.72 150 | |
528 | list-aa-soa 141.60 139.78 141.58 142.16 141.94 141.31 142.37 142.25 142.43 141.40 142.26 150 | |
529 | list-aa-ria-soa 141.82 134.88 140.15 140.72 141.67 140.51 141.18 141.29 142.97 141.94 143.25 168 | |
530 | list-aa-pv-soa 164.79 140.95 159.24 161.78 162.40 163.04 164.69 164.38 165.11 165.75 166.09 168 | |
531 | ========================= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ======= | |
532 | ||
533 | ||
534 | Skylake, Intel Xeon Gold 6148 | |
535 | ----------------------------- | |
536 | ||
537 | - Skylake architecture, AVX2, FMA, AVX512 | |
538 | - 20 cores, 2.4 GHz | |
539 | - SMT enabled | |
540 | ||
541 | memory bandwidth: | |
542 | ||
543 | - copy-19 89.7 GB/s | |
544 | - copy-19-nt-sl 92.4 GB/s | |
545 | - update-19 93.6 GB/s | |
546 | ||
547 | geometry dimensions: 500x100x100 | |
548 | ||
549 | ||
550 | ========================= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= === | |
551 | kernel pipe blocks-2 blocks-4 blocks-6 blocks-8 blocks-10 blocks-15 blocks-16 blocks-20 blocks-25 blocks-32 RFM | |
552 | ========================= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= === | |
553 | blk-push-aos 113.01 93.99 108.98 114.65 117.87 119.47 124.95 122.46 129.29 123.87 133.01 197 | |
554 | blk-push-soa 100.21 98.87 103.63 105.56 107.02 107.27 111.61 109.83 116.16 110.51 110.29 197 | |
555 | blk-pull-aos 118.45 102.54 114.12 117.82 122.69 124.31 130.58 127.85 135.72 129.65 139.94 197 | |
556 | blk-pull-soa 82.60 83.36 87.13 88.39 88.84 88.96 92.48 90.93 95.79 91.92 98.64 197 | |
557 | aa-aos 171.32 125.43 147.73 157.70 163.35 167.25 175.39 174.20 182.54 173.67 187.76 308 | |
558 | aa-soa 180.85 152.39 165.84 152.59 171.90 175.76 184.94 182.34 189.43 180.30 193.54 308 | |
559 | aa-vec-soa 208.03 181.51 195.86 203.41 209.08 212.34 224.05 219.49 234.31 225.92 245.22 308 | |
560 | list-push-aos 158.81 164.67 162.93 163.05 165.22 164.31 164.66 160.78 164.07 165.19 164.06 177 | |
561 | list-push-soa 134.60 110.44 110.17 132.01 132.95 133.46 134.37 134.33 135.12 134.91 137.87 177 | |
562 | list-pull-aos 169.61 170.03 170.89 170.90 171.20 171.60 172.09 171.95 169.48 172.08 171.02 177 | |
563 | list-pull-soa 120.50 116.73 118.62 118.00 120.99 118.15 117.17 121.41 120.83 120.00 118.74 177 | |
564 | list-pull-split-nt-1s-soa 225.59 224.18 225.10 226.34 226.01 230.37 227.50 228.42 227.39 231.65 227.35 246 | |
565 | list-pull-split-nt-2s-soa 219.20 214.63 217.61 218.13 219.07 221.01 219.88 220.09 220.62 221.68 220.58 246 | |
566 | list-aa-aos 241.39 239.27 239.53 242.56 242.46 243.00 242.91 242.46 241.24 242.96 241.52 275 | |
567 | list-aa-soa 273.73 268.49 268.48 271.79 275.29 274.56 277.18 272.67 274.21 275.24 278.21 275 | |
568 | list-aa-ria-soa 288.42 261.89 273.26 284.84 283.88 288.29 290.72 289.81 293.36 290.75 292.93 308 | |
569 | list-aa-pv-soa 303.35 267.21 289.18 294.96 294.36 298.16 300.45 301.71 302.37 302.88 304.46 308 | |
570 | ========================= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= === | |
e3f82424 MW |
571 | |
572 | Licence | |
573 | ======= | |
574 | ||
575 | The Lattice Boltzmann Benchmark Kernels are licensed under GPLv3. | |
576 | ||
4e91c4b6 MW |
577 | |
578 | Acknowledgements | |
579 | ================ | |
580 | ||
581 | This work was funded by BMBF, grant no. 01IH15003A (project SKAMPY). | |
582 | ||
583 | This work was funded by KONWHIR project OMI4PAPS. | |
584 | ||
585 | ||
0095f461 MW |
586 | Bibliography |
587 | ============ | |
588 | ||
589 | .. [ginzburg-2008] | |
590 | I. Ginzburg, F. Verhaeghe, and D. d'Humières. | |
591 | Two-relaxation-time lattice Boltzmann scheme: About parametrization, velocity, pressure and mixed boundary conditions. | |
592 | Commun. Comput. Phys., 3(2):427-478, 2008. | |
593 | ||
594 | .. [williams-2008] | |
595 | S. Williams, A. Waterman, and D. Patterson. | |
596 | Roofline: an insightful visual performance model for multicore architectures. | |
597 | Commun. ACM, 52(4):65-76, Apr 2009. doi:10.1145/1498765.1498785 | |
598 | ||
4e91c4b6 | 599 | |
10988083 MW |
600 | .. |datetime| date:: %Y-%m-%d %H:%M |
601 | ||
602 | Document was generated at |datetime|. | |
603 |