.. # --------------------------------------------------------------------------
   #
   # Copyright
   #   Markus Wittmann, 2016-2017
   #   RRZE, University of Erlangen-Nuremberg, Germany
   #   markus.wittmann -at- fau.de or hpc -at- rrze.fau.de
   #
   #   Viktor Haag, 2016
   #   LSS, University of Erlangen-Nuremberg, Germany
   #
   # This file is part of the Lattice Boltzmann Benchmark Kernels (LbmBenchKernels).
   #
   # LbmBenchKernels is free software: you can redistribute it and/or modify
   # it under the terms of the GNU General Public License as published by
   # the Free Software Foundation, either version 3 of the License, or
   # (at your option) any later version.
   #
   # LbmBenchKernels is distributed in the hope that it will be useful,
   # but WITHOUT ANY WARRANTY; without even the implied warranty of
   # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
   # GNU General Public License for more details.
   #
   # You should have received a copy of the GNU General Public License
   # along with LbmBenchKernels. If not, see <http://www.gnu.org/licenses/>.
   #
   # --------------------------------------------------------------------------

.. title:: LBM Benchmark Kernels Documentation


===================================
LBM Benchmark Kernels Documentation
===================================

.. sectnum::
.. contents::

Introduction
============

The lattice Boltzmann method (LBM) benchmark kernels are a collection of LBM
kernel implementations.

**AS SUCH THE LBM BENCHMARK KERNELS ARE NOT A FULLY EQUIPPED CFD SOLVER AND
SOLELY SERVE THE PURPOSE OF STUDYING POSSIBLE PERFORMANCE OPTIMIZATIONS AND/OR
EXPERIMENTS.**

Currently all kernels use a D3Q19 discretization and the
two-relaxation-time (TRT) collision operator [ginzburg-2008]_.
All operations are carried out in double-precision arithmetic.

Compilation
===========

The benchmark framework currently supports only Linux systems and the GCC and
Intel compilers. Any other configuration probably requires adjustments to
the code and the makefiles. Furthermore, some code might be platform specific
or at least POSIX specific.

The benchmark can be built via ``make`` from the ``src`` subdirectory. This
generates one binary which hosts all implemented benchmark kernels.

Binaries are located in the ``bin`` subdirectory and have different names
depending on compiler and build configuration.

Compilation can target debug or release builds. Verification can be enabled
for both build types, but it increases the runtime and hence is not suited for
benchmarking.


Debug and Verification
----------------------

::

  make BUILD=debug BENCHMARK=off

Running ``make`` with ``BUILD=debug`` builds the debug version of
the benchmark kernels: no optimizations are performed, line numbers and
debug symbols are included, and ``DEBUG`` is defined. The resulting
binary is placed in the ``bin`` subdirectory and named
``lbmbenchk-linux-<compiler>-debug``.

Specifying ``BENCHMARK=off`` enables verification
(``VERIFICATION=on``), statistics (``STATISTICS=on``), and VTK output
(``VTK_OUTPUT=on``).

Please note that the generated binary will therefore exhibit poor performance.


Release and Verification
------------------------

Verification with debug builds can be extremely slow. Hence verification
capabilities can also be built into release builds: ::

  make BENCHMARK=off


Benchmarking
------------

To generate a binary for benchmarking run ``make`` without further options: ::

  make

By default ``BENCHMARK=on`` and ``BUILD=release`` are set, where
``BUILD=release`` turns optimizations on and ``BENCHMARK=on`` disables
verification, statistics, and VTK output.

See the Options Summary below for further options that can be applied,
e.g. ``TARCH``, and see the Benchmarking section.

Compilers
---------

Currently only the GCC and Intel compilers under Linux are supported. The
compiler is chosen via ``CONFIG=linux-gcc`` or ``CONFIG=linux-intel``.


Cleaning
--------

For each configuration and build (debug/release) a subdirectory under the
``src/obj`` directory is created where the dependency and object files are
stored.
With ::

  make CONFIG=... BUILD=... clean

a specific combination is selected and cleaned, whereas with ::

  make clean-all

all object and dependency files are deleted.


Options Summary
---------------

Options that can be specified when building the suite with ``make``:

============= ======================= ============ ==========================================================
name          values                  default      description
============= ======================= ============ ==========================================================
BENCHMARK     on, off                 on           If enabled, disables VERIFICATION, STATISTICS, VTK_OUTPUT. If disabled, enables these three options.
BUILD         debug, release          release      debug: no optimization, debug symbols, DEBUG defined. release: optimizations enabled.
CONFIG        linux-gcc, linux-intel  linux-intel  Select GCC or Intel compiler.
ISA           avx, sse                avx          Determines which ISA extension is used for macro definitions of the intrinsics. This is *not* the architecture the compiler generates code for.
OPENMP        on, off                 on           OpenMP, i.e. threading support.
STATISTICS    on, off                 off          View statistics, like density etc., during simulation.
TARCH         --                      --           Via TARCH the architecture the compiler generates code for can be overridden. The value depends on the chosen compiler.
VERIFICATION  on, off                 off          Turn verification on/off.
VTK_OUTPUT    on, off                 off          Enable/disable VTK file output.
============= ======================= ============ ==========================================================
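
As a minimal sketch (not part of the benchmark suite), the options above can be
assembled into a ``make`` command line from a script. The helper name
``make_command`` is hypothetical; the option names and values are taken from
the table, and ``TARCH`` is passed through unchecked because its value depends
on the compiler.

.. code-block:: python

  # Allowed values as listed in the Options Summary table.
  # TARCH is free-form (compiler dependent), so it is not validated here.
  ALLOWED = {
      "BENCHMARK": {"on", "off"},
      "BUILD": {"debug", "release"},
      "CONFIG": {"linux-gcc", "linux-intel"},
      "ISA": {"avx", "sse"},
      "OPENMP": {"on", "off"},
      "STATISTICS": {"on", "off"},
      "VERIFICATION": {"on", "off"},
      "VTK_OUTPUT": {"on", "off"},
  }


  def make_command(**options):
      """Return the argv list for make (hypothetical helper)."""
      argv = ["make"]
      for name, value in options.items():
          if name != "TARCH" and name not in ALLOWED:
              raise ValueError(f"unknown option: {name}")
          if name in ALLOWED and value not in ALLOWED[name]:
              raise ValueError(f"invalid value for {name}: {value}")
          argv.append(f"{name}={value}")
      return argv

Calling ``make_command(BUILD="debug", BENCHMARK="off")`` yields the argument
list for the debug and verification build shown earlier. The list could be
handed to ``subprocess.run`` with ``cwd`` set to the ``src`` subdirectory.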

Invocation
==========

Running the binary prints, along with the GPL licence header, a line like the following: ::

  LBM Benchmark Kernels 0.1, compiled Jul 5 2017 21:59:22, type: verification

if verification was enabled during compilation or ::

  LBM Benchmark Kernels 0.1, compiled Jul 5 2017 21:59:22, type: benchmark

if verification was disabled during compilation.

Command Line Parameters
-----------------------

Running the binary with ``-h`` lists all available parameters: ::

  Usage:
  ./lbmbenchk -list
  ./lbmbenchk
      [-dims XxYyZ] [-geometry box|channel|pipe|blocks[-<block size>]] [-iterations <iterations>] [-lattice-dump-ascii]
      [-rho-in <density>] [-rho-out <density] [-omega <omega>] [-kernel <kernel>]
      [-periodic-x]
      [-t <number of threads>]
      [-pin core{,core}*]
      [-verify]
      -- <kernel specific parameters>

  -list           List available kernels.

  -dims XxYxZ     Specify geometry dimensions.

  -geometry blocks-<block size>
                  Geometetry with blocks of size <block size> regularily layout out.


If an option is specified multiple times, the last occurrence overrides the
previous ones. This also holds for ``-verify``, which sets geometry dimensions,
iterations, etc. These can be overridden afterwards, e.g.: ::

  $ bin/lbmbenchk-linux-intel-release -verify -dims 32x32x32

Kernel-specific parameters can be listed by selecting the specific kernel
and passing ``-h`` as a kernel parameter: ::

  $ bin/lbmbenchk-linux-intel-release -kernel kernel-name -- -h
  ...
  Kernel parameters:
  [-blk <n>] [-blk-[xyz] <n>]


A list of all available kernels can be obtained via ``-list``: ::

  $ ../bin/lbmbenchk-linux-gcc-debug -list
  Lattice Boltzmann Benchmark Kernels (LbmBenchKernels) Copyright (C) 2016, 2017 LSS, RRZE
  This program comes with ABSOLUTELY NO WARRANTY; for details see LICENSE.
  This is free software, and you are welcome to redistribute it under certain conditions.

  LBM Benchmark Kernels 0.1, compiled Jul 5 2017 21:59:22, type: verification
  Available kernels to benchmark:
     list-aa-pv-soa
     list-aa-ria-soa
     list-aa-soa
     list-aa-aos
     list-pull-split-nt-1s-soa
     list-pull-split-nt-2s-soa
     list-push-soa
     list-push-aos
     list-pull-soa
     list-pull-aos
     push-soa
     push-aos
     pull-soa
     pull-aos
     blk-push-soa
     blk-push-aos
     blk-pull-soa
     blk-pull-aos

Kernels
-------

The following list briefly describes the available kernels:

- push-soa/push-aos/pull-soa/pull-aos:
  Unoptimized kernels (but stream/collide are already fused) using two grids as
  source and destination. They implement push/pull semantics as well as
  structure of arrays (soa) or array of structures (aos) layout.

- blk-push-soa/blk-push-aos/blk-pull-soa/blk-pull-aos:
  The same as the unoptimized kernels without the blk prefix, except that they
  support spatial blocking, i.e. loop blocking of the three loops used to
  iterate over the lattice. Here manual work sharing for OpenMP is used.

- list-push-soa/list-push-aos/list-pull-soa/list-pull-aos:
  The same as the unoptimized kernels without the list prefix, but using
  indirect addressing. Here only a 1D vector is used to store the fluid nodes,
  omitting the obstacles. An adjacency list is used to recover the
  neighborhood associations.

- list-pull-split-nt-1s-soa/list-pull-split-nt-2s-soa:
  Optimized variant of list-pull-soa. Chunks of the lattice are processed at
  once. Post-collision values are written back via nontemporal stores in 18 (1s)
  or 9 (2s) loops.

- list-aa-aos/list-aa-soa:
  Unoptimized implementation of the AA pattern for the 1D vector with adjacency
  list. Array of structures (aos) and structure of arrays (soa) data layouts
  are supported.

- list-aa-ria-soa:
  Implementation of the AA pattern with intrinsics for the 1D vector with
  adjacency list. Furthermore it contains a vectorized even time step and run
  length coding to reduce the loop balance of the odd time step.

- list-aa-pv-soa:
  All optimizations of list-aa-ria-soa, plus partial vectorization of the
  odd time step.


Note that all array of structures (aos) kernels might require blocking
(depending on the domain size) to reach the performance of their structure of
arrays (soa) counterparts.

The following table summarizes the properties of the kernels. Here **D** means
direct addressing, i.e. full array, **I** means indirect addressing, i.e. 1D
vector with adjacency list, **x** means supported, whereas **--** means unsupported.
The loop balance B_l is computed for the D3Q19 model with double-precision
floating point for PDFs (8 bytes) and 4-byte integers for the index (adjacency list).
As list-aa-ria-soa and list-aa-pv-soa support run length coding, their effective
loop balance depends on the geometry. The effective loop balance is printed
during each run.


====================== =========== =========== ===== ======== ======== ============
kernel name            prop. step  data layout addr. parallel blocking B_l [B/FLUP]
====================== =========== =========== ===== ======== ======== ============
push-soa               OS          SoA         D     x        --       456
push-aos               OS          AoS         D     x        --       456
pull-soa               OS          SoA         D     x        --       456
pull-aos               OS          AoS         D     x        --       456
blk-push-soa           OS          SoA         D     x        x        456
blk-push-aos           OS          AoS         D     x        x        456
blk-pull-soa           OS          SoA         D     x        x        456
blk-pull-aos           OS          AoS         D     x        x        456
list-push-soa          OS          SoA         I     x        x        528
list-push-aos          OS          AoS         I     x        x        528
list-pull-soa          OS          SoA         I     x        x        528
list-pull-aos          OS          AoS         I     x        x        528
list-pull-split-nt-1s  OS          SoA         I     x        x        376
list-pull-split-nt-2s  OS          SoA         I     x        x        376
list-aa-soa            AA          SoA         I     x        x        340
list-aa-aos            AA          AoS         I     x        x        340
list-aa-ria-soa        AA          SoA         I     x        x        304-342
list-aa-pv-soa         AA          SoA         I     x        x        304-342
====================== =========== =========== ===== ======== ======== ============
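
The tabulated loop balances are consistent with a simple byte count per fluid
lattice update (FLUP). The following sketch reproduces them. The decomposition
is an inference from the table values, not taken from the benchmark sources:
it assumes 19 PDFs are loaded and stored per update, that ordinary stores to a
second grid incur a write-allocate transfer, that nontemporal stores and the
in-place AA pattern do not, and that 18 adjacency list entries are read per
update (the AA pattern only in every second time step).

.. code-block:: python

  PDFS = 19              # D3Q19: 19 distribution functions per node
  PDF_BYTES = 8          # double precision
  IDX_BYTES = 4          # 4-byte adjacency list entries
  NEIGHBORS = PDFS - 1   # assumption: the rest direction needs no index


  def loop_balance(write_allocate, index_reads):
      """Bytes transferred per fluid lattice update (B/FLUP)."""
      loads = PDFS * PDF_BYTES
      stores = PDFS * PDF_BYTES
      allocate = PDFS * PDF_BYTES if write_allocate else 0
      return loads + stores + allocate + index_reads * IDX_BYTES


  BALANCES = {
      "push/pull and blk kernels": loop_balance(True, 0),
      "list-push/list-pull": loop_balance(True, NEIGHBORS),
      "list-pull-split-nt": loop_balance(False, NEIGHBORS),
      # index needed in the odd time step only, averaged over two steps
      "list-aa": loop_balance(False, NEIGHBORS // 2),
  }

Under these assumptions the lower bound of 304 B/FLUP given for
list-aa-ria-soa and list-aa-pv-soa corresponds to ``loop_balance(False, 0)``,
i.e. run length coding removing all index reads.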

Benchmarking
============

Correct benchmarking is a nontrivial task. Whenever benchmark results are to be
created, make sure that:

- the binary was compiled with ``BENCHMARK=on`` (default if not overridden),
- the binary was compiled with ``BUILD=release`` (default if not overridden),
- the correct ISA for macros is selected via ``ISA``, and
- ``TARCH`` is used to specify the architecture the compiler generates code for.

Intel Compiler
--------------

For the Intel compiler, ``TARCH`` can be set depending on the target ISA extension:

- AVX: ``TARCH=-xAVX``
- AVX2 and FMA: ``TARCH=-xCORE-AVX2,-fma``
- AVX512: ``TARCH=-xCORE-AVX512``
- KNL: ``TARCH=-xMIC-AVX512``

Compiling for an architecture supporting AVX (Sandy Bridge, Ivy Bridge): ::

  make ISA=avx TARCH=-xAVX


Compiling for an architecture supporting AVX2 (Haswell, Broadwell): ::

  make ISA=avx TARCH=-xCORE-AVX2,-fma

WARNING: ISA is still set to ``avx`` here, as the FMA intrinsics are currently
not implemented. This might change in the future.


Compiling for an architecture supporting AVX-512 (Skylake): ::

  make ISA=avx TARCH=-xCORE-AVX512

WARNING: ISA is still set to ``avx`` here, as there is currently no
implementation for the AVX-512 intrinsics. This might change in the future.


Pinning
-------

During benchmarking, threads should be pinned via the ``-pin`` parameter.
Running a benchmark with 10 threads and pinning them to the first 10 cores
works like this: ::

  $ bin/lbmbenchk-linux-intel-release ... -t 10 -pin $(seq -s , 0 9)
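
When runs are driven from a script, the core list and the command line can be
generated as in the following sketch. The helper names ``pin_list`` and
``benchmark_command`` are hypothetical; the parameters ``-kernel``, ``-t``,
``-pin`` and the ``--`` separator are the ones from the usage text. The sketch
assumes that consecutive core IDs starting at 0 are the desired pinning, which
depends on the core numbering of the machine.

.. code-block:: python

  def pin_list(first_core, count):
      """Comma-separated core list, the same text `seq -s , 0 9` prints."""
      cores = range(first_core, first_core + count)
      return ",".join(str(core) for core in cores)


  def benchmark_command(binary, kernel, threads, kernel_args=()):
      """argv for a pinned run; kernel-specific parameters follow '--'."""
      argv = [binary, "-kernel", kernel, "-t", str(threads),
              "-pin", pin_list(0, threads)]
      if kernel_args:
          argv += ["--", *kernel_args]
      return argv

For example, ``benchmark_command("bin/lbmbenchk-linux-intel-release",
"list-aa-pv-soa", 10)`` pins ten threads to cores 0 to 9.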


General Remarks
---------------

Things the binary does not check or control:

- Transparent huge pages: when allocating memory, small 4 KiB pages might be
  replaced with larger ones. This is in general a good thing, but whether it
  actually happens depends on the system settings (check e.g. the status of
  ``/sys/kernel/mm/transparent_hugepage/enabled``).
  Currently ``madvise(MADV_HUGEPAGE)`` is used for allocations which are aligned to
  a 4 KiB page, which should be the case for the lattices.
  This should result in huge pages unless THP is disabled on the machine.
  (NOTE: ``madvise()`` is used if ``HAVE_HUGE_PAGES`` is defined, which is
  currently hard coded in ``Memory.c``.)

- CPU/core frequency: for reproducible results the frequency of all cores
  should be fixed.

- NUMA placement policy: the benchmark assumes a first touch policy, which
  means the memory will be placed in the NUMA domain the touching core is
  associated with. If a different policy is in place or the NUMA domain to be
  used is already full, memory might be allocated in a remote domain. Accesses
  to remote domains typically have a higher latency and lower bandwidth.

- System load: interference with other applications, especially on desktop
  systems, should be avoided.

- Padding: for SoA based kernels the number of (fluid) nodes is automatically
  adjusted so that no cache or TLB thrashing should occur. The parameters are
  optimized for current Intel based systems. For more details see the
  Padding section.

- CPU dispatcher function: the compiler might add different versions of a
  function for different ISA extensions. Make sure the code you think is
  executed is actually the code which is executed.
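
The transparent huge page setting mentioned in the first item can be checked
before a run. The following is a minimal sketch, not part of the benchmark:
the function names are hypothetical, and it relies on the Linux sysfs
convention that the file lists all modes and marks the active one with square
brackets, e.g. ``always [madvise] never``.

.. code-block:: python

  THP_ENABLED = "/sys/kernel/mm/transparent_hugepage/enabled"


  def active_thp_mode(text):
      """Return the bracketed (active) mode from the sysfs file content."""
      for token in text.split():
          if token.startswith("[") and token.endswith("]"):
              return token[1:-1]
      raise ValueError(f"no active mode found in: {text!r}")


  def read_thp_mode(path=THP_ENABLED):
      """Read the active THP mode of the running system (Linux only)."""
      with open(path) as handle:
          return active_thp_mode(handle.read())

With the ``madvise(MADV_HUGEPAGE)`` call described above, both ``always`` and
``madvise`` should allow huge pages for the lattices, whereas ``never``
disables them.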

Padding
-------

With correct padding cache and TLB thrashing can be avoided. To this end the
number of (fluid) nodes used in the data layout is artificially increased.

Currently automatic padding is active for kernels which support it. It can be
controlled via the kernel parameter (i.e. a parameter after the ``--``)
``-pad``. Supported values are ``auto`` (default), ``no`` (to disable padding),
or a manual padding string.

Automatic padding tries to avoid cache and TLB thrashing and pads for a 32
entry (huge pages) TLB with 8 sets and a 512 set (L2) cache. This reflects the
parameters of current Intel based processors.

Manual padding is done via a padding string of the format
``mod_1+offset_1(,mod_n+offset_n)``, where all values are numbers of bytes.
SoA data layouts can exhibit TLB thrashing. Therefore we want to distribute the
19 pages with one lattice (36 with two lattices) we are concurrently accessing
over as many sets in the TLB as possible.
This is controlled by the distance between the accessed pages, which is the
number of (fluid) nodes in between them and can be adjusted by adding further
(fluid) nodes.
We want the distance d (in bytes) between two accessed pages to be e.g.
**d % (PAGE_SIZE * TLB_SETS) = PAGE_SIZE**.
This would distribute the pages evenly over the sets. Here **PAGE_SIZE * TLB_SETS**
is our ``mod_1`` and **PAGE_SIZE** (after the =) our ``offset_1``.
Measurements show that with only a quarter or half of a page size as offset
higher performance is achieved, which is what automatic padding does.
On top of this padding more paddings can be added. They are appended to the
padding string and separated by commas.

A zero modulus in the padding string has a special meaning. Here the
corresponding offset is just added to the number of nodes. A padding string
like ``-pad 0+16`` would add a static padding of two nodes (one node = 8 bytes).
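
The effect of a padding string can be illustrated with the following sketch.
It is an interpretation of the description above, not the code used by the
benchmark: the function names are hypothetical, the rules are assumed to be
applied one after another, and each rule adds the smallest number of nodes so
that the size in bytes fulfills ``size % mod = offset``.

.. code-block:: python

  NODE_BYTES = 8  # one node = 8 bytes, as stated above


  def parse_padding(padding):
      """Parse 'mod_1+offset_1,mod_2+offset_2' into [(mod, offset), ...]."""
      rules = []
      for item in padding.split(","):
          mod, offset = item.split("+")
          rules.append((int(mod), int(offset)))
      return rules


  def padded_nodes(nodes, padding):
      """Apply each padding rule in order and return the padded node count."""
      for mod, offset in parse_padding(padding):
          if mod == 0:
              pad_bytes = offset  # zero modulus: static padding
          else:
              # smallest pad with (nodes * NODE_BYTES + pad) % mod == offset
              pad_bytes = (offset - nodes * NODE_BYTES) % mod
          if pad_bytes % NODE_BYTES:
              raise ValueError(f"rule {mod}+{offset} needs a partial node")
          nodes += pad_bytes // NODE_BYTES
      return nodes

With this sketch ``padded_nodes(n, "0+16")`` returns ``n + 2``, matching the
static padding example above. Note that a later rule can undo the alignment
established by an earlier one, so the order of the rules matters here.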


Geometries
==========

TODO: supported geometries: channel, pipe, blocks, fluid


Performance Results
===================

This section lists performance values measured on several machines for
different kernels and geometries.
The **RFM** column denotes the expected performance as predicted by the
Roofline performance model [williams-2008]_.
For the performance prediction of each kernel a memory bandwidth benchmark is
used which mimics the kernel's memory access pattern, together with the
kernel's loop balance (see the `Kernels`_ section for details).

Haswell, Intel Xeon E5-2695 v3
------------------------------

- Haswell architecture, AVX2, FMA
- 14 cores, 2.3 GHz
- 2 x 7 cores, cluster-on-die (CoD) mode enabled
- SMT enabled

memory bandwidth:

- copy-19 47.3 GB/s
- copy-19-nt-sl 47.1 GB/s
- update-19 44.0 GB/s

geometry dimensions: 500x100x100

========================= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= =====
kernel                    pipe      blocks-2  blocks-4  blocks-6  blocks-8  blocks-10 blocks-15 blocks-16 blocks-20 blocks-25 blocks-32 RFM
========================= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= =====
blk-push-aos              58.82     49.85     57.34     59.90     61.37     62.17     65.30     64.00     67.54     64.46     69.69     104
blk-push-soa              32.32     33.46     34.02     34.64     35.06     35.04     36.31     35.44     37.20     35.14     37.95     104
blk-pull-aos              56.97     51.41     56.09     57.92     59.98     59.83     63.37     61.55     65.50     63.11     67.02     104
blk-pull-soa              49.29     46.23     47.50     51.97     51.27     49.52     55.23     53.13     54.50     49.79     57.90     104
aa-aos                    91.35     66.14     76.80     84.76     83.63     91.36     93.46     92.62     93.91     92.25     92.93     145
aa-soa                    75.51     65.68     70.94     71.36     73.83     75.46     74.84     79.48     83.28     77.70     82.72     145
aa-vec-soa                93.85     83.44     91.58     93.96     94.35     96.62     101.76    96.72     106.37    102.60    110.28    145
list-push-aos             80.29     80.97     80.95     81.10     81.37     82.44     81.77     81.49     80.72     81.93     80.93     83
list-push-soa             47.52     42.65     45.28     46.64     43.46     40.59     44.94     46.55     41.53     45.98     44.86     83
list-pull-aos             85.30     82.97     86.43     83.42     86.33     83.70     86.43     83.77     83.10     85.89     84.44     83
list-pull-soa             62.12     63.61     63.28     61.32     66.72     62.65     64.82     60.49     58.01     64.46     62.52     83
list-pull-split-nt-1s-soa 121.35    113.77    115.29    113.54    117.00    116.46    114.78    114.54    110.83    112.67    117.85    125
list-pull-split-nt-2s-soa 118.09    110.48    112.55    113.18    113.44    111.85    109.27    114.41    110.28    111.78    113.74    125
list-aa-aos               121.28    118.63    119.00    118.50    121.99    119.11    118.83    121.47    121.62    126.18    120.12    129
list-aa-soa               126.34    116.90    129.45    127.12    129.41    121.42    126.19    126.76    126.70    124.40    125.22    129
list-aa-ria-soa           133.68    121.82    126.04    128.46    131.15    132.25    128.78    133.50    126.69    124.40    130.37    145
list-aa-pv-soa            146.22    124.39    130.73    136.29    137.61    131.21    138.65    138.78    127.02    132.40    138.37    145
========================= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= =====
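
The **RFM** column can be reproduced from the memory bandwidths listed above
and the loop balances from the kernel table: for a memory-bound kernel the
Roofline limit is the bandwidth divided by the loop balance. The sketch below
does this for the Haswell system. Which bandwidth benchmark belongs to which
kernel is an assumption inferred from the tabulated values, and the resulting
unit (MFLUP/s) is likewise inferred, since the tables do not state it.

.. code-block:: python

  # Measured memory bandwidths of the Haswell system in GB/s.
  BANDWIDTH = {"copy-19": 47.3, "copy-19-nt-sl": 47.1, "update-19": 44.0}

  # kernel group -> (bandwidth benchmark, loop balance in B/FLUP);
  # the pairing is an assumption inferred from the RFM column.
  KERNELS = {
      "blk-push/blk-pull": ("copy-19", 456),
      "list-push/list-pull": ("update-19", 528),
      "list-pull-split-nt": ("copy-19-nt-sl", 376),
      "list-aa-aos/list-aa-soa": ("update-19", 340),
      "list-aa-ria/list-aa-pv": ("update-19", 304),
  }


  def roofline_mflups(bandwidth_gbs, loop_balance):
      """Memory-bound performance limit in MFLUP/s."""
      return bandwidth_gbs * 1000.0 / loop_balance


  def predictions():
      """Rounded Roofline prediction per kernel group."""
      return {name: round(roofline_mflups(BANDWIDTH[bench], balance))
              for name, (bench, balance) in KERNELS.items()}

For example, 47.3 GB/s divided by 456 B/FLUP gives about 103.7 MFLUP/s, which
rounds to the 104 listed for the blk kernels.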


Broadwell, Intel Xeon E5-2630 v4
--------------------------------

- Broadwell architecture, AVX2, FMA
- 10 cores, 2.2 GHz
- SMT disabled

memory bandwidth:

- copy-19 48.0 GB/s
- copy-nt-sl-19 48.2 GB/s
- update-19 51.1 GB/s

geometry dimensions: 500x100x100

========================= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= =======
kernel                    pipe      blocks-2  blocks-4  blocks-6  blocks-8  blocks-10 blocks-15 blocks-16 blocks-20 blocks-25 blocks-32 RFM
========================= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= =======
blk-push-aos              55.75     47.62     54.57     57.10     58.49     59.00     61.72     60.56     64.05     61.10     66.03     105
blk-push-soa              30.06     31.09     32.13     32.54     32.74     32.72     33.81     33.19     34.90     33.21     35.75     105
blk-pull-aos              53.80     48.61     53.08     54.99     56.08     56.68     59.20     58.12     61.49     58.71     63.45     105
blk-pull-soa              46.96     46.61     48.84     49.70     50.33     50.46     52.36     51.39     54.20     51.61     55.71     105
aa-aos                    91.40     66.99     78.47     83.38     86.62     88.62     92.98     91.54     97.08     94.93     98.90     168
aa-soa                    83.01     69.96     75.85     77.72     79.01     79.29     82.38     80.11     85.70     83.91     87.69     168
aa-vec-soa                112.03    96.52     105.32    109.76    112.55    113.82    120.55    118.37    126.30    121.37    131.94    168
list-push-aos             75.13     74.18     75.20     75.42     75.24     75.99     75.80     75.80     75.54     76.22     76.21     97
list-push-soa             40.99     38.14     39.00     38.89     38.89     39.67     39.87     39.28     39.35     40.08     40.13     97
list-pull-aos             82.07     82.88     83.29     83.09     83.32     83.49     82.82     82.88     83.32     82.60     82.93     97
list-pull-soa             62.07     60.40     61.89     61.39     62.43     60.90     60.48     62.80     62.50     61.10     60.38     97
list-pull-split-nt-1s-soa 125.81    120.60    121.96    122.34    122.86    123.53    123.64    123.67    125.94    124.09    123.69    128
list-pull-split-nt-2s-soa 122.79    117.16    118.86    119.16    119.56    119.99    120.01    120.03    122.64    120.57    120.39    128
list-aa-aos               128.13    127.41    129.31    129.07    129.79    129.63    129.67    129.94    129.12    128.41    129.72    150
list-aa-soa               141.60    139.78    141.58    142.16    141.94    141.31    142.37    142.25    142.43    141.40    142.26    150
list-aa-ria-soa           141.82    134.88    140.15    140.72    141.67    140.51    141.18    141.29    142.97    141.94    143.25    168
list-aa-pv-soa            164.79    140.95    159.24    161.78    162.40    163.04    164.69    164.38    165.11    165.75    166.09    168
========================= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= =======


Skylake, Intel Xeon Gold 6148
-----------------------------

- Skylake architecture, AVX2, FMA, AVX512
- 20 cores, 2.4 GHz
- SMT enabled

memory bandwidth:

- copy-19 89.7 GB/s
- copy-19-nt-sl 92.4 GB/s
- update-19 93.6 GB/s

geometry dimensions: 500x100x100


========================= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ===
kernel                    pipe      blocks-2  blocks-4  blocks-6  blocks-8  blocks-10 blocks-15 blocks-16 blocks-20 blocks-25 blocks-32 RFM
========================= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ===
blk-push-aos              113.01    93.99     108.98    114.65    117.87    119.47    124.95    122.46    129.29    123.87    133.01    197
blk-push-soa              100.21    98.87     103.63    105.56    107.02    107.27    111.61    109.83    116.16    110.51    110.29    197
blk-pull-aos              118.45    102.54    114.12    117.82    122.69    124.31    130.58    127.85    135.72    129.65    139.94    197
blk-pull-soa              82.60     83.36     87.13     88.39     88.84     88.96     92.48     90.93     95.79     91.92     98.64     197
aa-aos                    171.32    125.43    147.73    157.70    163.35    167.25    175.39    174.20    182.54    173.67    187.76    308
aa-soa                    180.85    152.39    165.84    152.59    171.90    175.76    184.94    182.34    189.43    180.30    193.54    308
aa-vec-soa                208.03    181.51    195.86    203.41    209.08    212.34    224.05    219.49    234.31    225.92    245.22    308
list-push-aos             158.81    164.67    162.93    163.05    165.22    164.31    164.66    160.78    164.07    165.19    164.06    177
list-push-soa             134.60    110.44    110.17    132.01    132.95    133.46    134.37    134.33    135.12    134.91    137.87    177
list-pull-aos             169.61    170.03    170.89    170.90    171.20    171.60    172.09    171.95    169.48    172.08    171.02    177
list-pull-soa             120.50    116.73    118.62    118.00    120.99    118.15    117.17    121.41    120.83    120.00    118.74    177
list-pull-split-nt-1s-soa 225.59    224.18    225.10    226.34    226.01    230.37    227.50    228.42    227.39    231.65    227.35    246
list-pull-split-nt-2s-soa 219.20    214.63    217.61    218.13    219.07    221.01    219.88    220.09    220.62    221.68    220.58    246
list-aa-aos               241.39    239.27    239.53    242.56    242.46    243.00    242.91    242.46    241.24    242.96    241.52    275
list-aa-soa               273.73    268.49    268.48    271.79    275.29    274.56    277.18    272.67    274.21    275.24    278.21    275
list-aa-ria-soa           288.42    261.89    273.26    284.84    283.88    288.29    290.72    289.81    293.36    290.75    292.93    308
list-aa-pv-soa            303.35    267.21    289.18    294.96    294.36    298.16    300.45    301.71    302.37    302.88    304.46    308
========================= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ===

Licence
=======

The Lattice Boltzmann Benchmark Kernels are licensed under GPLv3.


Acknowledgements
================

This work was funded by BMBF, grant no. 01IH15003A (project SKAMPY).

This work was funded by KONWIHR project OMI4PAPS.


Bibliography
============

.. [ginzburg-2008]
   I. Ginzburg, F. Verhaeghe, and D. d'Humières.
   Two-relaxation-time lattice Boltzmann scheme: About parametrization, velocity, pressure and mixed boundary conditions.
   Commun. Comput. Phys., 3(2):427-478, 2008.

.. [williams-2008]
   S. Williams, A. Waterman, and D. Patterson.
   Roofline: an insightful visual performance model for multicore architectures.
   Commun. ACM, 52(4):65-76, Apr 2009. doi:10.1145/1498765.1498785


.. |datetime| date:: %Y-%m-%d %H:%M

Document was generated at |datetime|.