| Copyright
| Markus Wittmann, 2016-2018
| RRZE, University of Erlangen-Nuremberg, Germany
| markus.wittmann -at- fau.de or hpc -at- rrze.fau.de
|
| Viktor Haag, 2016
| LSS, University of Erlangen-Nuremberg, Germany
|
| Michael Hussnaetter, 2017-2018
| University of Erlangen-Nuremberg, Germany
| michael.hussnaetter -at- fau.de
|
| This file is part of the Lattice Boltzmann Benchmark Kernels (LbmBenchKernels).
|
| LbmBenchKernels is free software: you can redistribute it and/or modify
| it under the terms of the GNU General Public License as published by
| the Free Software Foundation, either version 3 of the License, or
| (at your option) any later version.
|
| LbmBenchKernels is distributed in the hope that it will be useful,
| but WITHOUT ANY WARRANTY; without even the implied warranty of
| MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
| GNU General Public License for more details.
|
| You should have received a copy of the GNU General Public License
| along with LbmBenchKernels. If not, see <http://www.gnu.org/licenses/>.

.. title:: LBM Benchmark Kernels Documentation


**LBM Benchmark Kernels Documentation**

.. sectnum::
.. contents::

Introduction
============

The lattice Boltzmann (LBM) benchmark kernels are a collection of LBM kernel
implementations.

**AS SUCH THE LBM BENCHMARK KERNELS ARE NOT A FULLY EQUIPPED CFD SOLVER AND SOLELY
SERVE THE PURPOSE OF STUDYING POSSIBLE PERFORMANCE OPTIMIZATIONS AND/OR
EXPERIMENTS.**

Currently all kernels utilize a D3Q19 discretization and the
two-relaxation-time (TRT) collision operator [ginzburg-2008]_.
All operations are carried out in double or single precision arithmetic.

Compilation
===========

The benchmark framework currently supports only Linux systems and the GCC and
Intel compilers. Every other configuration probably requires adjustments inside
the code and the makefiles. Furthermore, some code might be platform or at least
POSIX specific.

The benchmark can be built via ``make`` from the ``src`` subdirectory. This will
generate one binary which hosts all implemented benchmark kernels.

Binaries are located under the ``bin`` subdirectory and will have different names
depending on compiler and build configuration.

Compilation can target debug or release builds. Both build types can be
combined with verification, which increases the runtime and hence is not
suited for benchmarking.


Debug and Verification
----------------------

::

  make BUILD=debug BENCHMARK=off

Running ``make`` with ``BUILD=debug`` builds the debug version of
the benchmark kernels, where no optimizations are performed, line numbers and
debug symbols are included, and ``DEBUG`` is defined. The resulting
binary will be found in the ``bin`` subdirectory and named
``lbmbenchk-linux-<compiler>-debug``.

Specifying ``BENCHMARK=off`` turns on verification
(``VERIFICATION=on``), statistics (``STATISTICS=on``), and VTK output
(``VTK_OUTPUT=on``).

Please note that the generated binary will therefore
exhibit poor performance.


Release and Verification
------------------------

Verification with the debug builds can be extremely slow. Hence verification
capabilities can also be built into release builds: ::

  make BENCHMARK=off


Benchmarking
------------

To generate a binary for benchmarking run make with ::

  make

By default ``BENCHMARK=on`` and ``BUILD=release`` are set, where
``BUILD=release`` turns optimizations on and ``BENCHMARK=on`` disables
verification, statistics, and VTK output.

See the Options Summary below for a description of further options,
e.g. ``TARCH``, as well as the Benchmarking section.

Compilers
---------

Currently only the GCC and Intel compilers under Linux are supported. The
configuration is chosen via ``CONFIG=linux-gcc`` or
``CONFIG=linux-intel``.


Floating Point Precision
------------------------

By default double precision data types are used for storing PDFs and floating
point constants. Furthermore, this is the default for the intrinsics kernels.
With the ``PRECISION=sp`` variable this can be changed to single precision. ::

  make PRECISION=sp   # build for single precision kernels

  make PRECISION=dp   # build for double precision kernels (default)


Cleaning
--------

For each configuration and build (debug/release) a subdirectory under the
``src/obj`` directory is created where the dependency and object files are
stored.
With ::

  make CONFIG=... BUILD=... clean

a specific combination is selected and cleaned, whereas with ::

  make clean-all

all object and dependency files are deleted.


Options Summary
---------------

Options that can be specified when building the suite with make:

============= ======================= ============ ==========================================================
name          values                  default      description
============= ======================= ============ ==========================================================
BENCHMARK     on, off                 on           If enabled, disables VERIFICATION, STATISTICS, VTK_OUTPUT. If disabled enables the three former options.
BUILD         debug, release          release      debug: no optimization, debug symbols, DEBUG defined. release: optimizations enabled.
CONFIG        linux-gcc, linux-intel  linux-intel  Select GCC or Intel compiler.
ISA           avx512, avx, sse        avx          Determines which ISA extension is used for macro definitions of the intrinsics. This is *not* the architecture the compiler generates code for.
OPENMP        on, off                 on           OpenMP, i.e. threading support.
PRECISION     dp, sp                  dp           Floating point precision used for data type, arithmetic, and intrinsics.
STATISTICS    on, off                 off          View statistics, like density etc, during simulation.
TARCH         --                      --           Via TARCH the architecture the compiler generates code for can be overridden. The value depends on the chosen compiler.
VERIFICATION  on, off                 off          Turn verification on/off.
VTK_OUTPUT    on, off                 off          Enable/Disable VTK file output.
============= ======================= ============ ==========================================================

**Suboptions for ``ISA=avx512``**

============================== ======== ======== ======================
name                           values   default  description
============================== ======== ======== ======================
ADJ_LIST_MEM_TYPE              HBM      -        Determines memory location of adjacency list array, DRAM or HBM.
PDF_MEM_TYPE                   HBM      -        Determines memory location of PDF array, DRAM or HBM.
SOFTWARE_PREFETCH_LOOKAHEAD_L1 int >= 0 0        Software prefetch lookahead of elements into L1 cache, value is multiplied by vector size (``VSIZE``).
SOFTWARE_PREFETCH_LOOKAHEAD_L2 int >= 0 0        Software prefetch lookahead of elements into L2 cache, value is multiplied by vector size (``VSIZE``).
============================== ======== ======== ======================

Please note that these options require AVX-512 PF support of the target processor.
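
When several configurations are built from a script, the options of the first
table above can be validated before ``make`` is called. The following is only a
minimal sketch: the helper ``make_command`` is hypothetical and not part of the
benchmark suite, it covers only the options of the first table, and ``TARCH``
is passed through unchecked because its value depends on the compiler.

```python
import shlex

# Option names and allowed values as listed in the first options table.
ALLOWED = {
    "BENCHMARK": {"on", "off"},
    "BUILD": {"debug", "release"},
    "CONFIG": {"linux-gcc", "linux-intel"},
    "ISA": {"avx512", "avx", "sse"},
    "OPENMP": {"on", "off"},
    "PRECISION": {"dp", "sp"},
    "STATISTICS": {"on", "off"},
    "VERIFICATION": {"on", "off"},
    "VTK_OUTPUT": {"on", "off"},
}


def make_command(**options):
    """Return a ``make`` command line for the given build options.

    Hypothetical helper: TARCH is accepted as is, every other option must
    appear in ALLOWED with one of its listed values.
    """
    args = ["make"]
    for name, value in sorted(options.items()):
        if name != "TARCH" and value not in ALLOWED.get(name, ()):
            raise ValueError(f"unsupported option or value: {name}={value!r}")
        args.append(f"{name}={value}")
    # Quote each argument so the string can be pasted into a shell.
    return " ".join(shlex.quote(arg) for arg in args)


print(make_command(CONFIG="linux-gcc", ISA="avx", PRECISION="sp"))
```

The returned string is meant to be run from the ``src`` subdirectory, like the
``make`` calls shown in this section.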

Invocation
==========

Running the binary prints, along with the GPL license header, a line like the following: ::

  LBM Benchmark Kernels 0.1, compiled Jul 5 2017 21:59:22, type: verification

if verification was enabled during compilation or ::

  LBM Benchmark Kernels 0.1, compiled Jul 5 2017 21:59:22, type: benchmark

if verification was disabled during compilation.

Command Line Parameters
-----------------------

Running the binary with ``-h`` lists all available parameters: ::

  Usage:
  ./lbmbenchk -list
  ./lbmbenchk
      [-dims XxYxZ] [-geometry box|channel|pipe|blocks[-<block size>]] [-iterations <iterations>] [-lattice-dump-ascii]
      [-rho-in <density>] [-rho-out <density] [-omega <omega>] [-kernel <kernel>]
      [-periodic-x]
      [-t <number of threads>]
      [-pin core{,core}*]
      [-verify]
      -- <kernel specific parameters>

  -list           List available kernels.

  -dims XxYxZ     Specify geometry dimensions.

  -geometry blocks-<block size>
                  Geometetry with blocks of size <block size> regularily layout out.


If an option is specified multiple times the last one overrides previous ones.
This also holds true for ``-verify``, which sets geometry dimensions,
iterations, etc., which can be overridden afterward, e.g.: ::

  $ bin/lbmbenchk-linux-intel-release-dp -verify -dims 32x32x32

Kernel specific parameters can be obtained by selecting the specific kernel
and passing ``-h`` as parameter: ::

  $ bin/lbmbenchk-linux-intel-release-dp -kernel kernel-name -- -h
  ...
  Kernel parameters:
  [-blk <n>] [-blk-[xyz] <n>]


A list of all available kernels can be obtained via ``-list``: ::

  $ ../bin/lbmbenchk-linux-gcc-debug-dp -list
  Lattice Boltzmann Benchmark Kernels (LbmBenchKernels) Copyright (C) 2016, 2017 LSS, RRZE
  This program comes with ABSOLUTELY NO WARRANTY; for details see LICENSE.
  This is free software, and you are welcome to redistribute it under certain conditions.

  LBM Benchmark Kernels 0.1, compiled Jul 5 2017 21:59:22, type: verification
  Available kernels to benchmark:
  list-aa-pv-soa
  list-aa-ria-soa
  list-aa-soa
  list-aa-aos
  list-pull-split-nt-1s-soa
  list-pull-split-nt-2s-soa
  list-push-soa
  list-push-aos
  list-pull-soa
  list-pull-aos
  push-soa
  push-aos
  pull-soa
  pull-aos
  blk-push-soa
  blk-push-aos
  blk-pull-soa
  blk-pull-aos

Kernels
-------

The following list briefly describes the available kernels:

- **push-soa/push-aos/pull-soa/pull-aos**:
  Unoptimized kernels (but stream/collide are already fused) using two grids as
  source and destination. Implement push/pull semantics as well as structure of
  arrays (soa) or array of structures (aos) layout.

- **blk-push-soa/blk-push-aos/blk-pull-soa/blk-pull-aos**:
  The same as the unoptimized kernels without the blk prefix, except that they support
  spatial blocking, i.e. loop blocking of the three loops used to iterate over
  the lattice. Here manual work sharing for OpenMP is used.

- **aa-aos/aa-soa**:
  Straightforward implementation of the AA pattern on the full array with blocking support.
  Manual work sharing for OpenMP is used. The domain is partitioned only along the x dimension.

- **aa-vec-soa/aa-vec-sl-soa**:
  Optimized AA kernel with intrinsics on the full array. aa-vec-sl-soa uses only
  one loop for iterating over the lattice instead of three nested ones.

- **list-push-soa/list-push-aos/list-pull-soa/list-pull-aos**:
  The same as the unoptimized kernels without the list prefix, but for indirect addressing.
  Here only a 1D vector is used to store the fluid nodes, omitting the
  obstacles. An adjacency list is used to recover the neighborhood associations.

- **list-pull-split-nt-1s-soa/list-pull-split-nt-2s-soa**:
  Optimized variant of list-pull-soa. Chunks of the lattice are processed at
  once. Postcollision values are written back via nontemporal stores in 18 (1s)
  or 9 (2s) loops.

- **list-aa-aos/list-aa-soa**:
  Unoptimized implementation of the AA pattern for the 1D vector with adjacency
  list. Array of structures (aos) and structure of arrays (soa)
  data layouts are supported.

- **list-aa-ria-soa**:
  Implementation of the AA pattern with intrinsics for the 1D vector with adjacency
  list. Furthermore it contains a vectorized even time step and run length
  coding to reduce the loop balance of the odd time step.

- **list-aa-pv-soa**:
  All optimizations of list-aa-ria-soa, plus partial vectorization
  of the odd time step.


Note that all array of structures (aos) kernels might require blocking
(depending on the domain size) to reach the performance of their structure of
arrays (soa) counterparts.
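
The difference between the two data layouts can be illustrated with the usual
index computations for a flat PDF array. This is a generic sketch of the
textbook AoS and SoA mappings; the function names are hypothetical and the
indexing used inside the actual kernels (including padding) may differ.

```python
N_PDFS = 19  # D3Q19: 19 PDFs per lattice node


def index_aos(node, direction):
    """Array of structures: the 19 PDFs of one node are stored contiguously."""
    return node * N_PDFS + direction


def index_soa(node, direction, n_nodes):
    """Structure of arrays: one contiguous array of n_nodes entries per direction."""
    return direction * n_nodes + node


n_nodes = 1000
# Distance in elements between the same direction of two neighboring nodes.
print("AoS node stride:", index_aos(1, 0) - index_aos(0, 0))
print("SoA node stride:", index_soa(1, 0, n_nodes) - index_soa(0, 0, n_nodes))
```

With SoA, consecutive nodes of one direction are adjacent in memory, which
gives 19 concurrent unit-stride streams. With AoS, all PDFs of a node share
neighboring cache lines.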

The following table summarizes the properties of the kernels. Here **D** means
direct addressing, i.e. full array, **I** means indirect addressing, i.e. 1D
vector with adjacency list, **x** means supported, whereas **--** means unsupported.
The loop balance B_l is computed for the D3Q19 model with **double precision** floating
point for PDFs (8 bytes) and 4 byte integers for the index (adjacency list).
As list-aa-ria-soa and list-aa-pv-soa support run length coding their effective
loop balance depends on the geometry. The effective loop balance is printed
during each run.


====================== =========== =========== ===== ======== ======== ============
kernel name            prop. step  data layout addr. parallel blocking B_l [B/FLUP]
====================== =========== =========== ===== ======== ======== ============
push-soa               OS          SoA         D     x        --       456
push-aos               OS          AoS         D     x        --       456
pull-soa               OS          SoA         D     x        --       456
pull-aos               OS          AoS         D     x        --       456
blk-push-soa           OS          SoA         D     x        x        456
blk-push-aos           OS          AoS         D     x        x        456
blk-pull-soa           OS          SoA         D     x        x        456
blk-pull-aos           OS          AoS         D     x        x        456
aa-soa                 AA          SoA         D     x        x        304
aa-aos                 AA          AoS         D     x        x        304
aa-vec-soa             AA          SoA         D     x        x        304
aa-vec-sl-soa          AA          SoA         D     x        x        304
list-push-soa          OS          SoA         I     x        x        528
list-push-aos          OS          AoS         I     x        x        528
list-pull-soa          OS          SoA         I     x        x        528
list-pull-aos          OS          AoS         I     x        x        528
list-pull-split-nt-1s  OS          SoA         I     x        x        376
list-pull-split-nt-2s  OS          SoA         I     x        x        376
list-aa-soa            AA          SoA         I     x        x        340
list-aa-aos            AA          AoS         I     x        x        340
list-aa-ria-soa        AA          SoA         I     x        x        304-342
list-aa-pv-soa         AA          SoA         I     x        x        304-342
====================== =========== =========== ===== ======== ======== ============
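
The fixed loop balance values in the table can be reproduced with a simple
traffic count per fluid lattice node update (FLUP). The decomposition below is
an assumption that happens to match the table, not a statement taken from the
kernels' source: 19 PDF loads and 19 PDF stores, 19 additional write-allocate
transfers when a separate destination grid is written with regular stores, and
18 adjacency list entries for indirect addressing, which the AA pattern is
assumed to read only in every second time step. The geometry-dependent range
of the run length coded kernels is not modeled.

```python
N_PDFS = 19       # D3Q19
N_NEIGHBORS = 18  # assumed adjacency list entries per node (all but the center)


def loop_balance(pdf_bytes=8, index_bytes=4, in_place=False, indirect=False,
                 nt_stores=False, index_every_nth_step=1):
    """Return the loop balance in bytes per FLUP under the stated assumptions."""
    # Stores into a second grid trigger a write allocate unless the update is
    # done in place (AA pattern) or nontemporal stores bypass the cache.
    write_allocates = 0 if (in_place or nt_stores) else N_PDFS
    balance = (N_PDFS + N_PDFS + write_allocates) * pdf_bytes
    if indirect:
        balance += N_NEIGHBORS * index_bytes / index_every_nth_step
    return balance


print("push/pull (direct):   ", loop_balance())
print("aa (direct):          ", loop_balance(in_place=True))
print("list-push/pull:       ", loop_balance(indirect=True))
print("list-pull-split-nt:   ", loop_balance(indirect=True, nt_stores=True))
print("list-aa:              ", loop_balance(in_place=True, indirect=True,
                                             index_every_nth_step=2))
```

Passing ``pdf_bytes=4`` gives the corresponding estimate for single precision
PDFs under the same assumptions.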

Benchmarking
============

Correct benchmarking is a nontrivial task. Whenever benchmark results are to be
created make sure that:

- the binary was compiled with ``BENCHMARK=on`` (default if not overridden),
- the binary was compiled with ``BUILD=release`` (default if not overridden),
- the correct ISA for macros (i.e. intrinsics) is used, selected via ``ISA``, and
- ``TARCH`` specifies the architecture the compiler generates code for.

Intel Compiler
--------------

For the Intel compiler one can specify depending on the target ISA extension:

- SSE: ``TARCH=-xSSE4.2``
- AVX: ``TARCH=-xAVX``
- AVX2 and FMA: ``TARCH=-xCORE-AVX2,-fma``
- AVX512: ``TARCH=-xCORE-AVX512``
- KNL: ``TARCH=-xMIC-AVX512``

Compiling for an architecture supporting AVX (Sandy Bridge, Ivy Bridge): ::

  make ISA=avx TARCH=-xAVX


Compiling for an architecture supporting AVX2 (Haswell, Broadwell): ::

  make ISA=avx TARCH=-xCORE-AVX2,-fma

WARNING: ISA is still set to ``avx`` here, as the FMA intrinsics are currently
not implemented. This might change in the future.


.. TODO: add isa=avx512 and add docu for knl

.. TODO: no prefetching if AVX-512 PF is not supported

Compiling for an architecture supporting AVX-512 (Skylake): ::

  make ISA=avx512 TARCH=-xCORE-AVX512

Please note that for the AVX512 gather kernels software prefetching for the
gather instructions is disabled by default.
To enable it set ``SOFTWARE_PREFETCH_LOOKAHEAD_L1`` and/or
``SOFTWARE_PREFETCH_LOOKAHEAD_L2`` to a value greater than ``0`` during
compilation. Note that this requires AVX-512 PF support from the target
processor.

Compiling for MIC architecture KNL supporting AVX-512 and AVX-512 PF::

  make ISA=avx512 TARCH=-xMIC-AVX512

or optionally with software prefetch enabled::

  make ISA=avx512 TARCH=-xMIC-AVX512 SOFTWARE_PREFETCH_LOOKAHEAD_L1=<value> SOFTWARE_PREFETCH_LOOKAHEAD_L2=<value>



Pinning
-------

During benchmarking pinning should be used via the ``-pin`` parameter. Running
a benchmark with 10 threads pinned to the first 10 cores works like ::

  $ bin/lbmbenchk-linux-intel-release-dp ... -t 10 -pin $(seq -s , 0 9)


General Remarks
---------------

Things the binary does not check or control:

- transparent huge pages: when allocating memory small 4 KiB pages might be
  replaced with larger ones. This is in general a good thing, but whether this
  is really the case depends on the system settings (check e.g. the status of
  ``/sys/kernel/mm/transparent_hugepage/enabled``).
  Currently ``madvise(MADV_HUGEPAGE)`` is used for allocations which are aligned to
  a 4 KiB page, which should be the case for the lattices.
  This should result in huge pages unless THP is disabled on the machine.
  (NOTE: madvise() is used if ``HAVE_HUGE_PAGES`` is defined, which is currently
  hard coded in ``Memory.c``).

- CPU/core frequency: For reproducible results the frequency of all cores
  should be fixed.

- NUMA placement policy: The benchmark assumes a first touch policy, which
  means the memory will be placed at the NUMA domain the touching core is
  associated with. If a different policy is in place or the NUMA domain to be
  used is already full, memory might be allocated in a remote domain. Accesses
  to remote domains typically have a higher latency and lower bandwidth.

- System load: interference with other applications, especially on desktop
  systems, should be avoided.

- Padding: For SoA based kernels the number of (fluid) nodes is automatically
  adjusted so that no cache or TLB thrashing should occur. The parameters are
  optimized for current Intel based systems. For more details look into the
  padding section.

- CPU dispatcher function: the compiler might add different versions of a
  function for different ISA extensions. Make sure the code you think is
  executed is actually the code that is executed.
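
The transparent huge page setting mentioned in the first item can be checked
before a run. The sketch below relies on the Linux convention that the active
mode in ``/sys/kernel/mm/transparent_hugepage/enabled`` is the one enclosed in
square brackets, e.g. ``always [madvise] never``; the helper name is
hypothetical.

```python
import re
from pathlib import Path

THP_ENABLED = Path("/sys/kernel/mm/transparent_hugepage/enabled")


def active_thp_mode(text):
    """Return the bracketed mode from the sysfs file content, or None."""
    match = re.search(r"\[(\w+)\]", text)
    return match.group(1) if match else None


if THP_ENABLED.exists():
    print("THP mode:", active_thp_mode(THP_ENABLED.read_text()))
else:
    print("THP status file not found, THP is probably not available")
```

Since the benchmark uses ``madvise(MADV_HUGEPAGE)``, the modes ``always`` and
``madvise`` should both allow huge pages for the lattices, whereas ``never``
disables them.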

Padding
-------

With correct padding cache and TLB thrashing can be avoided. Therefore the
number of (fluid) nodes used in the data layout is artificially increased.

Currently automatic padding is active for kernels which support it. It can be
controlled via the kernel parameter (i.e. parameter after the ``--``)
``-pad``. Supported values are ``auto`` (default), ``no`` (to disable padding),
or a manual padding.

Automatic padding tries to avoid cache and TLB thrashing and pads for a 32
entry (huge pages) TLB with 8 sets and a 512 set (L2) cache. This reflects the
parameters of current Intel based processors.

Manual padding is done via a padding string and has the format
``mod_1+offset_1(,mod_n+offset_n)``, which specifies numbers of bytes.
SoA data layouts can exhibit TLB thrashing. Therefore we want to distribute the
19 pages with one lattice (36 with two lattices) we are concurrently accessing
over as many sets in the TLB as possible.
This is controlled by the distance between the accessed pages, which is the
number of (fluid) nodes in between them and can be adjusted by adding further
(fluid) nodes.
We want the distance d (in bytes) between two accessed pages to be e.g.
**d % (PAGE_SIZE * TLB_SETS) = PAGE_SIZE**.
This would distribute the pages evenly over the sets. Here **PAGE_SIZE * TLB_SETS**
would be our ``mod_1`` and **PAGE_SIZE** (after the =) our ``offset_1``.
Measurements show that with only a quarter or half of a page size as offset
higher performance is achieved, which is what automatic padding does.
On top of this padding more paddings can be added. They are just added to the
padding string and are separated by commas.

A zero modulus in the padding string has a special meaning. Here the
corresponding offset is just added to the number of nodes. A padding string
like ``-pad 0+16`` would add a static padding of two nodes (one node = 8 bytes).
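
The effect of a padding string on the number of nodes can be sketched as
follows. This is an illustration of the rule **d % mod = offset** only, not
the implementation used by the benchmark: the function names are hypothetical,
one node is assumed to occupy 8 bytes, the paddings are assumed to be applied
one after another in the order given, and the 4 KiB page size in the example
is an arbitrary choice.

```python
def parse_padding(text):
    """Parse 'mod_1+offset_1,mod_2+offset_2' into [(mod, offset), ...] in bytes."""
    pairs = []
    for item in text.split(","):
        modulus, offset = item.split("+")
        pairs.append((int(modulus), int(offset)))
    return pairs


def pad_nodes(n_nodes, padding, bytes_per_node=8):
    """Increase n_nodes until (n_nodes * bytes_per_node) % mod == offset holds."""
    for modulus, offset in padding:
        if modulus % bytes_per_node or offset % bytes_per_node:
            raise ValueError("modulus and offset must be multiples of the node size")
        if modulus == 0:
            # Zero modulus: the offset is a static padding added as is.
            n_nodes += offset // bytes_per_node
            continue
        if offset >= modulus:
            raise ValueError("offset must be smaller than the modulus")
        missing_bytes = (offset - n_nodes * bytes_per_node) % modulus
        n_nodes += missing_bytes // bytes_per_node
    return n_nodes


PAGE_SIZE = 4096  # assumption for this example
TLB_SETS = 8
n = pad_nodes(500 * 100 * 100, [(PAGE_SIZE * TLB_SETS, PAGE_SIZE)])
print(n, (n * 8) % (PAGE_SIZE * TLB_SETS))
print(pad_nodes(1000, parse_padding("0+16")))
```

The first call pads the node count of a 500x100x100 domain so that the
distance in bytes satisfies the modulus condition. The second call shows the
static padding of two nodes described above.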


Geometries
==========

TODO: supported geometries: channel, pipe, blocks, fluid


Performance Results
===================

This section lists performance values measured on several machines for
different kernels and geometries and **double precision** floating point data/arithmetic.
The **RFM** column denotes the expected performance as predicted by the
Roofline performance model [williams-2008]_.
For the performance prediction of each kernel a memory bandwidth benchmark is used
which mimics the kernel's memory access pattern, together with the kernel's loop balance
(see the `Kernels`_ section for details).
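
For a memory bound kernel the Roofline prediction reduces to the attainable
memory bandwidth divided by the loop balance. The following sketch assumes
that the kernel is limited by memory bandwidth only (the in-core limit of the
Roofline model is ignored), that 1 GB/s equals 10^9 bytes per second, and that
the chosen bandwidth benchmark is the appropriate one for the kernel; the
function name is hypothetical.

```python
def roofline_mflups(bandwidth_gb_s, loop_balance_b_per_flup):
    """Predicted performance in MFLUP/s for a purely memory bound kernel."""
    bytes_per_second = bandwidth_gb_s * 1e9
    return bytes_per_second / loop_balance_b_per_flup / 1e6


# Example: 32.7 GB/s (copy-19 on the Ivy Bridge system listed below) and the
# loop balance of 456 B/FLUP of the push/pull kernels.
print(round(roofline_mflups(32.7, 456), 1))
```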

Machine Specifications
----------------------

**Ivy Bridge, Intel Xeon E5-2660 v2**

- Ivy Bridge architecture, AVX
- 10 cores, 2.2 GHz
- SMT enabled
- memory bandwidth:

  - copy-19 32.7 GB/s
  - copy-19-nt-sl 35.6 GB/s
  - update-19 37.4 GB/s

**Haswell, Intel Xeon E5-2695 v3**

- Haswell architecture, AVX2, FMA
- 14 cores, 2.3 GHz
- 2 x 7 cores in cluster-on-die (CoD) mode enabled
- SMT enabled
- memory bandwidth:

  - copy-19 47.3 GB/s
  - copy-19-nt-sl 47.1 GB/s
  - update-19 44.0 GB/s


**Broadwell, Intel Xeon E5-2630 v4**

- Broadwell architecture, AVX2, FMA
- 10 cores, 2.2 GHz
- SMT disabled
- memory bandwidth:

  - copy-19 48.0 GB/s
  - copy-nt-sl-19 48.2 GB/s
  - update-19 51.1 GB/s

**Skylake, Intel Xeon Gold 6148**

NOTE: currently we only use AVX2 intrinsics.

- Skylake server architecture, AVX2, AVX512, 2 FMA units
- 20 cores, 2.4 GHz
- SMT enabled
- memory bandwidth:

  - copy-19 89.7 GB/s
  - copy-19-nt-sl 92.4 GB/s
  - update-19 93.6 GB/s

**Zen, AMD EPYC 7451**

- Zen architecture, AVX2, FMA
- 24 cores, 2.3 GHz
- SMT enabled
- memory bandwidth:

  - copy-19 111.9 GB/s
  - copy-19-nt-sl 111.7 GB/s
  - update-19 109.2 GB/s

**Zen, AMD Ryzen 7 1700X**

- Zen architecture, AVX2, FMA
- 8 cores, 3.4 GHz
- SMT enabled
- memory bandwidth:

  - copy-19 27.2 GB/s
  - copy-19-nt-sl 27.1 GB/s
  - update-19 26.1 GB/s

Single Socket Results
---------------------

- Geometry dimensions are for all measurements 500x100x100 nodes.
- Note the **different scaling on the y axis** of the plots!

.. |perf_emmy_dp| image:: images/benchmark-emmy-dp.png
   :scale: 50 %
.. |perf_emmy_sp| image:: images/benchmark-emmy-sp.png
   :scale: 50 %
.. |perf_hasep1_dp| image:: images/benchmark-hasep1-dp.png
   :scale: 50 %
.. |perf_hasep1_sp| image:: images/benchmark-hasep1-sp.png
   :scale: 50 %
.. |perf_meggie_dp| image:: images/benchmark-meggie-dp.png
   :scale: 50 %
.. |perf_meggie_sp| image:: images/benchmark-meggie-sp.png
   :scale: 50 %
.. |perf_skylakesp2_dp| image:: images/benchmark-skylakesp2-dp.png
   :scale: 50 %
.. |perf_skylakesp2_sp| image:: images/benchmark-skylakesp2-sp.png
   :scale: 50 %
.. |perf_summitridge1_dp| image:: images/benchmark-summitridge1-dp.png
   :scale: 50 %
.. |perf_summitridge1_sp| image:: images/benchmark-summitridge1-sp.png
   :scale: 50 %
.. |perf_naples1_dp| image:: images/benchmark-naples1-dp.png
   :scale: 50 %
.. |perf_naples1_sp| image:: images/benchmark-naples1-sp.png
   :scale: 50 %

.. list-table::

  * - Ivy Bridge, Intel Xeon E5-2660 v2, Double Precision
  * - |perf_emmy_dp|
  * - Ivy Bridge, Intel Xeon E5-2660 v2, Single Precision
  * - |perf_emmy_sp|
  * - Haswell, Intel Xeon E5-2695 v3, Double Precision
  * - |perf_hasep1_dp|
  * - Haswell, Intel Xeon E5-2695 v3, Single Precision
  * - |perf_hasep1_sp|
  * - Broadwell, Intel Xeon E5-2630 v4, Double Precision
  * - |perf_meggie_dp|
  * - Broadwell, Intel Xeon E5-2630 v4, Single Precision
  * - |perf_meggie_sp|
  * - Skylake, Intel Xeon Gold 6148, Double Precision, **NOTE: currently we only use AVX2 intrinsics.**
  * - |perf_skylakesp2_dp|
  * - Skylake, Intel Xeon Gold 6148, Single Precision, **NOTE: currently we only use AVX2 intrinsics.**
  * - |perf_skylakesp2_sp|
  * - Zen, AMD Ryzen 7 1700X, Double Precision
  * - |perf_summitridge1_dp|
  * - Zen, AMD Ryzen 7 1700X, Single Precision
  * - |perf_summitridge1_sp|
  * - Zen, AMD EPYC 7451, Double Precision
  * - |perf_naples1_dp|
  * - Zen, AMD EPYC 7451, Single Precision
  * - |perf_naples1_sp|


Licence
=======

The Lattice Boltzmann Benchmark Kernels are licensed under GPLv3.


Acknowledgements
================

This work was funded by BMBF, grant no. 01IH15003A (project SKAMPY).

This work was funded by KONWHIR project OMI4PAPS.


Bibliography
============

.. [ginzburg-2008]
   I. Ginzburg, F. Verhaeghe, and D. d'Humières.
   Two-relaxation-time lattice Boltzmann scheme: About parametrization, velocity, pressure and mixed boundary conditions.
   Commun. Comput. Phys., 3(2):427-478, 2008.

.. [williams-2008]
   S. Williams, A. Waterman, and D. Patterson.
   Roofline: an insightful visual performance model for multicore architectures.
   Commun. ACM, 52(4):65-76, Apr 2009. doi:10.1145/1498765.1498785


.. |datetime| date:: %Y-%m-%d %H:%M

Document was generated at |datetime|.
