| Copyright
| Markus Wittmann, 2016-2018
| RRZE, University of Erlangen-Nuremberg, Germany
| markus.wittmann -at- fau.de or hpc -at- rrze.fau.de
|
| Viktor Haag, 2016
| LSS, University of Erlangen-Nuremberg, Germany
|
| This file is part of the Lattice Boltzmann Benchmark Kernels (LbmBenchKernels).
|
| LbmBenchKernels is free software: you can redistribute it and/or modify
| it under the terms of the GNU General Public License as published by
| the Free Software Foundation, either version 3 of the License, or
| (at your option) any later version.
|
| LbmBenchKernels is distributed in the hope that it will be useful,
| but WITHOUT ANY WARRANTY; without even the implied warranty of
| MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
| GNU General Public License for more details.
|
| You should have received a copy of the GNU General Public License
| along with LbmBenchKernels. If not, see <http://www.gnu.org/licenses/>.

.. title:: LBM Benchmark Kernels Documentation


**LBM Benchmark Kernels Documentation**

.. sectnum::
.. contents::

Introduction
============

The lattice Boltzmann method (LBM) benchmark kernels are a collection of LBM kernel
implementations.

**AS SUCH THE LBM BENCHMARK KERNELS ARE NOT A FULLY EQUIPPED CFD SOLVER AND SOLELY
SERVE THE PURPOSE OF STUDYING POSSIBLE PERFORMANCE OPTIMIZATIONS AND/OR
EXPERIMENTS.**

Currently all kernels utilize a D3Q19 discretization and the
two-relaxation-time (TRT) collision operator [ginzburg-2008]_.
All operations are carried out in double or single precision arithmetic.

Compilation
===========

The benchmark framework currently supports only Linux systems and the GCC and
Intel compilers. Every other configuration probably requires adjustments inside
the code and the makefiles. Furthermore, some code might be platform or at least
POSIX specific.

The benchmark can be built via ``make`` from the ``src`` subdirectory. This will
generate one binary which hosts all implemented benchmark kernels.

Binaries are located under the ``bin`` subdirectory and will have different names
depending on compiler and build configuration.

Compilation can target debug or release builds. Both build types can be
combined with verification, which increases the runtime and hence is not
suited for benchmarking.


Debug and Verification
----------------------

::

  make BUILD=debug BENCHMARK=off

Running ``make`` with ``BUILD=debug`` builds the debug version of
the benchmark kernels, where no optimizations are performed, line numbers and
debug symbols are included, and ``DEBUG`` is defined. The resulting
binary will be found in the ``bin`` subdirectory and named
``lbmbenchk-linux-<compiler>-debug``.

Specifying ``BENCHMARK=off`` turns on verification
(``VERIFICATION=on``), statistics (``STATISTICS=on``), and VTK output
(``VTK_OUTPUT=on``).

Please note that the generated binary will therefore
exhibit poor performance.


Release and Verification
------------------------

Verification with the debug builds can be extremely slow. Hence verification
capabilities can also be built into release builds: ::

  make BENCHMARK=off


Benchmarking
------------

To generate a binary for benchmarking run make with ::

  make

By default ``BENCHMARK=on`` and ``BUILD=release`` are set, where
``BUILD=release`` turns optimizations on and ``BENCHMARK=on`` disables
verification, statistics, and VTK output.

See the Options Summary below for a further description of options which can be
applied, e.g. ``TARCH``, as well as the Benchmarking section.

Compilers
---------

Currently only the GCC and Intel compilers under Linux are supported. The
configuration can be chosen via ``CONFIG=linux-gcc`` or
``CONFIG=linux-intel``.


Floating Point Precision
------------------------

By default double precision data types are used for storing PDFs and floating
point constants. Furthermore, this is the default for the intrinsics kernels.
With the ``PRECISION=sp`` variable this can be changed to single precision. ::

  make PRECISION=sp   # build for single precision kernels

  make PRECISION=dp   # build for double precision kernels (default)


Cleaning
--------

For each configuration and build (debug/release) a subdirectory under the
``src/obj`` directory is created where the dependency and object files are
stored.
With ::

  make CONFIG=... BUILD=... clean

a specific combination is selected and cleaned, whereas with ::

  make clean-all

all object and dependency files are deleted.


Options Summary
---------------

Options that can be specified when building the suite with make:

============= ======================= ============ ==========================================================
name          values                  default      description
============= ======================= ============ ==========================================================
BENCHMARK     on, off                 on           If enabled, disables VERIFICATION, STATISTICS, and VTK_OUTPUT. If disabled, enables these three options.
BUILD         debug, release          release      debug: no optimization, debug symbols, DEBUG defined. release: optimizations enabled.
CONFIG        linux-gcc, linux-intel  linux-intel  Select GCC or Intel compiler.
ISA           avx, sse                avx          Determines which ISA extension is used for macro definitions of the intrinsics. This is *not* the architecture the compiler generates code for.
OPENMP        on, off                 on           OpenMP, i.e. threading support.
PRECISION     dp, sp                  dp           Floating point precision used for data type, arithmetic, and intrinsics.
STATISTICS    on, off                 off          View statistics, like density etc., during simulation.
TARCH         --                      --           Via TARCH the architecture the compiler generates code for can be overridden. The value depends on the chosen compiler.
VERIFICATION  on, off                 off          Turn verification on/off.
VTK_OUTPUT    on, off                 off          Enable/disable VTK file output.
============= ======================= ============ ==========================================================

Invocation
==========

Running the binary prints, along with the GPL licence header, a line like the following: ::

  LBM Benchmark Kernels 0.1, compiled Jul 5 2017 21:59:22, type: verification

if verification was enabled during compilation or ::

  LBM Benchmark Kernels 0.1, compiled Jul 5 2017 21:59:22, type: benchmark

if verification was disabled during compilation.

Command Line Parameters
-----------------------

Running the binary with ``-h`` lists all available parameters: ::

  Usage:
  ./lbmbenchk -list
  ./lbmbenchk
      [-dims XxYyZ] [-geometry box|channel|pipe|blocks[-<block size>]] [-iterations <iterations>] [-lattice-dump-ascii]
      [-rho-in <density>] [-rho-out <density] [-omega <omega>] [-kernel <kernel>]
      [-periodic-x]
      [-t <number of threads>]
      [-pin core{,core}*]
      [-verify]
      -- <kernel specific parameters>

  -list           List available kernels.

  -dims XxYxZ     Specify geometry dimensions.

  -geometry blocks-<block size>
                  Geometetry with blocks of size <block size> regularily layout out.


If an option is specified multiple times the last one overrides previous ones.
This also holds true for ``-verify``, which sets geometry dimensions,
iterations, etc., which can afterwards be overridden, e.g.: ::

  $ bin/lbmbenchk-linux-intel-release-dp -verify -dims 32x32x32

Kernel specific parameters can be obtained via selecting the specific kernel
and passing ``-h`` as parameter: ::

  $ bin/lbmbenchk-linux-intel-release-dp -kernel kernel-name -- -h
  ...
  Kernel parameters:
  [-blk <n>] [-blk-[xyz] <n>]


A list of all available kernels can be obtained via ``-list``: ::

  $ ../bin/lbmbenchk-linux-gcc-debug-dp -list
  Lattice Boltzmann Benchmark Kernels (LbmBenchKernels) Copyright (C) 2016, 2017 LSS, RRZE
  This program comes with ABSOLUTELY NO WARRANTY; for details see LICENSE.
  This is free software, and you are welcome to redistribute it under certain conditions.

  LBM Benchmark Kernels 0.1, compiled Jul 5 2017 21:59:22, type: verification
  Available kernels to benchmark:
     list-aa-pv-soa
     list-aa-ria-soa
     list-aa-soa
     list-aa-aos
     list-pull-split-nt-1s-soa
     list-pull-split-nt-2s-soa
     list-push-soa
     list-push-aos
     list-pull-soa
     list-pull-aos
     push-soa
     push-aos
     pull-soa
     pull-aos
     blk-push-soa
     blk-push-aos
     blk-pull-soa
     blk-pull-aos

Kernels
-------

The following list briefly describes the available kernels:

- **push-soa/push-aos/pull-soa/pull-aos**:
  Unoptimized kernels (but stream/collide are already fused) using two grids as
  source and destination. They implement push/pull semantics as well as structure of
  arrays (soa) or array of structures (aos) layout.

- **blk-push-soa/blk-push-aos/blk-pull-soa/blk-pull-aos**:
  The same as the unoptimized kernels without the blk prefix, except that they support
  spatial blocking, i.e. loop blocking of the three loops used to iterate over
  the lattice. Here manual work sharing for OpenMP is used.

- **aa-aos/aa-soa**:
  Straightforward implementation of the AA pattern on the full array with blocking support.
  Manual work sharing for OpenMP is used. The domain is partitioned only along the x dimension.

- **aa-vec-soa/aa-vec-sl-soa**:
  Optimized AA kernel with intrinsics on the full array. aa-vec-sl-soa uses only
  one loop for iterating over the lattice instead of three nested ones.

- **list-push-soa/list-push-aos/list-pull-soa/list-pull-aos**:
  The same as the unoptimized kernels without the list prefix, but for indirect addressing.
  Here only a 1D vector is used to store the fluid nodes, omitting the
  obstacles. An adjacency list is used to recover the neighborhood associations.

- **list-pull-split-nt-1s-soa/list-pull-split-nt-2s-soa**:
  Optimized variant of list-pull-soa. Chunks of the lattice are processed at
  once. Postcollision values are written back via nontemporal stores in 18 (1s)
  or 9 (2s) loops.

- **list-aa-aos/list-aa-soa**:
  Unoptimized implementation of the AA pattern for the 1D vector with adjacency
  list. Array of structures (aos) and structure of arrays (soa)
  data layouts are supported.

- **list-aa-ria-soa**:
  Implementation of the AA pattern with intrinsics for the 1D vector with adjacency
  list. Furthermore it contains a vectorized even time step and run length
  coding to reduce the loop balance of the odd time step.

- **list-aa-pv-soa**:
  All optimizations of list-aa-ria-soa. Additionally, the odd time step is
  partially vectorized.


Note that all array of structures (aos) kernels might require blocking
(depending on the domain size) to reach the performance of their structure of
arrays (soa) counterparts.
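
The difference between the two data layouts can be illustrated by the index of a single PDF in a flat array. The following Python sketch is for illustration only: the function names and the exact addressing are assumptions and are not taken from the benchmark sources.

```python
N_DIRECTIONS = 19  # D3Q19: 19 PDFs per lattice node


def index_aos(node, direction):
    """Array of structures: the 19 PDFs of one node are stored contiguously."""
    return node * N_DIRECTIONS + direction


def index_soa(node, direction, n_nodes):
    """Structure of arrays: all nodes of one direction are stored contiguously."""
    return direction * n_nodes + node


if __name__ == "__main__":
    n_nodes = 1000  # hypothetical lattice size
    # Stride between neighbouring nodes for the same direction.
    print("AoS stride:", index_aos(1, 3) - index_aos(0, 3))
    print("SoA stride:", index_soa(1, 3, n_nodes) - index_soa(0, 3, n_nodes))
```

With SoA, consecutive nodes of one direction are adjacent in memory, whereas with AoS they are 19 elements apart.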

The following table summarizes the properties of the kernels. Here **D** means
direct addressing, i.e. full array, **I** means indirect addressing, i.e. 1D
vector with adjacency list, **x** means supported, whereas **--** means unsupported.
In the propagation step column **OS** marks the kernels that use two grids
(source and destination) and **AA** the kernels that implement the AA pattern.
The loop balance B_l is computed for the D3Q19 model with **double precision** floating
point for PDFs (8 byte) and 4 byte integers for the index (adjacency list).
As list-aa-ria-soa and list-aa-pv-soa support run length coding their effective
loop balance depends on the geometry. The effective loop balance is printed
during each run.


====================== =========== =========== ===== ======== ======== ============
kernel name            prop. step  data layout addr. parallel blocking B_l [B/FLUP]
====================== =========== =========== ===== ======== ======== ============
push-soa               OS          SoA         D     x        --       456
push-aos               OS          AoS         D     x        --       456
pull-soa               OS          SoA         D     x        --       456
pull-aos               OS          AoS         D     x        --       456
blk-push-soa           OS          SoA         D     x        x        456
blk-push-aos           OS          AoS         D     x        x        456
blk-pull-soa           OS          SoA         D     x        x        456
blk-pull-aos           OS          AoS         D     x        x        456
aa-soa                 AA          SoA         D     x        x        304
aa-aos                 AA          AoS         D     x        x        304
aa-vec-soa             AA          SoA         D     x        x        304
aa-vec-sl-soa          AA          SoA         D     x        x        304
list-push-soa          OS          SoA         I     x        x        528
list-push-aos          OS          AoS         I     x        x        528
list-pull-soa          OS          SoA         I     x        x        528
list-pull-aos          OS          AoS         I     x        x        528
list-pull-split-nt-1s  OS          SoA         I     x        x        376
list-pull-split-nt-2s  OS          SoA         I     x        x        376
list-aa-soa            AA          SoA         I     x        x        340
list-aa-aos            AA          AoS         I     x        x        340
list-aa-ria-soa        AA          SoA         I     x        x        304-342
list-aa-pv-soa         AA          SoA         I     x        x        304-342
====================== =========== =========== ===== ======== ======== ============
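
Most loop balance values in the table can be reproduced with simple byte counting. The Python sketch below shows one possible decomposition. The split into loads, stores, write allocates and index entries is an assumption made for illustration: it reproduces the tabulated numbers, but it is not taken from the benchmark sources. The run length coded kernels are left out since their balance depends on the geometry.

```python
Q = 19           # PDFs per node in the D3Q19 model
PDF_BYTES = 8    # double precision PDF
INDEX_BYTES = 4  # one adjacency list entry


def loop_balance(loads, stores, write_allocates, index_entries=0):
    """Bytes transferred per fluid lattice node update (B/FLUP)."""
    pdf_traffic = (loads + stores + write_allocates) * PDF_BYTES
    return pdf_traffic + index_entries * INDEX_BYTES


# Assumption: two-grid kernels load 19 PDFs, store 19 PDFs and cause
# 19 write allocates for the destination grid.
os_direct = loop_balance(Q, Q, Q)

# Assumption: AA kernels update in place, so no write allocate occurs.
aa_direct = loop_balance(Q, Q, 0)

# Assumption: indirect addressing adds 18 index entries per node.
os_list = loop_balance(Q, Q, Q, Q - 1)

# Assumption: nontemporal stores avoid the write allocate.
os_list_nt = loop_balance(Q, Q, 0, Q - 1)

# Assumption: the adjacency list is only read in the odd time step,
# i.e. 9 index entries per update on average.
aa_list = loop_balance(Q, Q, 0, (Q - 1) / 2)

if __name__ == "__main__":
    print(os_direct, aa_direct, os_list, os_list_nt, aa_list)
```

For single precision the PDF size would be 4 bytes, which changes all values accordingly.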

Benchmarking
============

Correct benchmarking is a nontrivial task. Whenever benchmark results are to be
created make sure that:

- the binary was compiled with ``BENCHMARK=on`` (default if not overridden),
- the binary was compiled with ``BUILD=release`` (default if not overridden),
- the correct ISA for the intrinsics macros is selected via ``ISA``, and
- ``TARCH`` specifies the architecture the compiler generates code for.

Intel Compiler
--------------

For the Intel compiler one can specify, depending on the target ISA extension:

- AVX: ``TARCH=-xAVX``
- AVX2 and FMA: ``TARCH=-xCORE-AVX2,-fma``
- AVX512: ``TARCH=-xCORE-AVX512``
- KNL: ``TARCH=-xMIC-AVX512``

Compiling for an architecture supporting AVX (Sandy Bridge, Ivy Bridge): ::

  make ISA=avx TARCH=-xAVX


Compiling for an architecture supporting AVX2 (Haswell, Broadwell): ::

  make ISA=avx TARCH=-xCORE-AVX2,-fma

WARNING: ISA is here still set to ``avx`` as the FMA intrinsics are currently not
implemented. This might change in the future.


Compiling for an architecture supporting AVX-512 (Skylake): ::

  make ISA=avx TARCH=-xCORE-AVX512

WARNING: ISA is here still set to ``avx`` as there is currently no implementation for the
AVX512 intrinsics. This might change in the future.


Pinning
-------

During benchmarking pinning should be used via the ``-pin`` parameter. Running
a benchmark with 10 threads pinned to the first 10 cores works like ::

  $ bin/lbmbenchk-linux-intel-release-dp ... -t 10 -pin $(seq -s , 0 9)
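
When benchmark runs are scripted, the comma separated core list for ``-pin`` can also be generated programmatically. The following Python sketch mirrors the ``seq`` call above. The helper names are hypothetical, and it assumes that consecutive core IDs are the desired pinning, which depends on the core numbering of the machine.

```python
def pin_list(first_core, n_threads):
    """Comma separated list of n_threads consecutive core IDs."""
    cores = range(first_core, first_core + n_threads)
    return ",".join(str(core) for core in cores)


def thread_args(n_threads, first_core=0):
    """Arguments for -t and -pin as documented for the benchmark binary."""
    return ["-t", str(n_threads), "-pin", pin_list(first_core, n_threads)]


if __name__ == "__main__":
    # Same core list as produced by: seq -s , 0 9
    print(" ".join(thread_args(10)))
```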


General Remarks
---------------

Things the binary does not check or control:

- Transparent huge pages (THP): when allocating memory small 4 KiB pages might be
  replaced with larger ones. This is in general a good thing, but whether it
  actually happens depends on the system settings (check e.g. the status of
  ``/sys/kernel/mm/transparent_hugepage/enabled``).
  Currently ``madvise(MADV_HUGEPAGE)`` is used for allocations which are aligned to
  a 4 KiB page, which should be the case for the lattices.
  This should result in huge pages unless THP is disabled on the machine.
  (NOTE: ``madvise()`` is used if ``HAVE_HUGE_PAGES`` is defined, which is currently
  hard coded in ``Memory.c``.)

- CPU/core frequency: For reproducible results the frequency of all cores
  should be fixed.

- NUMA placement policy: The benchmark assumes a first touch policy, which
  means the memory will be placed in the NUMA domain the touching core is
  associated with. If a different policy is in place or the NUMA domain to be
  used is already full, memory might be allocated in a remote domain. Accesses
  to remote domains typically have a higher latency and lower bandwidth.

- System load: Interference with other applications, especially on desktop
  systems, should be avoided.

- Padding: For SoA based kernels the number of (fluid) nodes is automatically
  adjusted so that no cache or TLB thrashing should occur. The parameters are
  optimized for current Intel based systems. For more details see the
  Padding section.

- CPU dispatcher function: The compiler might add different versions of a
  function for different ISA extensions. Make sure the code you think is
  executed is actually the code which is executed.
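
The THP setting mentioned in the first item can be checked before a run. The sysfs file lists all modes and marks the active one with brackets, e.g. ``always [madvise] never``. The following Python sketch extracts the active mode. The function names are hypothetical and the check is not part of the benchmark itself.

```python
import re

THP_ENABLED_PATH = "/sys/kernel/mm/transparent_hugepage/enabled"


def active_thp_mode(text):
    """Return the mode enclosed in brackets, or None if none is marked."""
    match = re.search(r"\[(\w+)\]", text)
    return match.group(1) if match else None


def read_thp_mode(path=THP_ENABLED_PATH):
    """Read the active THP mode, or None if the file cannot be read."""
    try:
        with open(path) as handle:
            return active_thp_mode(handle.read())
    except OSError:
        return None


if __name__ == "__main__":
    mode = read_thp_mode()
    # With "never", madvise(MADV_HUGEPAGE) will not yield huge pages.
    print("THP mode:", mode)
```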

Padding
-------

With correct padding cache and TLB thrashing can be avoided. For this purpose the
number of (fluid) nodes used in the data layout is artificially increased.

Currently automatic padding is active for kernels which support it. It can be
controlled via the kernel parameter (i.e. a parameter after the ``--``)
``-pad``. Supported values are ``auto`` (default), ``no`` (to disable padding),
or a manual padding string.

Automatic padding tries to avoid cache and TLB thrashing and pads for a 32
entry (huge pages) TLB with 8 sets and a 512 set (L2) cache. This reflects the
parameters of current Intel based processors.

Manual padding is done via a padding string of the format
``mod_1+offset_1(,mod_n+offset_n)``, which specifies numbers of bytes.
SoA data layouts can exhibit TLB thrashing. Therefore we want to distribute the
19 pages with one lattice (36 with two lattices) we are concurrently accessing
over as many sets in the TLB as possible.
This is controlled by the distance between the accessed pages, which is the
number of (fluid) nodes in between them and can be adjusted by adding further
(fluid) nodes.
We want the distance d (in bytes) between two accessed pages to be e.g.
**d % (PAGE_SIZE * TLB_SETS) = PAGE_SIZE**.
This would distribute the pages evenly over the sets. Here **PAGE_SIZE * TLB_SETS**
is our ``mod_1`` and **PAGE_SIZE** (after the =) our ``offset_1``.
Measurements show that with only a quarter or half of a page size as offset
higher performance is achieved, which is what automatic padding does.
On top of this padding more paddings can be added. They are simply appended to the
padding string and are separated by commas.

A zero modulus in the padding string has a special meaning. Here the
corresponding offset is just added to the number of nodes. A padding string
like ``-pad 0+16`` would add a static padding of two nodes (one node = 8 bytes).
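
The effect of a padding string can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the implementation used by the kernels: the function names are hypothetical, the distance d is taken to be the number of nodes times the size of one PDF, and all moduli and offsets are assumed to be multiples of that size.

```python
def parse_padding(padding):
    """Parse 'mod_1+offset_1,mod_2+offset_2' into (mod, offset) byte pairs."""
    pairs = []
    for item in padding.split(","):
        mod, offset = item.split("+")
        pairs.append((int(mod), int(offset)))
    return pairs


def padded_nodes(n_nodes, padding, bytes_per_node=8):
    """Increase n_nodes until every (mod, offset) rule is fulfilled in turn."""
    for mod, offset in parse_padding(padding):
        if mod % bytes_per_node or offset % bytes_per_node:
            raise ValueError("mod and offset must be multiples of the node size")
        if mod == 0:
            # Zero modulus: the offset is added as static padding.
            n_nodes += offset // bytes_per_node
        else:
            # Grow d = n_nodes * bytes_per_node until d % mod == offset.
            distance = n_nodes * bytes_per_node
            missing = (offset - distance) % mod
            n_nodes += missing // bytes_per_node
    return n_nodes


if __name__ == "__main__":
    page_size = 2 * 1024 * 1024  # assumed huge page size of 2 MiB
    tlb_sets = 8
    rule = "%d+%d" % (page_size * tlb_sets, page_size)
    print(padded_nodes(500 * 100 * 100, rule))
    print(padded_nodes(100, "0+16"))
```

Note that later rules in the string are applied to the already padded node count, so they can disturb the condition established by an earlier rule.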


Geometries
==========

TODO: supported geometries: channel, pipe, blocks, fluid


Performance Results
===================

This section lists performance values measured on several machines for
different kernels and geometries and **double precision** floating point data/arithmetic.
The **RFM** column denotes the expected performance as predicted by the
Roofline performance model [williams-2008]_.
For the performance prediction of each kernel a memory bandwidth benchmark is used
which mimics the kernel's memory access pattern and the kernel's loop balance
(see the Kernels_ section for details).
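
For a memory bound kernel the Roofline prediction reduces to the measured memory bandwidth divided by the loop balance. The following Python sketch shows this calculation. It assumes that 1 GB equals 10^9 bytes, and the pairing of a bandwidth benchmark with a kernel in the example is an assumption made for illustration.

```python
def roofline_mflups(bandwidth_gb_per_s, loop_balance_bytes):
    """Upper performance bound in MFLUP/s for a memory bound kernel."""
    bytes_per_second = bandwidth_gb_per_s * 1e9
    return bytes_per_second / loop_balance_bytes / 1e6


if __name__ == "__main__":
    # Ivy Bridge values from the machine specifications in this document.
    # Assumption: copy-19 models the two-grid kernels (456 B/FLUP) and
    # update-19 models the AA kernels (304 B/FLUP).
    print("two-grid: %.1f MFLUP/s" % roofline_mflups(32.7, 456))
    print("AA:       %.1f MFLUP/s" % roofline_mflups(37.4, 304))
```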

Machine Specifications
----------------------

**Ivy Bridge, Intel Xeon E5-2660 v2**

- Ivy Bridge architecture, AVX
- 10 cores, 2.2 GHz
- SMT enabled
- memory bandwidth:

  - copy-19 32.7 GB/s
  - copy-19-nt-sl 35.6 GB/s
  - update-19 37.4 GB/s

**Haswell, Intel Xeon E5-2695 v3**

- Haswell architecture, AVX2, FMA
- 14 cores, 2.3 GHz
- 2 x 7 cores, cluster-on-die (CoD) mode enabled
- SMT enabled
- memory bandwidth:

  - copy-19 47.3 GB/s
  - copy-19-nt-sl 47.1 GB/s
  - update-19 44.0 GB/s


**Broadwell, Intel Xeon E5-2630 v4**

- Broadwell architecture, AVX2, FMA
- 10 cores, 2.2 GHz
- SMT disabled
- memory bandwidth:

  - copy-19 48.0 GB/s
  - copy-19-nt-sl 48.2 GB/s
  - update-19 51.1 GB/s

**Skylake, Intel Xeon Gold 6148**

NOTE: currently we only use AVX2 intrinsics.

- Skylake server architecture, AVX2, AVX512, 2 FMA units
- 20 cores, 2.4 GHz
- SMT enabled
- memory bandwidth:

  - copy-19 89.7 GB/s
  - copy-19-nt-sl 92.4 GB/s
  - update-19 93.6 GB/s

**Zen, AMD EPYC 7451**

- Zen architecture, AVX2, FMA
- 24 cores, 2.3 GHz
- SMT enabled
- memory bandwidth:

  - copy-19 111.9 GB/s
  - copy-19-nt-sl 111.7 GB/s
  - update-19 109.2 GB/s

**Zen, AMD Ryzen 7 1700X**

- Zen architecture, AVX2, FMA
- 8 cores, 3.4 GHz
- SMT enabled
- memory bandwidth:

  - copy-19 27.2 GB/s
  - copy-19-nt-sl 27.1 GB/s
  - update-19 26.1 GB/s

Single Socket Results
---------------------

- Geometry dimensions are 500x100x100 nodes for all measurements.
- Note the **different scaling on the y axis** of the plots!
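
To put the geometry size into perspective, the memory required for the PDFs can be estimated from the dimensions. The following Python sketch assumes a full array without padding or ghost layers, so real allocations will differ somewhat. The function names are hypothetical.

```python
def parse_dims(text):
    """Parse a dimension string of the form 'XxYxZ' into three integers."""
    x, y, z = (int(value) for value in text.split("x"))
    return x, y, z


def lattice_bytes(dims, n_lattices, bytes_per_pdf=8, q=19):
    """Bytes needed for the PDFs of n_lattices full D3Q19 lattices."""
    x, y, z = dims
    return x * y * z * q * bytes_per_pdf * n_lattices


if __name__ == "__main__":
    dims = parse_dims("500x100x100")
    # Two-grid kernels need two lattices, AA kernels only one.
    print("two grids, dp: %.0f MB" % (lattice_bytes(dims, 2) / 1e6))
    print("one grid, dp:  %.0f MB" % (lattice_bytes(dims, 1) / 1e6))
    print("one grid, sp:  %.0f MB" % (lattice_bytes(dims, 1, 4) / 1e6))
```

In all cases the data set is far larger than the caches of the listed machines, so the kernels are expected to be limited by main memory bandwidth.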

.. |perf_emmy_dp| image:: images/benchmark-emmy-dp.png
   :scale: 50 %
.. |perf_emmy_sp| image:: images/benchmark-emmy-sp.png
   :scale: 50 %
.. |perf_hasep1_dp| image:: images/benchmark-hasep1-dp.png
   :scale: 50 %
.. |perf_hasep1_sp| image:: images/benchmark-hasep1-sp.png
   :scale: 50 %
.. |perf_meggie_dp| image:: images/benchmark-meggie-dp.png
   :scale: 50 %
.. |perf_meggie_sp| image:: images/benchmark-meggie-sp.png
   :scale: 50 %
.. |perf_skylakesp2_dp| image:: images/benchmark-skylakesp2-dp.png
   :scale: 50 %
.. |perf_skylakesp2_sp| image:: images/benchmark-skylakesp2-sp.png
   :scale: 50 %
.. |perf_summitridge1_dp| image:: images/benchmark-summitridge1-dp.png
   :scale: 50 %
.. |perf_summitridge1_sp| image:: images/benchmark-summitridge1-sp.png
   :scale: 50 %
.. |perf_naples1_dp| image:: images/benchmark-naples1-dp.png
   :scale: 50 %
.. |perf_naples1_sp| image:: images/benchmark-naples1-sp.png
   :scale: 50 %

.. list-table::

  * - Ivy Bridge, Intel Xeon E5-2660 v2, Double Precision
  * - |perf_emmy_dp|
  * - Ivy Bridge, Intel Xeon E5-2660 v2, Single Precision
  * - |perf_emmy_sp|
  * - Haswell, Intel Xeon E5-2695 v3, Double Precision
  * - |perf_hasep1_dp|
  * - Haswell, Intel Xeon E5-2695 v3, Single Precision
  * - |perf_hasep1_sp|
  * - Broadwell, Intel Xeon E5-2630 v4, Double Precision
  * - |perf_meggie_dp|
  * - Broadwell, Intel Xeon E5-2630 v4, Single Precision
  * - |perf_meggie_sp|
  * - Skylake, Intel Xeon Gold 6148, Double Precision, **NOTE: currently we only use AVX2 intrinsics.**
  * - |perf_skylakesp2_dp|
  * - Skylake, Intel Xeon Gold 6148, Single Precision, **NOTE: currently we only use AVX2 intrinsics.**
  * - |perf_skylakesp2_sp|
  * - Zen, AMD Ryzen 7 1700X, Double Precision
  * - |perf_summitridge1_dp|
  * - Zen, AMD Ryzen 7 1700X, Single Precision
  * - |perf_summitridge1_sp|
  * - Zen, AMD EPYC 7451, Double Precision
  * - |perf_naples1_dp|
  * - Zen, AMD EPYC 7451, Single Precision
  * - |perf_naples1_sp|


Licence
=======

The Lattice Boltzmann Benchmark Kernels are licensed under GPLv3.


Acknowledgements
================

This work was funded by BMBF, grant no. 01IH15003A (project SKAMPY).

This work was funded by KONWIHR project OMI4PAPS.


Bibliography
============

.. [ginzburg-2008]
   I. Ginzburg, F. Verhaeghe, and D. d'Humières.
   Two-relaxation-time lattice Boltzmann scheme: About parametrization, velocity, pressure and mixed boundary conditions.
   Commun. Comput. Phys., 3(2):427-478, 2008.

.. [williams-2008]
   S. Williams, A. Waterman, and D. Patterson.
   Roofline: an insightful visual performance model for multicore architectures.
   Commun. ACM, 52(4):65-76, Apr 2009. doi:10.1145/1498765.1498785


.. |datetime| date:: %Y-%m-%d %H:%M

Document was generated at |datetime|.
