1 .. # --------------------------------------------------------------------------
4 # Markus Wittmann, 2016-2017
5 # RRZE, University of Erlangen-Nuremberg, Germany
6 # markus.wittmann -at- fau.de or hpc -at- rrze.fau.de
9 # LSS, University of Erlangen-Nuremberg, Germany
11 # This file is part of the Lattice Boltzmann Benchmark Kernels (LbmBenchKernels).
13 # LbmBenchKernels is free software: you can redistribute it and/or modify
14 # it under the terms of the GNU General Public License as published by
15 # the Free Software Foundation, either version 3 of the License, or
16 # (at your option) any later version.
18 # LbmBenchKernels is distributed in the hope that it will be useful,
19 # but WITHOUT ANY WARRANTY; without even the implied warranty of
20 # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
21 # GNU General Public License for more details.
23 # You should have received a copy of the GNU General Public License
24 # along with LbmBenchKernels. If not, see <http://www.gnu.org/licenses/>.
26 # --------------------------------------------------------------------------
28 .. title:: LBM Benchmark Kernels Documentation
31 ===================================
32 LBM Benchmark Kernels Documentation
33 ===================================
41 The lattice Boltzmann (LBM) benchmark kernels are a collection of LBM kernel
**AS SUCH THE LBM BENCHMARK KERNELS ARE NOT A FULLY EQUIPPED CFD SOLVER AND SOLELY
SERVE THE PURPOSE OF STUDYING POSSIBLE PERFORMANCE OPTIMIZATIONS AND/OR
48 Currently all kernels utilize a D3Q19 discretization and the
49 two-relaxation-time (TRT) collision operator [ginzburg-2008]_.
50 All operations are carried out in double precision arithmetic.
55 The benchmark framework currently supports only Linux systems and the GCC and
56 Intel compilers. Every other configuration probably requires adjustment inside
57 the code and the makefiles. Furthermore some code might be platform or at least
The benchmark can be built via ``make`` from the ``src`` subdirectory. This
generates one binary which hosts all implemented benchmark kernels.
63 Binaries are located under the ``bin`` subdirectory and will have different names
64 depending on compiler and build configuration.
Compilation can target debug or release builds. Verification can be enabled
for either build type, but it increases the runtime and is therefore not
suited for benchmarking.
71 Debug and Verification
72 ----------------------
76 make BUILD=debug BENCHMARK=off
Running ``make`` with ``BUILD=debug`` builds the debug version of
the benchmark kernels: no optimizations are performed, line numbers and
debug symbols are included, and ``DEBUG`` is defined. The resulting
binary is placed in the ``bin`` subdirectory and named
``lbmbenchk-linux-<compiler>-debug``.
Specifying ``BENCHMARK=off`` turns on verification
(``VERIFICATION=on``), statistics (``STATISTICS=on``), and VTK output
(``VTK_OUTPUT=on``).
Please note that the generated binary will therefore
exhibit poor performance.
92 Release and Verification
93 ------------------------
Verification with debug builds can be extremely slow. Hence verification
capabilities can also be built into release builds: ::
104 To generate a binary for benchmarking run make with ::
By default ``BENCHMARK=on`` and ``BUILD=release`` are set, where
``BUILD=release`` turns optimizations on and ``BENCHMARK=on`` disables
verification, statistics, and VTK output.
See Options Summary below for a further description of the options which can be
applied, e.g. ``TARCH``, as well as the Benchmarking section.
Currently only the GCC and Intel compilers under Linux are supported. The
configuration is chosen via ``CONFIG=linux-gcc`` or
``CONFIG=linux-intel``.
126 For each configuration and build (debug/release) a subdirectory under the
127 ``src/obj`` directory is created where the dependency and object files are
131 make CONFIG=... BUILD=... clean
a specific combination is selected and cleaned, whereas with ::
137 all object and dependency files are deleted.
143 Options that can be specified when building the suite with make:
============= ======================= ============ ==========================================================
name          values                  default      description
============= ======================= ============ ==========================================================
BENCHMARK     on, off                 on           If enabled, disables VERIFICATION, STATISTICS, VTK_OUTPUT. If disabled enables the three former options.
BUILD         debug, release          release      debug: no optimization, debug symbols, DEBUG defined. release: optimizations enabled.
CONFIG        linux-gcc, linux-intel  linux-intel  Select GCC or Intel compiler.
ISA           avx, sse                avx          Determines which ISA extension is used for macro definitions of the intrinsics. This is *not* the architecture the compiler generates code for.
OPENMP        on, off                 on           OpenMP, i.e. threading support.
STATISTICS    on, off                 off          View statistics, like density etc., during simulation.
TARCH         --                      --           Via TARCH the architecture the compiler generates code for can be overridden. The value depends on the chosen compiler.
VERIFICATION  on, off                 off          Turn verification on/off.
VTK_OUTPUT    on, off                 off          Enable/disable VTK file output.
============= ======================= ============ ==========================================================
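
The way ``BENCHMARK`` drives the three dependent options can be summarized in
a short Python sketch. It only mirrors the rules stated in the table above;
the helper names ``effective_options`` and ``make_command`` are illustrative
and are not part of the build system.

.. code:: python

   def effective_options(benchmark="on"):
       """Return the settings implied by BENCHMARK, per the options table."""
       if benchmark not in ("on", "off"):
           raise ValueError("BENCHMARK must be 'on' or 'off'")
       # BENCHMARK=on switches the three options off, BENCHMARK=off switches them on.
       dependent = "off" if benchmark == "on" else "on"
       return {
           "VERIFICATION": dependent,
           "STATISTICS": dependent,
           "VTK_OUTPUT": dependent,
       }


   def make_command(**options):
       """Assemble a make invocation from NAME=value pairs (sorted by name)."""
       args = ["{}={}".format(name, value) for name, value in sorted(options.items())]
       return " ".join(["make"] + args)


   print(effective_options("off"))
   print(make_command(CONFIG="linux-gcc", BUILD="debug", BENCHMARK="off"))

The order of the ``NAME=value`` pairs on the ``make`` command line does not
matter; the sketch sorts them only to produce a stable string.
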
Running the binary prints, along with the GPL license header, a line like the following: ::
164 LBM Benchmark Kernels 0.1, compiled Jul 5 2017 21:59:22, type: verification
if verification was enabled during compilation or ::
168 LBM Benchmark Kernels 0.1, compiled Jul 5 2017 21:59:22, type: benchmark
if verification was disabled during compilation.
172 Command Line Parameters
173 -----------------------
Running the binary with ``-h`` lists all available parameters: ::
180 [-dims XxYyZ] [-geometry box|channel|pipe|blocks[-<block size>]] [-iterations <iterations>] [-lattice-dump-ascii]
181 [-rho-in <density>] [-rho-out <density] [-omega <omega>] [-kernel <kernel>]
183 [-t <number of threads>]
186 -- <kernel specific parameters>
188 -list List available kernels.
190 -dims XxYxZ Specify geometry dimensions.
192 -geometry blocks-<block size>
193 Geometetry with blocks of size <block size> regularily layout out.
If an option is specified multiple times, the last one overrides the previous ones.
This also holds for ``-verify``, which sets geometry dimensions,
iterations, etc., which can afterwards be overridden, e.g.: ::

  $ bin/lbmbenchk-linux-intel-release -verify -dims 32x32x32
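
The "last option wins" rule can be illustrated with a small left-to-right
parser. This is a sketch of the behavior described above, not the parser used
by the binary. In particular, the presets applied by ``-verify`` are
placeholders (``HYPOTHETICAL_VERIFY_PRESETS``), since the real values are not
listed here.

.. code:: python

   def parse_dims(text):
       """Parse a string of the form XxYxZ into three integers."""
       x, y, z = (int(part) for part in text.split("x"))
       return x, y, z


   # Placeholder presets; the values -verify really sets are an assumption.
   HYPOTHETICAL_VERIFY_PRESETS = {"dims": (16, 16, 16), "iterations": 10}


   def parse_args(argv):
       """Process options left to right so that later ones override earlier ones."""
       settings = {}
       i = 0
       while i < len(argv):
           arg = argv[i]
           if arg == "-verify":
               settings.update(HYPOTHETICAL_VERIFY_PRESETS)
               i += 1
           elif arg == "-dims":
               settings["dims"] = parse_dims(argv[i + 1])
               i += 2
           elif arg == "-iterations":
               settings["iterations"] = int(argv[i + 1])
               i += 2
           else:
               raise ValueError("unknown option: " + arg)
       return settings


   # -verify sets the presets first, then -dims overrides the dimensions.
   print(parse_args(["-verify", "-dims", "32x32x32"]))
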
Kernel-specific parameters can be obtained by selecting the specific kernel
and passing ``-h`` as parameter: ::
205 $ bin/lbmbenchk-linux-intel-release -kernel kernel-name -- -h
208 [-blk <n>] [-blk-[xyz] <n>]
211 A list of all available kernels can be obtained via ``-list``: ::
213 $ ../bin/lbmbenchk-linux-gcc-debug -list
214 Lattice Boltzmann Benchmark Kernels (LbmBenchKernels) Copyright (C) 2016, 2017 LSS, RRZE
215 This program comes with ABSOLUTELY NO WARRANTY; for details see LICENSE.
216 This is free software, and you are welcome to redistribute it under certain conditions.
218 LBM Benchmark Kernels 0.1, compiled Jul 5 2017 21:59:22, type: verification
219 Available kernels to benchmark:
224 list-pull-split-nt-1s-soa
225 list-pull-split-nt-2s-soa
The following list briefly describes the available kernels:
- push-soa/push-aos/pull-soa/pull-aos:
  Unoptimized kernels (but stream/collide are already fused) using two grids as
  source and destination. Implement push/pull semantics as well as structure of
  arrays (soa) or array of structures (aos) layout.
249 - blk-push-soa/blk-push-aos/blk-pull-soa/blk-pull-aos:
250 The same as the unoptimized kernels without the blk prefix, except that they support
251 spatial blocking, i.e. loop blocking of the three loops used to iterate over
252 the lattice. Here manual work sharing for OpenMP is used.
- list-push-soa/list-push-aos/list-pull-soa/list-pull-aos:
  The same as the unoptimized kernels without the list prefix, but for indirect addressing.
  Here only a 1D vector is used to store the fluid nodes, omitting the
  obstacles. An adjacency list is used to recover the neighborhood associations.
- list-pull-split-nt-1s-soa/list-pull-split-nt-2s-soa:
  Optimized variant of list-pull-soa. Chunks of the lattice are processed at
  once. Postcollision values are written back via nontemporal stores in 18 (1s)
- list-aa-aos/list-aa-soa:
  Unoptimized implementation of the AA pattern for the 1D vector with adjacency
  list. Both array of structures (aos) and structure of arrays (soa)
  data layouts are supported.
270 Implementation of AA pattern with intrinsics for the 1D vector with adjacency
271 list. Furthermore it contains a vectorized even time step and run length
272 coding to reduce the loop balance of the odd time step.
All optimizations of list-aa-ria-soa. Additionally, the odd time step is
partially vectorized.
Note that all array of structures (aos) kernels might require blocking
(depending on the domain size) to reach the performance of their structure of
arrays (soa) counterparts.
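
The difference between the two data layouts comes down to how a (node,
direction) pair is mapped to a position in memory. The following sketch shows
the usual textbook mapping for both layouts; the function names are
illustrative, and the indexing used in the actual source code may differ in
detail (e.g. because of padding).

.. code:: python

   N_DIRECTIONS = 19  # D3Q19: 19 PDFs per lattice node


   def index_aos(node, direction):
       """AoS: the 19 PDFs of one node are adjacent in memory."""
       return node * N_DIRECTIONS + direction


   def index_soa(node, direction, n_nodes):
       """SoA: the PDFs of one direction are adjacent for all nodes."""
       return direction * n_nodes + node


   n_nodes = 1000
   # Distance (in elements) between the same direction of two neighboring nodes:
   print(index_aos(1, 0) - index_aos(0, 0))                    # 19
   print(index_soa(1, 0, n_nodes) - index_soa(0, 0, n_nodes))  # 1
   # Distance (in elements) between two directions of the same node:
   print(index_aos(0, 1) - index_aos(0, 0))                    # 1
   print(index_soa(0, 1, n_nodes) - index_soa(0, 0, n_nodes))  # 1000

In the SoA case one node update touches 19 widely separated memory streams,
whereas in the AoS case consecutive nodes of one direction are 19 elements
apart.
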
283 The following table summarizes the properties of the kernels. Here **D** means
284 direct addressing, i.e. full array, **I** means indirect addressing, i.e. 1D
285 vector with adjacency list, **x** means supported, whereas **--** means unsupported.
286 The loop balance B_l is computed for D3Q19 model with double precision floating
287 point for PDFs (8 byte) and 4 byte integers for the index (adjacency list).
288 As list-aa-ria-soa and list-aa-pv-soa support run length coding their effective
289 loop balance depends on the geometry. The effective loop balance is printed
====================== =========== =========== ===== ======== ======== ============
kernel name            prop. step  data layout addr. parallel blocking B_l [B/FLUP]
====================== =========== =========== ===== ======== ======== ============
push-soa               OS          SoA         D     x        --       456
push-aos               OS          AoS         D     x        --       456
pull-soa               OS          SoA         D     x        --       456
pull-aos               OS          AoS         D     x        --       456
blk-push-soa           OS          SoA         D     x        x        456
blk-push-aos           OS          AoS         D     x        x        456
blk-pull-soa           OS          SoA         D     x        x        456
blk-pull-aos           OS          AoS         D     x        x        456
list-push-soa          OS          SoA         I     x        x        528
list-push-aos          OS          AoS         I     x        x        528
list-pull-soa          OS          SoA         I     x        x        528
list-pull-aos          OS          AoS         I     x        x        528
list-pull-split-nt-1s  OS          SoA         I     x        x        376
list-pull-split-nt-2s  OS          SoA         I     x        x        376
list-aa-soa            AA          SoA         I     x        x        340
list-aa-aos            AA          AoS         I     x        x        340
list-aa-ria-soa        AA          SoA         I     x        x        304-342
list-aa-pv-soa         AA          SoA         I     x        x        304-342
====================== =========== =========== ===== ======== ======== ============
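
Most loop balances in the table can be reproduced with a short calculation.
The sketch below rests on assumptions that are not spelled out above: every
node update loads and stores all 19 PDFs, regular (temporal) stores to a
second grid cause an additional write-allocate transfer, list kernels read 18
adjacency entries per node, and the AA pattern needs the adjacency list only
in every second time step. The geometry-dependent range of the run length
coded kernels is not modeled.

.. code:: python

   Q = 19           # PDFs per node (D3Q19)
   PDF_BYTES = 8    # double precision
   INDEX_BYTES = 4  # one adjacency list entry


   def loop_balance(write_allocate, index_entries):
       """Bytes transferred per fluid lattice node update (B/FLUP).

       Each PDF is loaded once and stored once. If write_allocate is
       true, every store additionally triggers a write-allocate load.
       """
       pdf_transfers = 2 * Q + (Q if write_allocate else 0)
       return pdf_transfers * PDF_BYTES + index_entries * INDEX_BYTES


   two_grid_direct = loop_balance(write_allocate=True, index_entries=0)
   two_grid_list = loop_balance(write_allocate=True, index_entries=18)
   list_nt_stores = loop_balance(write_allocate=False, index_entries=18)
   # AA pattern: in-place update, adjacency list read every other step (18 / 2).
   list_aa = loop_balance(write_allocate=False, index_entries=9)

   print(two_grid_direct, two_grid_list, list_nt_stores, list_aa)  # 456 528 376 340
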
Correct benchmarking is a nontrivial task. Whenever benchmark results are to be
created, make sure the binary was compiled with:

- ``BENCHMARK=on`` (default if not overridden),
- ``BUILD=release`` (default if not overridden),
- the correct ISA for macros, selected via ``ISA``, and
- ``TARCH`` set to the architecture the compiler generates code for.
330 For the Intel compiler one can specify depending on the target ISA extension:
332 - AVX: ``TARCH=-xAVX``
333 - AVX2 and FMA: ``TARCH=-xCORE-AVX2,-fma``
334 - AVX512: ``TARCH=-xCORE-AVX512``
335 - KNL: ``TARCH=-xMIC-AVX512``
337 Compiling for an architecture supporting AVX (Sandy Bridge, Ivy Bridge): ::
339 make ISA=avx TARCH=-xAVX
342 Compiling for an architecture supporting AVX2 (Haswell, Broadwell): ::
344 make ISA=avx TARCH=-xCORE-AVX2,-fma
WARNING: ISA is still set to ``avx`` here, as the FMA intrinsics are currently not
implemented. This might change in the future.
350 Compiling for an architecture supporting AVX-512 (Skylake): ::
352 make ISA=avx TARCH=-xCORE-AVX512
WARNING: ISA is still set to ``avx`` here, as there is currently no implementation of the
AVX512 intrinsics. This might change in the future.
During benchmarking, pinning should be used via the ``-pin`` parameter. Running
a benchmark with 10 threads pinned to the first 10 cores works like ::
364 $ bin/lbmbenchk-linux-intel-release ... -t 10 -pin $(seq -s , 0 9)
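
The comma-separated core list produced by ``seq -s , 0 9`` can also be
generated programmatically, e.g. when scripting a thread scaling run. The
helper names below are illustrative. The sketch assumes that consecutive core
IDs refer to distinct physical cores, which depends on the core numbering of
the machine (e.g. with SMT enabled).

.. code:: python

   def pin_list(n_threads, first_core=0):
       """Comma-separated core IDs, like `seq -s , 0 9` for ten threads."""
       return ",".join(str(first_core + i) for i in range(n_threads))


   def run_arguments(n_threads):
       """Arguments that start n_threads threads pinned to cores 0..n_threads-1."""
       return ["-t", str(n_threads), "-pin", pin_list(n_threads)]


   print(" ".join(run_arguments(10)))  # -t 10 -pin 0,1,2,3,4,5,6,7,8,9
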
Things the binary does not check or control:
- transparent huge pages: when allocating memory small 4 KiB pages might be
  replaced with larger ones. This is in general a good thing, but whether it
  actually happens depends on the system settings (check e.g. the status of
  ``/sys/kernel/mm/transparent_hugepage/enabled``).
  Currently ``madvise(MADV_HUGEPAGE)`` is used for allocations which are aligned to
  a 4 KiB page, which should be the case for the lattices.
  This should result in huge pages unless THP is disabled on the machine.
  (NOTE: ``madvise()`` is used if ``HAVE_HUGE_PAGES`` is defined, which is currently
  hard-coded in ``Memory.c``).
382 - CPU/core frequency: For reproducible results the frequency of all cores
- NUMA placement policy: The benchmark assumes a first touch policy, which
  means the memory will be placed at the NUMA domain the touching core is
  associated with. If a different policy is in place or the NUMA domain to be
  used is already full, memory might be allocated in a remote domain. Accesses
  to remote domains typically have a higher latency and lower bandwidth.
- System load: interference with other applications, especially on desktop
  systems, should be avoided.
394 - Padding: For SoA based kernels the number of (fluid) nodes is automatically
395 adjusted so that no cache or TLB thrashing should occur. The parameters are
396 optimized for current Intel based systems. For more details look into the
399 - CPU dispatcher function: the compiler might add different versions of a
400 function for different ISA extensions. Make sure the code you might think is
401 executed is actually the code which is executed.
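
As the binary does not check the transparent huge page setting itself, it can
be inspected before a run. On Linux the file
``/sys/kernel/mm/transparent_hugepage/enabled`` lists the available modes and
marks the active one with square brackets, e.g. ``always [madvise] never``.
The following sketch extracts the active mode; the function names are
illustrative.

.. code:: python

   def active_thp_mode(text):
       """Return the bracketed entry, e.g. 'madvise' for 'always [madvise] never'."""
       for token in text.split():
           if token.startswith("[") and token.endswith("]"):
               return token[1:-1]
       raise ValueError("no active mode found in: " + repr(text))


   def read_thp_mode(path="/sys/kernel/mm/transparent_hugepage/enabled"):
       """Read the current THP mode from sysfs (Linux only)."""
       with open(path) as handle:
           return active_thp_mode(handle.read())


   print(active_thp_mode("always [madvise] never"))  # madvise

With ``always`` or ``madvise`` the ``madvise(MADV_HUGEPAGE)`` request can be
honored; with ``never`` THP is disabled.
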
406 With correct padding cache and TLB thrashing can be avoided. Therefore the
407 number of (fluid) nodes used in the data layout is artificially increased.
409 Currently automatic padding is active for kernels which support it. It can be
410 controlled via the kernel parameter (i.e. parameter after the ``--``)
411 ``-pad``. Supported values are ``auto`` (default), ``no`` (to disable padding),
414 Automatic padding tries to avoid cache and TLB thrashing and pads for a 32
415 entry (huge pages) TLB with 8 sets and a 512 set (L2) cache. This reflects the
416 parameters of current Intel based processors.
Manual padding is done via a padding string and has the format
``mod_1+offset_1(,mod_n+offset_n)``, which specifies numbers of bytes.
SoA data layouts can exhibit TLB thrashing. Therefore we want to distribute the
19 pages with one lattice (36 with two lattices) we are concurrently accessing
over as many sets in the TLB as possible.
This is controlled by the distance between the accessed pages, which is the
number of (fluid) nodes in between them and can be adjusted by adding further
We want the distance d (in bytes) between two accessed pages to be e.g.
**d % (PAGE_SIZE * TLB_SETS) = PAGE_SIZE**.
This would distribute the pages evenly over the sets. Here **PAGE_SIZE * TLB_SETS**
would be our ``mod_1`` and **PAGE_SIZE** (after the =) our ``offset_1``.
Measurements show that with only a quarter or half of a page size as offset
higher performance is achieved, which is what automatic padding does.
Further paddings can be applied on top of this one. They are simply appended to the
padding string, separated by commas.
A zero modulus in the padding string has a special meaning. Here the
corresponding offset is just added to the number of nodes. A padding string
like ``-pad 0+16`` would add a static padding of two nodes (one node = 8 bytes).
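
The effect of a padding string can be sketched as follows. This is an
illustration of the rule **d % mod = offset** described above and not the
implementation used by the kernels. It assumes that d equals the number of
nodes times 8 bytes, that all moduli and offsets are multiples of 8 bytes, and
that the paddings are applied one after another. How the kernels combine
several paddings is not described here, and in this sketch a later padding can
disturb the condition established by an earlier one.

.. code:: python

   BYTES_PER_NODE = 8  # one double-precision value per node and direction


   def parse_padding(text):
       """Turn 'mod_1+offset_1,mod_2+offset_2' into [(mod, offset), ...]."""
       pads = []
       for item in text.split(","):
           mod, offset = item.split("+")
           pads.append((int(mod), int(offset)))
       return pads


   def padded_nodes(n_nodes, pads):
       """Node count after applying each (mod, offset) padding in order."""
       n_bytes = n_nodes * BYTES_PER_NODE
       for mod, offset in pads:
           if mod == 0:
               n_bytes += offset  # zero modulus: static padding
           else:
               # Grow by the smallest amount such that n_bytes % mod == offset.
               n_bytes += (offset - n_bytes % mod) % mod
       if n_bytes % BYTES_PER_NODE:
           raise ValueError("padding must be a multiple of the node size")
       return n_bytes // BYTES_PER_NODE


   print(padded_nodes(100, parse_padding("0+16")))  # 102

   # Example values only: 4 KiB pages and 8 TLB sets.
   page_size, tlb_sets = 4096, 8
   pads = parse_padding("{}+{}".format(page_size * tlb_sets, page_size))
   n = padded_nodes(500 * 100 * 100, pads)
   print(n, (n * BYTES_PER_NODE) % (page_size * tlb_sets) == page_size)
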
443 TODO: supported geometries: channel, pipe, blocks, fluid
This section lists performance values measured on several machines for
different kernels and geometries.
The **RFM** column denotes the expected performance as predicted by the
Roofline performance model [williams-2008]_.
For performance prediction of each kernel a memory bandwidth benchmark is used
which mimics the kernel's memory access pattern and the kernel's loop balance
(see [kernels]_ for details).
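
For a memory-bound kernel the Roofline prediction reduces to dividing the
measured memory bandwidth by the loop balance. The sketch below assumes that
the performance values are given in MFLUP/s, that 1 GB/s means 10^9 B/s, and
that the ``update-19`` bandwidth is the one paired with the AA kernels. With
these assumptions it reproduces **RFM** values from the tables below, e.g. for
list-aa-soa (340 B/FLUP) on the Haswell system (44.0 GB/s).

.. code:: python

   def roofline_mflups(bandwidth_gb_s, loop_balance_b_per_flup):
       """Memory-bound performance limit in MFLUP/s.

       bandwidth_gb_s:          measured memory bandwidth in GB/s (10^9 B/s)
       loop_balance_b_per_flup: bytes transferred per fluid lattice node update
       """
       flups_per_second = bandwidth_gb_s * 1e9 / loop_balance_b_per_flup
       return flups_per_second / 1e6


   # Haswell, update-19 bandwidth, list-aa-soa loop balance.
   print(round(roofline_mflups(44.0, 340)))  # 129
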
457 Haswell, Intel Xeon E5-2695 v3
458 ------------------------------
460 - Haswell architecture, AVX2, FMA
- 2 x 7 cores, cluster-on-die (CoD) mode enabled
468 - copy-19-nt-sl 47.1 GB/s
469 - update-19 44.0 GB/s
471 geometry dimensions: 500x100x100
473 ========================= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= =====
474 kernel pipe blocks-2 blocks-4 blocks-6 blocks-8 blocks-10 blocks-15 blocks-16 blocks-20 blocks-25 blocks-32 RFM
475 ========================= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= =====
476 blk-push-aos 58.82 49.85 57.34 59.90 61.37 62.17 65.30 64.00 67.54 64.46 69.69 104
477 blk-push-soa 32.32 33.46 34.02 34.64 35.06 35.04 36.31 35.44 37.20 35.14 37.95 104
478 blk-pull-aos 56.97 51.41 56.09 57.92 59.98 59.83 63.37 61.55 65.50 63.11 67.02 104
479 blk-pull-soa 49.29 46.23 47.50 51.97 51.27 49.52 55.23 53.13 54.50 49.79 57.90 104
480 aa-aos 91.35 66.14 76.80 84.76 83.63 91.36 93.46 92.62 93.91 92.25 92.93 145
481 aa-soa 75.51 65.68 70.94 71.36 73.83 75.46 74.84 79.48 83.28 77.70 82.72 145
482 aa-vec-soa 93.85 83.44 91.58 93.96 94.35 96.62 101.76 96.72 106.37 102.60 110.28 145
483 list-push-aos 80.29 80.97 80.95 81.10 81.37 82.44 81.77 81.49 80.72 81.93 80.93 83
484 list-push-soa 47.52 42.65 45.28 46.64 43.46 40.59 44.94 46.55 41.53 45.98 44.86 83
485 list-pull-aos 85.30 82.97 86.43 83.42 86.33 83.70 86.43 83.77 83.10 85.89 84.44 83
486 list-pull-soa 62.12 63.61 63.28 61.32 66.72 62.65 64.82 60.49 58.01 64.46 62.52 83
487 list-pull-split-nt-1s-soa 121.35 113.77 115.29 113.54 117.00 116.46 114.78 114.54 110.83 112.67 117.85 125
488 list-pull-split-nt-2s-soa 118.09 110.48 112.55 113.18 113.44 111.85 109.27 114.41 110.28 111.78 113.74 125
489 list-aa-aos 121.28 118.63 119.00 118.50 121.99 119.11 118.83 121.47 121.62 126.18 120.12 129
490 list-aa-soa 126.34 116.90 129.45 127.12 129.41 121.42 126.19 126.76 126.70 124.40 125.22 129
491 list-aa-ria-soa 133.68 121.82 126.04 128.46 131.15 132.25 128.78 133.50 126.69 124.40 130.37 145
492 list-aa-pv-soa 146.22 124.39 130.73 136.29 137.61 131.21 138.65 138.78 127.02 132.40 138.37 145
493 ========================= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= =====
496 Broadwell, Intel Xeon E5-2630 v4
497 --------------------------------
499 - Broadwell architecture, AVX2, FMA
- copy-19-nt-sl 48.2 GB/s
507 - update-19 51.1 GB/s
509 geometry dimensions: 500x100x100
511 ========================= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= =======
512 kernel pipe blocks-2 blocks-4 blocks-6 blocks-8 blocks-10 blocks-15 blocks-16 blocks-20 blocks-25 blocks-32 RFM
513 ========================= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= =======
514 blk-push-aos 55.75 47.62 54.57 57.10 58.49 59.00 61.72 60.56 64.05 61.10 66.03 105
515 blk-push-soa 30.06 31.09 32.13 32.54 32.74 32.72 33.81 33.19 34.90 33.21 35.75 105
516 blk-pull-aos 53.80 48.61 53.08 54.99 56.08 56.68 59.20 58.12 61.49 58.71 63.45 105
517 blk-pull-soa 46.96 46.61 48.84 49.70 50.33 50.46 52.36 51.39 54.20 51.61 55.71 105
518 aa-aos 91.40 66.99 78.47 83.38 86.62 88.62 92.98 91.54 97.08 94.93 98.90 168
519 aa-soa 83.01 69.96 75.85 77.72 79.01 79.29 82.38 80.11 85.70 83.91 87.69 168
520 aa-vec-soa 112.03 96.52 105.32 109.76 112.55 113.82 120.55 118.37 126.30 121.37 131.94 168
521 list-push-aos 75.13 74.18 75.20 75.42 75.24 75.99 75.80 75.80 75.54 76.22 76.21 97
522 list-push-soa 40.99 38.14 39.00 38.89 38.89 39.67 39.87 39.28 39.35 40.08 40.13 97
523 list-pull-aos 82.07 82.88 83.29 83.09 83.32 83.49 82.82 82.88 83.32 82.60 82.93 97
524 list-pull-soa 62.07 60.40 61.89 61.39 62.43 60.90 60.48 62.80 62.50 61.10 60.38 97
525 list-pull-split-nt-1s-soa 125.81 120.60 121.96 122.34 122.86 123.53 123.64 123.67 125.94 124.09 123.69 128
526 list-pull-split-nt-2s-soa 122.79 117.16 118.86 119.16 119.56 119.99 120.01 120.03 122.64 120.57 120.39 128
527 list-aa-aos 128.13 127.41 129.31 129.07 129.79 129.63 129.67 129.94 129.12 128.41 129.72 150
528 list-aa-soa 141.60 139.78 141.58 142.16 141.94 141.31 142.37 142.25 142.43 141.40 142.26 150
529 list-aa-ria-soa 141.82 134.88 140.15 140.72 141.67 140.51 141.18 141.29 142.97 141.94 143.25 168
530 list-aa-pv-soa 164.79 140.95 159.24 161.78 162.40 163.04 164.69 164.38 165.11 165.75 166.09 168
531 ========================= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= =======
534 Skylake, Intel Xeon Gold 6148
535 -----------------------------
537 - Skylake architecture, AVX2, FMA, AVX512
544 - copy-19-nt-sl 92.4 GB/s
545 - update-19 93.6 GB/s
547 geometry dimensions: 500x100x100
550 ========================= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ===
551 kernel pipe blocks-2 blocks-4 blocks-6 blocks-8 blocks-10 blocks-15 blocks-16 blocks-20 blocks-25 blocks-32 RFM
552 ========================= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ===
553 blk-push-aos 113.01 93.99 108.98 114.65 117.87 119.47 124.95 122.46 129.29 123.87 133.01 197
554 blk-push-soa 100.21 98.87 103.63 105.56 107.02 107.27 111.61 109.83 116.16 110.51 110.29 197
555 blk-pull-aos 118.45 102.54 114.12 117.82 122.69 124.31 130.58 127.85 135.72 129.65 139.94 197
556 blk-pull-soa 82.60 83.36 87.13 88.39 88.84 88.96 92.48 90.93 95.79 91.92 98.64 197
557 aa-aos 171.32 125.43 147.73 157.70 163.35 167.25 175.39 174.20 182.54 173.67 187.76 308
558 aa-soa 180.85 152.39 165.84 152.59 171.90 175.76 184.94 182.34 189.43 180.30 193.54 308
559 aa-vec-soa 208.03 181.51 195.86 203.41 209.08 212.34 224.05 219.49 234.31 225.92 245.22 308
560 list-push-aos 158.81 164.67 162.93 163.05 165.22 164.31 164.66 160.78 164.07 165.19 164.06 177
561 list-push-soa 134.60 110.44 110.17 132.01 132.95 133.46 134.37 134.33 135.12 134.91 137.87 177
562 list-pull-aos 169.61 170.03 170.89 170.90 171.20 171.60 172.09 171.95 169.48 172.08 171.02 177
563 list-pull-soa 120.50 116.73 118.62 118.00 120.99 118.15 117.17 121.41 120.83 120.00 118.74 177
564 list-pull-split-nt-1s-soa 225.59 224.18 225.10 226.34 226.01 230.37 227.50 228.42 227.39 231.65 227.35 246
565 list-pull-split-nt-2s-soa 219.20 214.63 217.61 218.13 219.07 221.01 219.88 220.09 220.62 221.68 220.58 246
566 list-aa-aos 241.39 239.27 239.53 242.56 242.46 243.00 242.91 242.46 241.24 242.96 241.52 275
567 list-aa-soa 273.73 268.49 268.48 271.79 275.29 274.56 277.18 272.67 274.21 275.24 278.21 275
568 list-aa-ria-soa 288.42 261.89 273.26 284.84 283.88 288.29 290.72 289.81 293.36 290.75 292.93 308
569 list-aa-pv-soa 303.35 267.21 289.18 294.96 294.36 298.16 300.45 301.71 302.37 302.88 304.46 308
570 ========================= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ===
575 The Lattice Boltzmann Benchmark Kernels are licensed under GPLv3.
581 This work was funded by BMBF, grant no. 01IH15003A (project SKAMPY).
583 This work was funded by KONWHIR project OMI4PAPS.
590 I. Ginzburg, F. Verhaeghe, and D. d'Humières.
591 Two-relaxation-time lattice Boltzmann scheme: About parametrization, velocity, pressure and mixed boundary conditions.
592 Commun. Comput. Phys., 3(2):427-478, 2008.
595 S. Williams, A. Waterman, and D. Patterson.
596 Roofline: an insightful visual performance model for multicore architectures.
597 Commun. ACM, 52(4):65-76, Apr 2009. doi:10.1145/1498765.1498785
600 .. |datetime| date:: %Y-%m-%d %H:%M
602 Document was generated at |datetime|.