1 | .. # -------------------------------------------------------------------------- |
2 | # | |
3 | # Copyright | |
4 | # Markus Wittmann, 2016-2017 | |
5 | # RRZE, University of Erlangen-Nuremberg, Germany | |
6 | # markus.wittmann -at- fau.de or hpc -at- rrze.fau.de | |
7 | # | |
8 | # Viktor Haag, 2016 | |
9 | # LSS, University of Erlangen-Nuremberg, Germany | |
10 | # | |
11 | # This file is part of the Lattice Boltzmann Benchmark Kernels (LbmBenchKernels). | |
12 | # | |
13 | # LbmBenchKernels is free software: you can redistribute it and/or modify | |
14 | # it under the terms of the GNU General Public License as published by | |
15 | # the Free Software Foundation, either version 3 of the License, or | |
16 | # (at your option) any later version. | |
17 | # | |
18 | # LbmBenchKernels is distributed in the hope that it will be useful, | |
19 | # but WITHOUT ANY WARRANTY; without even the implied warranty of | |
20 | # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the | |
21 | # GNU General Public License for more details. | |
22 | # | |
23 | # You should have received a copy of the GNU General Public License | |
24 | # along with LbmBenchKernels. If not, see <http://www.gnu.org/licenses/>. | |
25 | # | |
26 | # -------------------------------------------------------------------------- | |
27 | ||
28 | .. title:: LBM Benchmark Kernels Documentation | |
29 | ||
30 | ||
31 | =================================== | |
32 | LBM Benchmark Kernels Documentation | |
33 | =================================== | |
34 | ||
35 | .. sectnum:: | |
36 | .. contents:: | |
37 | ||
38 | Compilation | |
39 | =========== | |
40 | ||
41 | The benchmark framework currently supports only Linux systems and the GCC and | |
42 | Intel compilers. Every other configuration probably requires adjustment inside | |
43 | the code and the makefiles. Furthermore, some code might be platform specific or at least | |
44 | POSIX specific. | |
45 | ||
46 | The benchmark can be built via ``make`` from the ``src`` subdirectory. This will | |
47 | generate one binary which hosts all implemented benchmark kernels. | |
48 | ||
49 | Binaries are located under the ``bin`` subdirectory and will have different names | |
50 | depending on compiler and build configuration. | |
51 | ||
52 | Debug and Verification | |
53 | ---------------------- | |
54 | ||
55 | :: | |
56 | ||
57 | make BUILD=debug BENCHMARK=off | |
58 | ||
59 | Running ``make`` with ``BUILD=debug`` builds the debug version of | |
60 | the benchmark kernels: no optimizations are performed, line numbers and | |
61 | debug symbols are included, and ``DEBUG`` is defined. The resulting | |
62 | binary will be found in the ``bin`` subdirectory and named | |
63 | ``lbmbenchk-linux-<compiler>-debug``. | |
64 | ||
65 | Specifying ``BENCHMARK=off`` turns on verification | |
66 | (``VERIFICATION=on``), statistics (``STATISTICS=on``), and VTK output | |
67 | (``VTK_OUTPUT=on``). | |
68 | ||
69 | Please note that the generated binary will therefore | |
70 | exhibit poor performance. | |
71 | ||
72 | Benchmarking | |
73 | ------------ | |
74 | ||
75 | To generate a binary for benchmarking run make with :: | |
76 | ||
77 | make | |
78 | ||
79 | By default ``BENCHMARK=on`` and ``BUILD=release`` are set, where | |
80 | ``BUILD=release`` turns optimizations on and ``BENCHMARK=on`` disables | |
81 | verification, statistics, and VTK output. | |
82 | ||
83 | Release and Verification | |
84 | ------------------------ | |
85 | ||
86 | Verification with the debug builds can be extremely slow. Hence verification | |
87 | capabilities can also be built into release builds: :: | |
88 | ||
89 | make BENCHMARK=off | |
90 | ||
91 | Compilers | |
92 | --------- | |
93 | ||
94 | Currently only the GCC and Intel compilers under Linux are supported. The | |
95 | configuration is chosen via ``CONFIG=linux-gcc`` or | |
96 | ``CONFIG=linux-intel``. | |
97 | ||
98 | ||
99 | Cleaning | |
100 | -------- | |
101 | ||
102 | For each configuration and build (debug/release) a subdirectory under the | |
103 | ``src/obj`` directory is created where the dependency and object files are | |
104 | stored. | |
105 | With :: | |
106 | ||
107 | make CONFIG=... BUILD=... clean | |
108 | ||
109 | a specific combination is selected and cleaned, whereas with :: | |
110 | ||
111 | make clean-all | |
112 | ||
113 | all object and dependency files are deleted. | |
114 | ||
115 | ||
116 | Options Summary |
117 | --------------- | |
118 | ||
119 | Options that can be specified when building the framework with make: | |
120 | ||
121 | ============= ======================= ============ ========================================================== | |
122 | name          values                  default      description | |
123 | ============= ======================= ============ ========================================================== | |
124 | BENCHMARK     on, off                 on           If enabled, disables VERIFICATION, STATISTICS, VTK_OUTPUT. If disabled, enables these three options. | |
125 | BUILD         debug, release          release      debug: no optimization, debug symbols, DEBUG defined. release: optimizations enabled. | |
126 | CONFIG        linux-gcc, linux-intel  linux-intel  Select GCC or Intel compiler. | |
127 | ISA           avx, sse                avx          Determines which ISA extension is used for macro definitions. This is *not* the architecture the compiler generates code for. | |
128 | OPENMP        on, off                 on           OpenMP, i.e. threading support. | |
129 | STATISTICS    on, off                 off          View statistics, like density etc., during simulation. | |
130 | TARCH         --                      --           Via TARCH the architecture the compiler generates code for can be overridden. The value depends on the chosen compiler. | |
131 | VERIFICATION  on, off                 off          Turn verification on/off. | |
132 | VTK_OUTPUT    on, off                 off          Enable/disable VTK file output. | |
133 | ============= ======================= ============ ========================================================== | |
134 | ||
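How ``BENCHMARK`` drives the three dependent options can be sketched in a few lines of Python. This only illustrates the rules stated in the table above and is not the logic of the actual makefiles. The function name ``resolve_options`` is hypothetical, and the binary name pattern is inferred from the example binary names used in this document.

```python
def resolve_options(benchmark="on", build="release", config="linux-intel"):
    """Model of the documented defaults; not the real makefile logic."""
    if benchmark not in ("on", "off"):
        raise ValueError("BENCHMARK must be 'on' or 'off'")
    if build not in ("debug", "release"):
        raise ValueError("BUILD must be 'debug' or 'release'")

    # BENCHMARK=on switches the three options off, BENCHMARK=off switches them on.
    dependent = "off" if benchmark == "on" else "on"
    return {
        "VERIFICATION": dependent,
        "STATISTICS": dependent,
        "VTK_OUTPUT": dependent,
        # Assumed naming pattern: lbmbenchk-<config>-<build> below bin/.
        "binary": f"bin/lbmbenchk-{config}-{build}",
    }
```

For example, ``resolve_options(benchmark="off", build="debug", config="linux-gcc")`` describes the debug and verification build from above.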
135 | Invocation | |
136 | ========== | |
137 | ||
138 | Running the binary prints, in addition to the GPL licence header, a line like the following: :: | |
139 | ||
140 | LBM Benchmark Kernels 0.1, compiled Jul 5 2017 21:59:22, type: verification | |
141 | ||
142 | if verification was enabled during compilation or :: | |
143 | ||
144 | LBM Benchmark Kernels 0.1, compiled Jul 5 2017 21:59:22, type: benchmark | |
145 | ||
146 | if verification was disabled during compilation. | |
147 | ||
148 | Command Line Parameters | |
149 | ----------------------- | |
150 | ||
151 | Running the binary with ``-h`` lists all available parameters: :: | |
152 | ||
153 | Usage: | |
154 | ./lbmbenchk -list | |
155 | ./lbmbenchk | |
156 | [-dims XxYyZ] [-geometry box|channel|pipe|blocks[-<block size>]] [-iterations <iterations>] [-lattice-dump-ascii] | |
157 | [-rho-in <density>] [-rho-out <density] [-omega <omega>] [-kernel <kernel>] | |
158 | [-periodic-x] | |
159 | [-t <number of threads>] | |
160 | [-pin core{,core}*] | |
161 | [-verify] | |
162 | -- <kernel specific parameters> | |
163 | ||
164 | -list List available kernels. | |
165 | ||
166 | -dims XxYxZ Specify geometry dimensions. | |
167 | ||
168 | -geometry blocks-<block size> | |
169 | Geometetry with blocks of size <block size> regularily layout out. | |
170 | ||
171 | ||
172 | If an option is specified multiple times, the last one overrides previous ones. | |
173 | This also holds true for ``-verify``, which sets geometry dimensions, | |
174 | iterations, etc.; these can be overridden afterwards, e.g.: :: | |
175 | ||
176 | $ bin/lbmbenchk-linux-intel-release -verify -dims 32x32x32 | |
177 | ||
178 | Kernel specific parameters can be obtained by selecting the specific kernel | |
179 | and passing ``-h`` as parameter: :: | |
180 | ||
181 | $ bin/lbmbenchk-linux-intel-release -kernel kernel-name -- -h | |
182 | ... | |
183 | Kernel parameters: | |
184 | [-blk <n>] [-blk-[xyz] <n>] | |
185 | ||
186 | ||
187 | A list of all available kernels can be obtained via ``-list``: :: | |
188 | ||
189 | $ ../bin/lbmbenchk-linux-gcc-debug -list | |
190 | Lattice Boltzmann Benchmark Kernels (LbmBenchKernels) Copyright (C) 2016, 2017 LSS, RRZE | |
191 | This program comes with ABSOLUTELY NO WARRANTY; for details see LICENSE. | |
192 | This is free software, and you are welcome to redistribute it under certain conditions. | |
193 | ||
194 | LBM Benchmark Kernels 0.1, compiled Jul 5 2017 21:59:22, type: verification | |
195 | Available kernels to benchmark: | |
196 | list-aa-pv-soa | |
197 | list-aa-ria-soa | |
198 | list-aa-soa | |
199 | list-aa-aos | |
200 | list-pull-split-nt-1s-soa | |
201 | list-pull-split-nt-2s-soa | |
202 | list-push-soa | |
203 | list-push-aos | |
204 | list-pull-soa | |
205 | list-pull-aos | |
206 | push-soa | |
207 | push-aos | |
208 | pull-soa | |
209 | pull-aos | |
210 | blk-push-soa | |
211 | blk-push-aos | |
212 | blk-pull-soa | |
213 | blk-pull-aos | |
214 | ||
215 | Kernels |
216 | ------- | |
217 | ||
218 | The following list briefly describes the available kernels: | |
219 | ||
220 | - push-soa/push-aos/pull-soa/pull-aos: | |
221 | Unoptimized kernels (but stream/collide are already fused) using two grids as | |
222 | source and destination. They implement push/pull semantics as well as structure of | |
223 | arrays (soa) or array of structures (aos) layout. | |
224 | ||
225 | - blk-push-soa/blk-push-aos/blk-pull-soa/blk-pull-aos: | |
226 | The same as the unoptimized kernels without the blk prefix, except that they support | |
227 | spatial blocking, i.e. loop blocking of the three loops used to iterate over | |
228 | the lattice. Here manual work sharing for OpenMP is used. | |
229 | ||
230 | - list-push-soa/list-push-aos/list-pull-soa/list-pull-aos: | |
231 | The same as the unoptimized kernels without the list prefix, but for indirect addressing. | |
232 | Here only a 1D vector is used to store the fluid nodes, omitting the | |
233 | obstacles. An adjacency list is used to recover the neighborhood associations. | |
234 | ||
235 | - list-pull-split-nt-1s-soa/list-pull-split-nt-2s-soa: | |
236 | Optimized variant of list-pull-soa. Chunks of the lattice are processed at | |
237 | once. Postcollision values are written back via nontemporal stores in 18 (1s) | |
238 | or 9 (2s) loops. | |
239 | ||
240 | - list-aa-aos/list-aa-soa: | |
241 | Unoptimized implementation of the AA pattern for the 1D vector with adjacency | |
242 | list. Array of structures (aos) and structure of arrays (soa) | |
243 | data layouts are supported. | |
244 | ||
245 | - list-aa-ria-soa: | |
246 | Implementation of AA pattern with intrinsics for the 1D vector with adjacency | |
247 | list. Furthermore it contains a vectorized even time step and run length | |
248 | coding to reduce the loop balance of the odd time step. | |
249 | ||
250 | - list-aa-pv-soa: | |
251 | All optimizations of list-aa-ria-soa, additionally with partial vectorization | |
252 | of the odd time step. | |
253 | ||
254 | ||
255 | Note that all array of structures (aos) kernels might require blocking | |
256 | (depending on the domain size) to reach the performance of their structure of | |
257 | arrays (soa) counterparts. | |
258 | ||
259 | The following table summarizes the properties of the kernels. Here **D** means | |
260 | direct addressing, i.e. full array, **I** means indirect addressing, i.e. 1D | |
261 | vector with adjacency list, **x** means supported, whereas **--** means unsupported. | |
262 | The loop balance B_l is computed for the D3Q19 model with double precision floating | |
263 | point numbers for PDFs (8 bytes) and 4 byte integers for the index (adjacency list). | |
264 | As list-aa-ria-soa and list-aa-pv-soa support run length coding their effective | |
265 | loop balance depends on the geometry. The effective loop balance is printed | |
266 | during each run. | |
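The fixed loop balance values can be reproduced with a small calculation. The sketch below is an assumption about how the numbers decompose, chosen because it matches the table; it is not taken from the benchmark sources. It assumes that every update loads and stores all 19 PDFs, that regular stores cause a write-allocate transfer of the same size, that indirect addressing reads 18 adjacency entries per node, and that the AA pattern needs the adjacency list only in every second time step. The name ``loop_balance`` is hypothetical.

```python
# Byte counts per fluid lattice update (FLUP) for D3Q19 with double precision.
PDFS = 19          # PDFs per node in the D3Q19 model
PDF_BYTES = 8      # double precision floating point
INDEX_BYTES = 4    # one adjacency list entry


def loop_balance(write_allocate, index_entries):
    """Bytes moved per FLUP under the assumptions stated above."""
    loads = PDFS * PDF_BYTES
    stores = PDFS * PDF_BYTES
    # Assumption: a regular store first transfers the target cache line.
    allocate = PDFS * PDF_BYTES if write_allocate else 0
    return loads + stores + allocate + index_entries * INDEX_BYTES


two_grid_direct = loop_balance(True, 0)             # push/pull, blk-*
two_grid_list = loop_balance(True, PDFS - 1)        # list-push/list-pull
nontemporal_list = loop_balance(False, PDFS - 1)    # list-pull-split-nt-*
aa_list = loop_balance(False, (PDFS - 1) // 2)      # list-aa-soa/aos, averaged

print(two_grid_direct, two_grid_list, nontemporal_list, aa_list)
```

Under the same assumptions the lower bound of 304 B/FLUP for the run length coded kernels corresponds to ``loop_balance(False, 0)``, i.e. no adjacency list traffic at all.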
267 | ||
268 | ||
269 | ========================= =========== =========== ===== ======== ======== ============ | |
270 | kernel name               prop. step  data layout addr. parallel blocking B_l [B/FLUP] | |
271 | ========================= =========== =========== ===== ======== ======== ============ | |
272 | push-soa                  OS          SoA         D     x        --       456 | |
273 | push-aos                  OS          AoS         D     x        --       456 | |
274 | pull-soa                  OS          SoA         D     x        --       456 | |
275 | pull-aos                  OS          AoS         D     x        --       456 | |
276 | blk-push-soa              OS          SoA         D     x        x        456 | |
277 | blk-push-aos              OS          AoS         D     x        x        456 | |
278 | blk-pull-soa              OS          SoA         D     x        x        456 | |
279 | blk-pull-aos              OS          AoS         D     x        x        456 | |
280 | list-push-soa             OS          SoA         I     x        x        528 | |
281 | list-push-aos             OS          AoS         I     x        x        528 | |
282 | list-pull-soa             OS          SoA         I     x        x        528 | |
283 | list-pull-aos             OS          AoS         I     x        x        528 | |
284 | list-pull-split-nt-1s-soa OS          SoA         I     x        x        376 | |
285 | list-pull-split-nt-2s-soa OS          SoA         I     x        x        376 | |
286 | list-aa-soa               AA          SoA         I     x        x        340 | |
287 | list-aa-aos               AA          AoS         I     x        x        340 | |
288 | list-aa-ria-soa           AA          SoA         I     x        x        304-342 | |
289 | list-aa-pv-soa            AA          SoA         I     x        x        304-342 | |
290 | ========================= =========== =========== ===== ======== ======== ============ | |
291 | ||
292 | Benchmarking | |
293 | ============ | |
294 | ||
295 | Correct benchmarking is a nontrivial task. Whenever benchmark results are to be | |
296 | created, make sure the binary was compiled with: | |
297 | ||
298 | - ``BENCHMARK=on`` (default if not overridden), | |
299 | - ``BUILD=release`` (default if not overridden), | |
300 | - the correct ISA for macros, selected via ``ISA``, and | |
301 | - the architecture the compiler generates code for, specified via ``TARCH``. | |
302 | ||
303 | During benchmarking pinning should be used via the ``-pin`` parameter. Running | |
304 | a benchmark with 10 threads and pinning them to the first 10 cores works like this: :: | |
305 | ||
306 | $ bin/lbmbenchk-linux-intel-release ... -t 10 -pin $(seq -s , 0 9) | |
307 | ||
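When benchmark runs are scripted, the core list for ``-pin`` can be generated in the same way as with ``seq -s , 0 9``. The following Python sketch is only an illustration; the helper names ``pin_list`` and ``benchmark_args`` are hypothetical, and it assumes that the cores to be used are numbered consecutively, which depends on the system.

```python
def pin_list(n_threads, first_core=0):
    """Comma separated core IDs, e.g. '0,1,2' for three threads."""
    cores = range(first_core, first_core + n_threads)
    return ",".join(str(core) for core in cores)


def benchmark_args(n_threads, first_core=0):
    """Thread count and pinning arguments to append to the command line."""
    return ["-t", str(n_threads), "-pin", pin_list(n_threads, first_core)]


# Arguments for 10 threads pinned to cores 0 to 9.
args = benchmark_args(10)
```

The resulting list can be appended to the binary path and the remaining parameters, for example when launching the run via ``subprocess.run``.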
308 | Things the binary does not check or control: | |
309 | ||
310 | - transparent huge pages: when allocating memory, small 4 KiB pages might be | |
311 | replaced with larger ones. This is in general a good thing, but whether it | |
312 | actually happens depends on the system settings (check e.g. the status of | |
313 | ``/sys/kernel/mm/transparent_hugepage/enabled``). | |
314 | Currently ``madvise(MADV_HUGEPAGE)`` is used for allocations which are aligned to | |
315 | a 4 KiB page, which should be the case for the lattices. | |
316 | This should result in huge pages unless THP is disabled on the machine. | |
317 | (NOTE: ``madvise()`` is used if ``HAVE_HUGE_PAGES`` is defined, which is currently | |
318 | hard-coded in ``Memory.c``). | |
319 | ||
320 | - CPU/core frequency: For reproducible results the frequency of all cores | |
321 | should be fixed. | |
322 | ||
323 | - NUMA placement policy: The benchmark assumes a first touch policy, which | |
324 | means the memory will be placed at the NUMA domain the touching core is | |
325 | associated with. If a different policy is in place or the NUMA domain to be | |
326 | used is already full memory might be allocated in a remote domain. Accesses | |
327 | to remote domains typically have a higher latency and lower bandwidth. | |
328 | ||
329 | - System load: interference with other applications, especially on desktop | |
330 | systems, should be avoided. | |
331 | ||
332 | - Padding: For SoA based kernels the number of (fluid) nodes is automatically |
333 | adjusted so that no cache or TLB thrashing should occur. The parameters are | |
334 | optimized for current Intel based systems. For more details look into the | |
335 | padding section. | |
336 | |
337 | - CPU dispatcher function: the compiler might add different versions of a | |
338 | function for different ISA extensions. Make sure the code you think is | |
339 | executed is actually the code that is executed. | |
340 | ||
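The transparent huge page setting from the list above can be checked before a run. On Linux the file ``/sys/kernel/mm/transparent_hugepage/enabled`` lists the possible modes and marks the active one with square brackets, e.g. ``always [madvise] never``. The following Python sketch reads that file; the function names are hypothetical and the check is not part of the benchmark itself.

```python
import re
from pathlib import Path

THP_ENABLED = Path("/sys/kernel/mm/transparent_hugepage/enabled")


def parse_thp_mode(text):
    """Return the active mode, i.e. the entry in square brackets, or None."""
    match = re.search(r"\[(\w+)\]", text)
    return match.group(1) if match else None


def thp_mode(path=THP_ENABLED):
    """Active THP mode of this machine, or None if it cannot be determined."""
    try:
        return parse_thp_mode(path.read_text())
    except OSError:
        # The file is missing or unreadable, e.g. on a kernel without THP.
        return None
```

With the mode ``never`` the ``madvise(MADV_HUGEPAGE)`` request cannot result in huge pages, whereas ``always`` and ``madvise`` allow it.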
341 | Padding |
342 | ------- | |
343 | ||
344 | With correct padding cache and TLB thrashing can be avoided. Therefore the | |
345 | number of (fluid) nodes used in the data layout is artificially increased. | |
346 | ||
347 | Currently automatic padding is active for kernels which support it. It can be | |
348 | controlled via the kernel parameter (i.e. parameter after the ``--``) | |
349 | ``-pad``. Supported values are ``auto`` (default), ``no`` (to disable padding), | |
350 | or a manual padding string. | |
351 | ||
352 | Automatic padding tries to avoid cache and TLB thrashing and pads for a 32 | |
353 | entry (huge pages) TLB with 8 sets and a 512 set (L2) cache. This reflects the | |
354 | parameters of current Intel based processors. | |
355 | ||
356 | Manual padding is done via a padding string and has the format | |
357 | ``mod_1+offset_1(,mod_n+offset_n)``, which specifies numbers of bytes. | |
358 | SoA data layouts can exhibit TLB thrashing. Therefore we want to distribute the | |
359 | 19 pages with one lattice (36 with two lattices) we are concurrently accessing | |
360 | over as many sets in the TLB as possible. | |
361 | This is controlled by the distance between the accessed pages, which is the | |
362 | number of (fluid) nodes in between them and can be adjusted by adding further | |
363 | (fluid) nodes. | |
364 | We want the distance d (in bytes) between two accessed pages to be e.g. | |
365 | **d % (PAGE_SIZE * TLB_SETS) = PAGE_SIZE**. | |
366 | This would distribute the pages evenly over the sets. Hereby **PAGE_SIZE * TLB_SETS** | |
367 | would be our ``mod_1`` and **PAGE_SIZE** (after the =) our ``offset_1``. | |
368 | Measurements show that with only a quarter or half of a page size as offset | |
369 | higher performance is achieved, which is what automatic padding does. | |
370 | On top of this padding more paddings can be added. They are just added to the | |
371 | padding string and are separated by commas. | |
372 | ||
373 | A zero modulus in the padding string has a special meaning. Here the | |
374 | corresponding offset is just added to the number of nodes. A padding string | |
375 | like ``-pad 0+16`` would add a static padding of two nodes (one node = 8 bytes). | |
376 | ||
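The effect of a padding string on the number of nodes can be sketched in Python. This is one reading of the description above and not the code used by the benchmark. It assumes that each ``mod+offset`` pair increases the node count until the size in bytes modulo ``mod`` equals ``offset``, and that several pairs are applied one after another. The names ``parse_padding`` and ``pad_nodes`` are hypothetical.

```python
NODE_BYTES = 8  # one double precision value per node


def parse_padding(spec):
    """Parse 'mod_1+offset_1,mod_2+offset_2,...' into (mod, offset) pairs."""
    pairs = []
    for item in spec.split(","):
        mod, offset = item.split("+")
        pairs.append((int(mod), int(offset)))
    return pairs


def pad_nodes(n_nodes, spec):
    """Number of nodes after applying the padding string (assumed semantics)."""
    for mod, offset in parse_padding(spec):
        if mod % NODE_BYTES or offset % NODE_BYTES:
            raise ValueError("modulus and offset must be multiples of a node")
        if mod == 0:
            # Zero modulus: the offset is simply added as static padding.
            n_nodes += offset // NODE_BYTES
            continue
        rest = (n_nodes * NODE_BYTES) % mod
        # Bytes to add until (size in bytes) % mod == offset.
        n_nodes += ((offset - rest) % mod) // NODE_BYTES
    return n_nodes
```

For example, ``pad_nodes(1000, "0+16")`` yields 1002 nodes, i.e. the static padding of two nodes mentioned above.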
377 | ||
378 | Geometries | |
379 | ========== | |
380 | ||
381 | TODO: supported geometries: channel, pipe, blocks | |
382 | ||
383 | ||
384 | Results | |
385 | ======= | |
386 | ||
387 | TODO | |
388 | ||
389 | ||
390 | Licence | |
391 | ======= | |
392 | ||
393 | The Lattice Boltzmann Benchmark Kernels are licensed under GPLv3. | |
394 | ||
395 | |
396 | Acknowledgements | |
397 | ================ | |
398 | ||
399 | This work was funded by BMBF, grant no. 01IH15003A (project SKAMPY). | |
400 | ||
401 | This work was funded by KONWHIR project OMI4PAPS. | |
402 | ||
403 | ||
404 | ||
405 | .. |datetime| date:: %Y-%m-%d %H:%M |
406 | ||
407 | Document was generated at |datetime|. | |
408 |