| 1 | .. # -------------------------------------------------------------------------- |
| 2 | # |
| 3 | # Copyright |
| 4 | # Markus Wittmann, 2016-2017 |
| 5 | # RRZE, University of Erlangen-Nuremberg, Germany |
| 6 | # markus.wittmann -at- fau.de or hpc -at- rrze.fau.de |
| 7 | # |
| 8 | # Viktor Haag, 2016 |
| 9 | # LSS, University of Erlangen-Nuremberg, Germany |
| 10 | # |
| 11 | # This file is part of the Lattice Boltzmann Benchmark Kernels (LbmBenchKernels). |
| 12 | # |
| 13 | # LbmBenchKernels is free software: you can redistribute it and/or modify |
| 14 | # it under the terms of the GNU General Public License as published by |
| 15 | # the Free Software Foundation, either version 3 of the License, or |
| 16 | # (at your option) any later version. |
| 17 | # |
| 18 | # LbmBenchKernels is distributed in the hope that it will be useful, |
| 19 | # but WITHOUT ANY WARRANTY; without even the implied warranty of |
| 20 | # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the |
| 21 | # GNU General Public License for more details. |
| 22 | # |
| 23 | # You should have received a copy of the GNU General Public License |
| 24 | # along with LbmBenchKernels. If not, see <http://www.gnu.org/licenses/>. |
| 25 | # |
| 26 | # -------------------------------------------------------------------------- |
| 27 | |
| 28 | .. title:: LBM Benchmark Kernels Documentation |
| 29 | |
| 30 | |
| 31 | =================================== |
| 32 | LBM Benchmark Kernels Documentation |
| 33 | =================================== |
| 34 | |
| 35 | .. sectnum:: |
| 36 | .. contents:: |
| 37 | |
| 38 | Compilation |
| 39 | =========== |
| 40 | |
| 41 | The benchmark framework currently supports only Linux systems and the GCC and |
| 42 | Intel compilers. Every other configuration probably requires adjustment inside |
| 43 | the code and the makefiles. Furthermore, some code might be platform-specific or |
| 44 | at least POSIX-specific. |
| 45 | |
| 46 | The benchmark can be built via ``make`` from the ``src`` subdirectory. This will |
| 47 | generate one binary which hosts all implemented benchmark kernels. |
| 48 | |
| 49 | Binaries are located under the ``bin`` subdirectory and will have different names |
| 50 | depending on compiler and build configuration. |
| 51 | |
| 52 | Debug and Verification |
| 53 | ---------------------- |
| 54 | |
| 55 | :: |
| 56 | |
| 57 | make BUILD=debug BENCHMARK=off |
| 58 | |
| 59 | Running ``make`` with ``BUILD=debug`` builds the debug version of |
| 60 | the benchmark kernels: no optimizations are performed, line numbers and |
| 61 | debug symbols are included, and ``DEBUG`` is defined. The resulting |
| 62 | binary will be found in the ``bin`` subdirectory and named |
| 63 | ``lbmbenchk-linux-<compiler>-debug``. |
| 64 | |
| 65 | Specifying ``BENCHMARK=off`` turns on verification |
| 66 | (``VERIFICATION=on``), statistics (``STATISTICS=on``), and VTK output |
| 67 | (``VTK_OUTPUT=on``). |
| 68 | |
| 69 | Please note that the generated binary will therefore |
| 70 | exhibit poor performance. |
| 71 | |
| 72 | Benchmarking |
| 73 | ------------ |
| 74 | |
| 75 | To generate a binary for benchmarking run make with :: |
| 76 | |
| 77 | make |
| 78 | |
| 79 | By default ``BENCHMARK=on`` and ``BUILD=release`` are set, where |
| 80 | ``BUILD=release`` turns optimizations on and ``BENCHMARK=on`` disables |
| 81 | verification, statistics, and VTK output. |
| 82 | |
| 83 | Release and Verification |
| 84 | ------------------------ |
| 85 | |
| 86 | Verification with the debug builds can be extremely slow. Hence verification |
| 87 | capabilities can also be built into release builds: :: |
| 88 | |
| 89 | make BENCHMARK=off |
| 90 | |
| 91 | Compilers |
| 92 | --------- |
| 93 | |
| 94 | Currently only the GCC and Intel compilers under Linux are supported. The |
| 95 | configuration is chosen via ``CONFIG=linux-gcc`` or |
| 96 | ``CONFIG=linux-intel``. |
| 97 | |
| 98 | |
| 99 | Cleaning |
| 100 | -------- |
| 101 | |
| 102 | For each configuration and build (debug/release) a subdirectory under the |
| 103 | ``src/obj`` directory is created where the dependency and object files are |
| 104 | stored. |
| 105 | With :: |
| 106 | |
| 107 | make CONFIG=... BUILD=... clean |
| 108 | |
| 109 | a specific combination is selected and cleaned, whereas with :: |
| 110 | |
| 111 | make clean-all |
| 112 | |
| 113 | all object and dependency files are deleted. |
| 114 | |
| 115 | |
| 116 | Options Summary |
| 117 | --------------- |
| 118 | |
| 119 | Options that can be specified when building the framework with make: |
| 120 | |
| 121 | ============= ======================= ============ ========================================================== |
| 122 | name          values                  default      description |
| 123 | ============= ======================= ============ ========================================================== |
| 124 | BENCHMARK     on, off                 on           If enabled, disables VERIFICATION, STATISTICS, VTK_OUTPUT. If disabled, enables these three options. |
| 125 | BUILD         debug, release          release      debug: no optimization, debug symbols, DEBUG defined. release: optimizations enabled. |
| 126 | CONFIG        linux-gcc, linux-intel  linux-intel  Select GCC or Intel compiler. |
| 127 | ISA           avx, sse                avx          Determines which ISA extension is used for macro definitions. This is *not* the architecture the compiler generates code for. |
| 128 | OPENMP        on, off                 on           OpenMP, i.e. threading support. |
| 129 | STATISTICS    on, off                 off          View statistics, e.g. density, during simulation. |
| 130 | TARCH         --                      --           Via TARCH the architecture the compiler generates code for can be overridden. The value depends on the chosen compiler. |
| 131 | VERIFICATION  on, off                 off          Turn verification on/off. |
| 132 | VTK_OUTPUT    on, off                 off          Enable/disable VTK file output. |
| 133 | ============= ======================= ============ ========================================================== |
| 134 | |
| 135 | Invocation |
| 136 | ========== |
| 137 | |
| 138 | Running the binary prints, along with the GPL licence header, a line like the following: :: |
| 139 | |
| 140 | LBM Benchmark Kernels 0.1, compiled Jul 5 2017 21:59:22, type: verification |
| 141 | |
| 142 | if verification was enabled during compilation or :: |
| 143 | |
| 144 | LBM Benchmark Kernels 0.1, compiled Jul 5 2017 21:59:22, type: benchmark |
| 145 | |
| 146 | if verification was disabled during compilation. |
| 147 | |
| 148 | Command Line Parameters |
| 149 | ----------------------- |
| 150 | |
| 151 | Running the binary with ``-h`` lists all available parameters: :: |
| 152 | |
| 153 | Usage: |
| 154 | ./lbmbenchk -list |
| 155 | ./lbmbenchk |
| 156 | [-dims XxYyZ] [-geometry box|channel|pipe|blocks[-<block size>]] [-iterations <iterations>] [-lattice-dump-ascii] |
| 157 | [-rho-in <density>] [-rho-out <density] [-omega <omega>] [-kernel <kernel>] |
| 158 | [-periodic-x] |
| 159 | [-t <number of threads>] |
| 160 | [-pin core{,core}*] |
| 161 | [-verify] |
| 162 | -- <kernel specific parameters> |
| 163 | |
| 164 | -list List available kernels. |
| 165 | |
| 166 | -dims XxYxZ Specify geometry dimensions. |
| 167 | |
| 168 | -geometry blocks-<block size> |
| 169 | Geometetry with blocks of size <block size> regularily layout out. |
| 170 | |
| 171 | |
| 172 | If an option is specified multiple times the last one overrides previous ones. |
| 173 | This also holds for ``-verify``, which sets geometry dimensions, |
| 174 | iterations, etc.; these can be overridden afterwards, e.g.: :: |
| 175 | |
| 176 | $ bin/lbmbenchk-linux-intel-release -verify -dims 32x32x32 |
| 177 | |
| 178 | Kernel specific parameters can be obtained by selecting the specific kernel |
| 179 | and passing ``-h`` as parameter: :: |
| 180 | |
| 181 | $ bin/lbmbenchk-linux-intel-release -kernel kernel-name -- -h |
| 182 | ... |
| 183 | Kernel parameters: |
| 184 | [-blk <n>] [-blk-[xyz] <n>] |
| 185 | |
| 186 | |
| 187 | A list of all available kernels can be obtained via ``-list``: :: |
| 188 | |
| 189 | $ ../bin/lbmbenchk-linux-gcc-debug -list |
| 190 | Lattice Boltzmann Benchmark Kernels (LbmBenchKernels) Copyright (C) 2016, 2017 LSS, RRZE |
| 191 | This program comes with ABSOLUTELY NO WARRANTY; for details see LICENSE. |
| 192 | This is free software, and you are welcome to redistribute it under certain conditions. |
| 193 | |
| 194 | LBM Benchmark Kernels 0.1, compiled Jul 5 2017 21:59:22, type: verification |
| 195 | Available kernels to benchmark: |
| 196 | list-aa-pv-soa |
| 197 | list-aa-ria-soa |
| 198 | list-aa-soa |
| 199 | list-aa-aos |
| 200 | list-pull-split-nt-1s-soa |
| 201 | list-pull-split-nt-2s-soa |
| 202 | list-push-soa |
| 203 | list-push-aos |
| 204 | list-pull-soa |
| 205 | list-pull-aos |
| 206 | push-soa |
| 207 | push-aos |
| 208 | pull-soa |
| 209 | pull-aos |
| 210 | blk-push-soa |
| 211 | blk-push-aos |
| 212 | blk-pull-soa |
| 213 | blk-pull-aos |
| 214 | |
| 215 | Kernels |
| 216 | ------- |
| 217 | |
| 218 | The following list briefly describes the available kernels: |
| 219 | |
| 220 | - push-soa/push-aos/pull-soa/pull-aos: |
| 221 | Unoptimized kernels (but stream/collide are already fused) using two grids as |
| 222 | source and destination. They implement push/pull semantics as well as structure of |
| 223 | arrays (soa) or array of structures (aos) layout. |
| 224 | |
| 225 | - blk-push-soa/blk-push-aos/blk-pull-soa/blk-pull-aos: |
| 226 | The same as the unoptimized kernels without the blk prefix, except that they support |
| 227 | spatial blocking, i.e. loop blocking of the three loops used to iterate over |
| 228 | the lattice. Here manual work sharing for OpenMP is used. |
| 229 | |
| 230 | - list-push-soa/list-push-aos/list-pull-soa/list-pull-aos: |
| 231 | The same as the unoptimized kernels without the list prefix, but for indirect addressing. |
| 232 | Here only a 1D vector is used to store the fluid nodes, omitting the |
| 233 | obstacles. An adjacency list is used to recover the neighborhood associations. |
| 234 | |
| 235 | - list-pull-split-nt-1s-soa/list-pull-split-nt-2s-soa: |
| 236 | Optimized variant of list-pull-soa. Chunks of the lattice are processed at |
| 237 | once. Post-collision values are written back via nontemporal stores in 18 (1s) |
| 238 | or 9 (2s) loops. |
| 239 | |
| 240 | - list-aa-aos/list-aa-soa: |
| 241 | Unoptimized implementation of the AA pattern for the 1D vector with adjacency |
| 242 | list. Both array of structures (aos) and structure of arrays (soa) |
| 243 | data layouts are supported. |
| 244 | |
| 245 | - list-aa-ria-soa: |
| 246 | Implementation of the AA pattern with intrinsics for the 1D vector with adjacency |
| 247 | list. Furthermore, it contains a vectorized even time step and run length |
| 248 | coding to reduce the loop balance of the odd time step. |
| 249 | |
| 250 | - list-aa-pv-soa: |
| 251 | All optimizations of list-aa-ria-soa, plus partial vectorization |
| 252 | of the odd time step. |
| 253 | |
| 254 | |
| 255 | Note that all array of structures (aos) kernels might require blocking |
| 256 | (depending on the domain size) to reach the performance of their structure of |
| 257 | arrays (soa) counterparts. |
| 258 | |
| 259 | The following table summarizes the properties of the kernels. Here **D** means |
| 260 | direct addressing, i.e. full array, **I** means indirect addressing, i.e. 1D |
| 261 | vector with adjacency list, **x** means supported, whereas **--** means unsupported. |
| 262 | The loop balance B_l is computed for the D3Q19 model with double precision floating |
| 263 | point for PDFs (8 byte) and 4 byte integers for the index (adjacency list). |
| 264 | As list-aa-ria-soa and list-aa-pv-soa support run length coding, their effective |
| 265 | loop balance depends on the geometry. The effective loop balance is printed |
| 266 | during each run. |
| 267 | |
| 268 | |
| 269 | ========================= =========== =========== ===== ======== ======== ============ |
| 270 | kernel name               prop. step  data layout addr. parallel blocking B_l [B/FLUP] |
| 271 | ========================= =========== =========== ===== ======== ======== ============ |
| 272 | push-soa                  OS          SoA         D     x        --       456 |
| 273 | push-aos                  OS          AoS         D     x        --       456 |
| 274 | pull-soa                  OS          SoA         D     x        --       456 |
| 275 | pull-aos                  OS          AoS         D     x        --       456 |
| 276 | blk-push-soa              OS          SoA         D     x        x        456 |
| 277 | blk-push-aos              OS          AoS         D     x        x        456 |
| 278 | blk-pull-soa              OS          SoA         D     x        x        456 |
| 279 | blk-pull-aos              OS          AoS         D     x        x        456 |
| 280 | list-push-soa             OS          SoA         I     x        x        528 |
| 281 | list-push-aos             OS          AoS         I     x        x        528 |
| 282 | list-pull-soa             OS          SoA         I     x        x        528 |
| 283 | list-pull-aos             OS          AoS         I     x        x        528 |
| 284 | list-pull-split-nt-1s-soa OS          SoA         I     x        x        376 |
| 285 | list-pull-split-nt-2s-soa OS          SoA         I     x        x        376 |
| 286 | list-aa-soa               AA          SoA         I     x        x        340 |
| 287 | list-aa-aos               AA          AoS         I     x        x        340 |
| 288 | list-aa-ria-soa           AA          SoA         I     x        x        304-342 |
| 289 | list-aa-pv-soa            AA          SoA         I     x        x        304-342 |
| 290 | ========================= =========== =========== ===== ======== ======== ============ |
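
The fixed loop balance values in the table are consistent with a simple byte count per fluid lattice update (FLUP). The following Python sketch reproduces them under assumptions that are not stated in this document: 19 PDF loads and 19 PDF stores per update, one additional write-allocate transfer per stored PDF unless nontemporal stores or the in-place AA pattern are used, 18 adjacency list reads per update for indirect addressing, and adjacency list reads in only every second time step for the AA pattern (9 on average). The function ``loop_balance`` and its parameters are hypothetical names chosen for illustration.

```python
def loop_balance(q=19, pdf_bytes=8, index_bytes=4,
                 write_allocate=True, index_reads=0):
    """Bytes moved per fluid lattice update (B/FLUP) under the stated assumptions."""
    pdf_transfers = 2 * q          # q PDF loads + q PDF stores
    if write_allocate:
        pdf_transfers += q         # assumed: each store first loads its cache line
    return pdf_transfers * pdf_bytes + index_reads * index_bytes


balances = {
    "push/pull (direct)": loop_balance(),
    "list-push/list-pull": loop_balance(index_reads=18),
    "list-pull-split-nt": loop_balance(write_allocate=False, index_reads=18),
    # AA pattern: assumed in-place update, adjacency list read every second step
    "list-aa": loop_balance(write_allocate=False, index_reads=9),
}

for name, value in balances.items():
    print(f"{name:22s} {value} B/FLUP")
```

With ``index_reads=0`` and no write allocate the same count gives 304 B/FLUP, which matches the lower bound listed for the run length coding kernels. The sketch does not model how run length coding moves the effective value within the listed range.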
| 291 | |
| 292 | Benchmarking |
| 293 | ============ |
| 294 | |
| 295 | Correct benchmarking is a nontrivial task. Whenever benchmark results are to be |
| 296 | created, make sure that: |
| 297 | |
| 298 | - the binary was compiled with ``BENCHMARK=on`` (default if not overridden), |
| 299 | - the binary was compiled with ``BUILD=release`` (default if not overridden), |
| 300 | - the correct ISA for macros is selected via ``ISA``, and |
| 301 | - ``TARCH`` specifies the architecture the compiler generates code for. |
| 302 | |
| 303 | During benchmarking, pinning should be used via the ``-pin`` parameter. Running |
| 304 | a benchmark with 10 threads and pinning them to the first 10 cores works like this: :: |
| 305 | |
| 306 | $ bin/lbmbenchk-linux-intel-release ... -t 10 -pin $(seq -s , 0 9) |
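
Where the cores to pin to are not a contiguous range, the core list can also be generated with a small script. The following Python sketch builds a string in the ``core{,core}*`` format shown in the usage text. The helper ``pin_list`` and its ``stride`` parameter are hypothetical; which core numbers are sensible depends on the topology of the machine.

```python
def pin_list(first_core, n_threads, stride=1):
    """Comma-separated core list in the core{,core}* format taken by -pin."""
    cores = range(first_core, first_core + n_threads * stride, stride)
    return ",".join(str(core) for core in cores)


print(pin_list(0, 10))             # same list as $(seq -s , 0 9)
print(pin_list(0, 4, stride=2))    # every second core, starting at core 0
```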
| 307 | |
| 308 | Things the binary does not check or control: |
| 309 | |
| 310 | - transparent huge pages: when allocating memory, small 4 KiB pages might be |
| 311 | replaced with larger ones. This is in general a good thing, but whether it |
| 312 | actually happens depends on the system settings (check e.g. the status of |
| 313 | ``/sys/kernel/mm/transparent_hugepage/enabled``). |
| 314 | Currently ``madvise(MADV_HUGEPAGE)`` is used for allocations which are aligned to |
| 315 | a 4 KiB page, which should be the case for the lattices. |
| 316 | This should result in huge pages unless THP is disabled on the machine. |
| 317 | (NOTE: ``madvise()`` is used if ``HAVE_HUGE_PAGES`` is defined, which is currently |
| 318 | hard-coded in ``Memory.c``.) |
| 319 | |
| 320 | - CPU/core frequency: For reproducible results the frequency of all cores |
| 321 | should be fixed. |
| 322 | |
| 323 | - NUMA placement policy: The benchmark assumes a first touch policy, which |
| 324 | means the memory will be placed at the NUMA domain the touching core is |
| 325 | associated with. If a different policy is in place or the NUMA domain to be |
| 326 | used is already full memory might be allocated in a remote domain. Accesses |
| 327 | to remote domains typically have a higher latency and lower bandwidth. |
| 328 | |
| 329 | - System load: interference from other applications, especially on desktop |
| 330 | systems, should be avoided. |
| 331 | |
| 332 | - Padding: For SoA based kernels the number of (fluid) nodes is automatically |
| 333 | adjusted so that no cache or TLB thrashing should occur. The parameters are |
| 334 | optimized for current Intel based systems. For more details look into the |
| 335 | padding section. |
| 336 | |
| 337 | - CPU dispatcher function: the compiler might add different versions of a |
| 338 | function for different ISA extensions. Make sure the code you think is |
| 339 | executed is the code that is actually executed. |
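
The transparent huge page setting from the list above can be inspected before a run. The following Python sketch reads ``/sys/kernel/mm/transparent_hugepage/enabled`` and extracts the active mode, which the Linux kernel marks with square brackets (e.g. ``always [madvise] never``). The function names are hypothetical, and the sketch only reports the system setting; it does not verify that the lattices are actually backed by huge pages.

```python
import re
from pathlib import Path

THP_ENABLED = Path("/sys/kernel/mm/transparent_hugepage/enabled")


def parse_thp_mode(text):
    """Return the active mode, i.e. the word in square brackets, or None."""
    match = re.search(r"\[(\w+)\]", text)
    return match.group(1) if match else None


def thp_mode(path=THP_ENABLED):
    """Read the THP setting; returns None if the file is missing or unreadable."""
    try:
        return parse_thp_mode(path.read_text())
    except OSError:
        return None


mode = thp_mode()
if mode in ("always", "madvise"):
    print(f"THP mode '{mode}': madvise(MADV_HUGEPAGE) requests can be honored")
else:
    print(f"THP mode {mode!r}: huge pages are probably not used")
```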
| 340 | |
| 341 | Padding |
| 342 | ------- |
| 343 | |
| 344 | With correct padding, cache and TLB thrashing can be avoided. To this end the |
| 345 | number of (fluid) nodes used in the data layout is artificially increased. |
| 346 | |
| 347 | Currently automatic padding is active for kernels which support it. It can be |
| 348 | controlled via the kernel parameter (i.e. parameter after the ``--``) |
| 349 | ``-pad``. Supported values are ``auto`` (default), ``no`` (to disable padding), |
| 350 | or a manual padding. |
| 351 | |
| 352 | Automatic padding tries to avoid cache and TLB thrashing and pads for a 32 |
| 353 | entry (huge pages) TLB with 8 sets and a 512 set (L2) cache. This reflects the |
| 354 | parameters of current Intel based processors. |
| 355 | |
| 356 | Manual padding is done via a padding string and has the format |
| 357 | ``mod_1+offset_1(,mod_n+offset_n)``, where all values are numbers of bytes. |
| 358 | SoA data layouts can exhibit TLB thrashing. Therefore we want to distribute the |
| 359 | 19 pages with one lattice (36 with two lattices) we are concurrently accessing |
| 360 | over as many sets in the TLB as possible. |
| 361 | This is controlled by the distance between the accessed pages, which is the |
| 362 | number of (fluid) nodes in between them and can be adjusted by adding further |
| 363 | (fluid) nodes. |
| 364 | We want the distance d (in bytes) between two accessed pages to be e.g. |
| 365 | **d % (PAGE_SIZE * TLB_SETS) = PAGE_SIZE**. |
| 366 | This would distribute the pages evenly over the sets. Here **PAGE_SIZE * TLB_SETS** |
| 367 | is our ``mod_1`` and **PAGE_SIZE** (after the =) our ``offset_1``. |
| 368 | Measurements show that with only a quarter or half of a page size as offset |
| 369 | higher performance is achieved, which is what automatic padding does. |
| 370 | On top of this padding more paddings can be added. They are just added to the |
| 371 | padding string and are separated by commas. |
| 372 | |
| 373 | A zero modulus in the padding string has a special meaning. Here the |
| 374 | corresponding offset is just added to the number of nodes. A padding string |
| 375 | like ``-pad 0+16`` would add a static padding of two nodes (one node = 8 bytes). |
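
The effect of a padding string on the number of nodes can be illustrated with a short calculation. The following Python sketch is not the benchmark's implementation: it assumes that the distance d equals the number of nodes times 8 bytes, that the entries of the padding string are applied one after another, that moduli and offsets are multiples of the node size, and that a huge page is 2 MiB. The names ``parse_padding`` and ``padded_nodes`` are hypothetical.

```python
def parse_padding(pad):
    """Parse 'mod_1+offset_1,mod_2+offset_2' into [(mod, offset), ...] in bytes."""
    return [tuple(int(value) for value in part.split("+"))
            for part in pad.split(",")]


def padded_nodes(n_nodes, pad, node_bytes=8):
    """Grow n_nodes until (n_nodes * node_bytes) % mod == offset for each entry.

    A zero modulus adds the offset unconditionally. Assumes that mod and
    offset are multiples of node_bytes.
    """
    for mod, offset in parse_padding(pad):
        size = n_nodes * node_bytes
        extra = offset if mod == 0 else (offset - size % mod) % mod
        n_nodes += extra // node_bytes
    return n_nodes


PAGE_SIZE = 2 * 1024 * 1024    # assumed huge page size of 2 MiB
TLB_SETS = 8
mod_1 = PAGE_SIZE * TLB_SETS

n = padded_nodes(1_000_000, f"{mod_1}+{PAGE_SIZE}")
print(n, (n * 8) % mod_1 == PAGE_SIZE)    # padded node count, condition holds
print(padded_nodes(1000, "0+16"))         # static padding of two nodes
```

Because the entries are applied in sequence in this sketch, a later entry can disturb the condition established by an earlier one; how the benchmark itself combines several entries is not specified here.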
| 376 | |
| 377 | |
| 378 | Geometries |
| 379 | ========== |
| 380 | |
| 381 | TODO: supported geometries: channel, pipe, blocks |
| 382 | |
| 383 | |
| 384 | Results |
| 385 | ======= |
| 386 | |
| 387 | TODO |
| 388 | |
| 389 | |
| 390 | Licence |
| 391 | ======= |
| 392 | |
| 393 | The Lattice Boltzmann Benchmark Kernels are licensed under GPLv3. |
| 394 | |
| 395 | |
| 396 | Acknowledgements |
| 397 | ================ |
| 398 | |
| 399 | This work was funded by BMBF, grant no. 01IH15003A (project SKAMPY). |
| 400 | |
| 401 | This work was funded by KONWHIR project OMI4PAPS. |
| 402 | |
| 403 | |
| 404 | |
| 405 | .. |datetime| date:: %Y-%m-%d %H:%M |
| 406 | |
| 407 | Document was generated at |datetime|. |
| 408 | |