<div class="line">Viktor Haag, 2016</div>
<div class="line">LSS, University of Erlangen-Nuremberg, Germany</div>
<div class="line"><br /></div>
+<div class="line">Michael Hussnaetter, 2017-2018</div>
+<div class="line">University of Erlangen-Nuremberg, Germany</div>
+<div class="line">michael.hussnaetter -at- fau.de</div>
+<div class="line"><br /></div>
</div>
<div class="line">This file is part of the Lattice Boltzmann Benchmark Kernels (LbmBenchKernels).</div>
<div class="line"><br /></div>
<td>Select GCC or Intel compiler.</td>
</tr>
<tr><td>ISA</td>
-<td>avx, sse</td>
+<td>avx512, avx, sse</td>
<td>avx</td>
<td>Determines which ISA extension is used for macro definitions of the intrinsics. This is <em>not</em> the architecture the compiler generates code for.</td>
</tr>
<tr><td>OPENMP</td>
<td>on, off</td>
<td>on</td>
-<td>OpenMP, i.,e.. threading support.</td>
+<td>OpenMP, i.e. threading support.</td>
</tr>
<tr><td>PRECISION</td>
<td>dp, sp</td>
</tr>
</tbody>
</table>
+<p><strong>Suboptions for <tt class="docutils literal">ISA=avx512</tt></strong></p>
+<table border="1" class="docutils">
+<colgroup>
+<col width="20%" />
+<col width="5%" />
+<col width="5%" />
+<col width="69%" />
+</colgroup>
+<thead valign="bottom">
+<tr><th class="head">name</th>
+<th class="head">values</th>
+<th class="head">default</th>
+<th class="head">description</th>
+</tr>
+</thead>
+<tbody valign="top">
+<tr><td>ADJ_LIST_MEM_TYPE</td>
+<td>HBM</td>
+<td><ul class="first last simple">
+<li></li>
+</ul>
+</td>
+<td>Determines memory location of adjacency list array, DRAM or HBM.</td>
+</tr>
+<tr><td>PDF_MEM_TYPE</td>
+<td>HBM</td>
+<td><ul class="first last simple">
+<li></li>
+</ul>
+</td>
+<td>Determines memory location of PDF array, DRAM or HBM.</td>
+</tr>
+<tr><td>SOFTWARE_PREFETCH_LOOKAHEAD_L1</td>
+<td>int >= 0</td>
+<td>0</td>
+<td>Software prefetch lookahead of elements into L1 cache, value is multiplied by vector size (<tt class="docutils literal">VSIZE</tt>).</td>
+</tr>
+<tr><td>SOFTWARE_PREFETCH_LOOKAHEAD_L2</td>
+<td>int >= 0</td>
+<td>0</td>
+<td>Software prefetch lookahead of elements into L2 cache, value is multiplied by vector size (<tt class="docutils literal">VSIZE</tt>).</td>
+</tr>
+</tbody>
+</table>
+<p>Please note that these options require AVX-512 PF support on the target processor.</p>
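<p>The relation between a lookahead option and the resulting prefetch distance can be sketched as follows. This is a minimal illustration, not the code used by the kernels: the helper names <tt class="docutils literal">prefetch_distance</tt> and <tt class="docutils literal">sum_with_prefetch</tt> are hypothetical, <tt class="docutils literal">VSIZE</tt> is assumed to be 8 (doubles per 512-bit vector), and the portable GCC/Clang builtin <tt class="docutils literal">__builtin_prefetch</tt> stands in for the AVX-512 PF prefetch instructions.</p>

```c
#include <stddef.h>

/* Assumption for illustration: 8 doubles fit into one 512-bit vector. */
#ifndef VSIZE
#define VSIZE 8
#endif

/* Prefetch distance in elements: the option value multiplied by VSIZE. */
static size_t prefetch_distance(size_t lookahead, size_t vsize)
{
    return lookahead * vsize;
}

/* Hypothetical loop that walks the data in steps of VSIZE elements and
 * prefetches the element that lies `lookahead` vectors ahead.
 * A lookahead of 0 disables prefetching, matching the option default. */
static double sum_with_prefetch(const double *data, size_t n, size_t lookahead)
{
    const size_t dist = prefetch_distance(lookahead, VSIZE);
    double sum = 0.0;

    for (size_t i = 0; i < n; i += VSIZE) {
        /* Only prefetch addresses that are still inside the array. */
        if (dist > 0 && i + dist < n) {
            __builtin_prefetch(&data[i + dist], 0 /* read */, 3 /* keep in cache */);
        }
        /* Process one vector worth of elements (scalar stand-in). */
        for (size_t j = i; j < i + VSIZE && j < n; ++j) {
            sum += data[j];
        }
    }
    return sum;
}
```

<p>Under these assumptions, a lookahead value of 4 corresponds to a prefetch distance of 4 * 8 = 32 elements. Prefetching only affects timing, so the computed result is the same for any lookahead value.</p>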
</div>
</div>
<div class="section" id="invocation">
Usage:
./lbmbenchk -list
./lbmbenchk
- [-dims XxYyZ] [-geometry box|channel|pipe|blocks[-<block size>]] [-iterations <iterations>] [-lattice-dump-ascii]
+ [-dims XxYxZ] [-geometry box|channel|pipe|blocks[-<block size>]] [-iterations <iterations>] [-lattice-dump-ascii]
[-rho-in <density>] [-rho-out <density>] [-omega <omega>] [-kernel <kernel>]
[-periodic-x]
[-t <number of threads>]
<ul class="simple">
<li><tt class="docutils literal">BENCHMARK=on</tt> (default if not overridden) and</li>
<li><tt class="docutils literal">BUILD=release</tt> (default if not overridden) and</li>
-<li>the correct ISA for macros is used, selected via <tt class="docutils literal">ISA</tt> and</li>
+<li>the correct ISA for macros (i.e. intrinsics) is used, selected via <tt class="docutils literal">ISA</tt> and</li>
<li>use <tt class="docutils literal">TARCH</tt> to specify the architecture the compiler generates code for.</li>
</ul>
<div class="section" id="intel-compiler">
<h2><a class="toc-backref" href="#id18">4.1 Intel Compiler</a></h2>
<p>For the Intel compiler, specify one of the following, depending on the target ISA extension:</p>
<ul class="simple">
+<li>SSE: <tt class="docutils literal"><span class="pre">TARCH=-xSSE4.2</span></tt></li>
<li>AVX: <tt class="docutils literal"><span class="pre">TARCH=-xAVX</span></tt></li>
<li>AVX2 and FMA: <tt class="docutils literal"><span class="pre">TARCH=-xCORE-AVX2,-fma</span></tt></li>
<li>AVX512: <tt class="docutils literal"><span class="pre">TARCH=-xCORE-AVX512</span></tt></li>
</pre>
<p>WARNING: <tt class="docutils literal">ISA</tt> is still set to <tt class="docutils literal">avx</tt> here, as the FMA intrinsics are
currently not implemented. This might change in the future.</p>
+<!-- TODO: add isa=avx512 and add docu for knl -->
+<!-- TODO: no prefetching if AVX-512 PF is not supported -->
<p>Compiling for an architecture supporting AVX-512 (Skylake):</p>
<pre class="literal-block">
-make ISA=avx TARCH=-xCORE-AVX512
+make ISA=avx512 TARCH=-xCORE-AVX512
+</pre>
+<p>Please note that for the AVX512 gather kernels, software prefetching for the
+gather instructions is disabled by default.
+To enable it, set <tt class="docutils literal">SOFTWARE_PREFETCH_LOOKAHEAD_L1</tt> and/or
+<tt class="docutils literal">SOFTWARE_PREFETCH_LOOKAHEAD_L2</tt> to a value greater than <tt class="docutils literal">0</tt> during
+compilation. Note that this requires AVX-512 PF support from the target
+processor.</p>
+<p>Compiling for MIC architecture KNL supporting AVX-512 and AVX-512 PF:</p>
+<pre class="literal-block">
+make ISA=avx512 TARCH=-xMIC-AVX512
+</pre>
+<p>or optionally with software prefetch enabled:</p>
+<pre class="literal-block">
+make ISA=avx512 TARCH=-xMIC-AVX512 SOFTWARE_PREFETCH_LOOKAHEAD_L1=<value> SOFTWARE_PREFETCH_LOOKAHEAD_L2=<value>
</pre>
-<p>WARNING: ISA is here still set to <tt class="docutils literal">avx</tt> as currently we have no implementation for the
-AVX512 intrinsics. This might change in the future.</p>
</div>
<div class="section" id="pinning">
<h2><a class="toc-backref" href="#id19">4.2 Pinning</a></h2>
</li>
</ul>
<p><strong>Skylake, Intel Xeon Gold 6148</strong></p>
-<p>NOTE: currently we only use AVX2 intrinsics.</p>
<ul class="simple">
<li>Skylake server architecture, AVX2, AVX512, 2 FMA units</li>
<li>20 cores, 2.4 GHz</li>
</tr>
<tr><td><img alt="perf_meggie_sp" src="images/benchmark-meggie-sp.png" style="width: 1000.0px; height: 250.0px;" /></td>
</tr>
-<tr><td>Skylake, Intel Xeon Gold 6148, Double Precision, <strong>NOTE: currently we only use AVX2 intrinsics.</strong></td>
+<tr><td>Skylake, Intel Xeon Gold 6148, Double Precision</td>
</tr>
<tr><td><img alt="perf_skylakesp2_dp" src="images/benchmark-skylakesp2-dp.png" style="width: 1000.0px; height: 250.0px;" /></td>
</tr>
-<tr><td>Skylake, Intel Xeon Gold 6148, Single Precision, <strong>NOTE: currently we only use AVX2 intrinsics.</strong></td>
+<tr><td>Skylake, Intel Xeon Gold 6148, Single Precision</td>
</tr>
<tr><td><img alt="perf_skylakesp2_sp" src="images/benchmark-skylakesp2-sp.png" style="width: 1000.0px; height: 250.0px;" /></td>
</tr>
</div>
<div class="section" id="acknowledgements">
<h1><a class="toc-backref" href="#id27">8 Acknowledgements</a></h1>
+<p>If you use the benchmark kernels you can cite us:</p>
+<p>M. Wittmann, V. Haag, T. Zeiser, H. Köstler, and G. Wellein: Lattice Boltzmann
+Benchmark Kernels as a Testbed for Performance Analysis, (2018), Computers &
+Fluids, Special Issue DSFD2017. doi:10.1016/j.compfluid.2018.03.030.</p>
+<p>Bibtex entry:</p>
+<pre class="literal-block">
+@article{wittmann-2018,
+ author = {M. Wittmann and V. Haag and T. Zeiser and H. K\"ostler and G. Wellein},
+ title = {Lattice {B}oltzmann benchmark kernels as a testbed for performance analysis},
+ journal = {Computers \& Fluids},
+ year = {2018},
+ issn = {0045-7930},
+ doi = {10.1016/j.compfluid.2018.03.030},
+}
+</pre>
<p>This work was funded by BMBF, grant no. 01IH15003A (project SKAMPY).</p>
<p>This work was funded by KONWHIR project OMI4PAPS.</p>
</div>
Commun. ACM, 52(4):65-76, Apr 2009. doi:10.1145/1498765.1498785</td></tr>
</tbody>
</table>
-<p>Document was generated at 2018-01-09 11:54.</p>
+<p>Document was generated at 2018-06-06 10:38.</p>
</div>
</div>
</body>