add citation information

[LbmBenchmarkKernelsPublic.git] / doc / main.html
diff --git a/doc/main.html b/doc/main.html

index 9f1186603c5019c8cbf6dad1ad6594bffb7d4de0..c5f822360e52fda01023e2a0ff6c43b7e56f80e4 100644
--- a/doc/main.html
+++ b/doc/main.html
@@ -401,6 +401,10 @@ tr:nth-child(odd) {
  <div class="line">Viktor Haag, 2016</div>
  <div class="line">LSS, University of Erlangen-Nuremberg, Germany</div>
  <div class="line"><br /></div>
+<div class="line">Michael Hussnaetter, 2017-2018</div>
+<div class="line">University of Erlangen-Nuremberg, Germany</div>
+<div class="line">michael.hussnaetter -at- fau.de</div>
+<div class="line"><br /></div>
  </div>
  <div class="line">This file is part of the Lattice Boltzmann Benchmark Kernels (LbmBenchKernels).</div>
  <div class="line"><br /></div>
@@ -581,14 +585,14 @@ make clean-all
  <td>Select GCC or Intel compiler.</td>
  </tr>
  <tr><td>ISA</td>
-<td>avx, sse</td>
+<td>avx512, avx, sse</td>
  <td>avx</td>
  <td>Determines which ISA extension is used for macro definitions of the intrinsics. This is <em>not</em> the architecture the compiler generates code for.</td>
  </tr>
  <tr><td>OPENMP</td>
  <td>on, off</td>
  <td>on</td>
-<td>OpenMP, i.,e.. threading support.</td>
+<td>OpenMP, i.e. threading support.</td>
  </tr>
  <tr><td>PRECISION</td>
  <td>dp, sp</td>
@@ -617,6 +621,51 @@ make clean-all
  </tr>
  </tbody>
  </table>
+<p><strong>Suboptions for <tt class="docutils literal">ISA=avx512</tt></strong></p>
+<table border="1" class="docutils">
+<colgroup>
+<col width="20%" />
+<col width="5%" />
+<col width="5%" />
+<col width="69%" />
+</colgroup>
+<thead valign="bottom">
+<tr><th class="head">name</th>
+<th class="head">values</th>
+<th class="head">default</th>
+<th class="head">description</th>
+</tr>
+</thead>
+<tbody valign="top">
+<tr><td>ADJ_LIST_MEM_TYPE</td>
+<td>HBM</td>
+<td><ul class="first last simple">
+<li></li>
+</ul>
+</td>
+<td>Determines the memory location of the adjacency list array: DRAM or HBM.</td>
+</tr>
+<tr><td>PDF_MEM_TYPE</td>
+<td>HBM</td>
+<td><ul class="first last simple">
+<li></li>
+</ul>
+</td>
+<td>Determines the memory location of the PDF array: DRAM or HBM.</td>
+</tr>
+<tr><td>SOFTWARE_PREFETCH_LOOKAHEAD_L1</td>
+<td>int &gt;= 0</td>
+<td>0</td>
+<td>Software prefetch lookahead of elements into the L1 cache. The value is multiplied by the vector size (<tt class="docutils literal">VSIZE</tt>).</td>
+</tr>
+<tr><td>SOFTWARE_PREFETCH_LOOKAHEAD_L2</td>
+<td>int &gt;= 0</td>
+<td>0</td>
+<td>Software prefetch lookahead of elements into the L2 cache. The value is multiplied by the vector size (<tt class="docutils literal">VSIZE</tt>).</td>
+</tr>
+</tbody>
+</table>
+<p>Please note that the software prefetch options require AVX-512 PF support on the target processor.</p>
  </div>
  </div>
  <div class="section" id="invocation">
@@ -637,7 +686,7 @@ LBM Benchmark Kernels 0.1, compiled Jul  5 2017 21:59:22, type: benchmark
  Usage:
  ./lbmbenchk -list
  ./lbmbenchk
-    [-dims XxYyZ] [-geometry box|channel|pipe|blocks[-&lt;block size&gt;]] [-iterations &lt;iterations&gt;] [-lattice-dump-ascii]
+    [-dims XxYxZ] [-geometry box|channel|pipe|blocks[-&lt;block size&gt;]] [-iterations &lt;iterations&gt;] [-lattice-dump-ascii]
      [-rho-in &lt;density&gt;] [-rho-out &lt;density] [-omega &lt;omega&gt;] [-kernel &lt;kernel&gt;]
      [-periodic-x]
      [-t &lt;number of threads&gt;]
@@ -952,13 +1001,14 @@ created make sure the binary was compiled with:</p>
  <ul class="simple">
  <li><tt class="docutils literal">BENCHMARK=on</tt> (default if not overridden) and</li>
  <li><tt class="docutils literal">BUILD=release</tt> (default if not overridden) and</li>
-<li>the correct ISA for macros is used, selected via <tt class="docutils literal">ISA</tt> and</li>
+<li>the correct ISA for macros (i.e. intrinsics) is used, selected via <tt class="docutils literal">ISA</tt> and</li>
  <li>use <tt class="docutils literal">TARCH</tt> to specify the architecture the compiler generates code for.</li>
  </ul>
  <div class="section" id="intel-compiler">
  <h2><a class="toc-backref" href="#id18">4.1&nbsp;&nbsp;&nbsp;Intel Compiler</a></h2>
  <p>For the Intel compiler one can specify depending on the target ISA extension:</p>
  <ul class="simple">
+<li>SSE:          <tt class="docutils literal"><span class="pre">TARCH=-xSSE4.2</span></tt></li>
  <li>AVX:          <tt class="docutils literal"><span class="pre">TARCH=-xAVX</span></tt></li>
  <li>AVX2 and FMA: <tt class="docutils literal"><span class="pre">TARCH=-xCORE-AVX2,-fma</span></tt></li>
  <li>AVX512:       <tt class="docutils literal"><span class="pre">TARCH=-xCORE-AVX512</span></tt></li>
@@ -974,12 +1024,26 @@ make ISA=avx TARCH=-xCORE-AVX2,-fma
  </pre>
  <p>WARNING: ISA is still set to <tt class="docutils literal">avx</tt> here, as the FMA intrinsics are currently not
  implemented. This might change in the future.</p>
+<!-- TODO: add ISA=avx512 and add documentation for KNL -->
+<!-- TODO: no prefetching if AVX-512 PF is not supported -->
  <p>Compiling for an architecture supporting AVX-512 (Skylake):</p>
  <pre class="literal-block">
-make ISA=avx TARCH=-xCORE-AVX512
+make ISA=avx512 TARCH=-xCORE-AVX512
+</pre>
+<p>Please note that for the AVX-512 gather kernels, software prefetching for the
+gather instructions is disabled by default.
+To enable it, set <tt class="docutils literal">SOFTWARE_PREFETCH_LOOKAHEAD_L1</tt> and/or
+<tt class="docutils literal">SOFTWARE_PREFETCH_LOOKAHEAD_L2</tt> to a value greater than <tt class="docutils literal">0</tt> during
+compilation. Note that this requires AVX-512 PF support from the target
+processor.</p>
+<p>Compiling for the MIC architecture KNL, which supports AVX-512 and AVX-512 PF:</p>
+<pre class="literal-block">
+make ISA=avx512 TARCH=-xMIC-AVX512
+</pre>
+<p>or optionally with software prefetch enabled:</p>
+<pre class="literal-block">
+make ISA=avx512 TARCH=-xMIC-AVX512 SOFTWARE_PREFETCH_LOOKAHEAD_L1=&lt;value&gt; SOFTWARE_PREFETCH_LOOKAHEAD_L2=&lt;value&gt;
  </pre>
-<p>WARNING: ISA is here still set to <tt class="docutils literal">avx</tt> as currently we have no implementation for the
-AVX512 intrinsics. This might change in the future.</p>
  </div>
  <div class="section" id="pinning">
  <h2><a class="toc-backref" href="#id19">4.2&nbsp;&nbsp;&nbsp;Pinning</a></h2>
@@ -1105,7 +1169,6 @@ which mimics the kernels memory access pattern and the kernel's loop balance
  </li>
  </ul>
  <p><strong>Skylake, Intel Xeon Gold 6148</strong></p>
-<p>NOTE: currently we only use AVX2 intrinsics.</p>
  <ul class="simple">
  <li>Skylake server architecture, AVX2, AVX512, 2 FMA units</li>
  <li>20 cores, 2.4 GHz</li>
@@ -1177,11 +1240,11 @@ which mimics the kernels memory access pattern and the kernel's loop balance
  </tr>
  <tr><td><img alt="perf_meggie_sp" src="images/benchmark-meggie-sp.png" style="width: 1000.0px; height: 250.0px;" /></td>
  </tr>
-<tr><td>Skylake, Intel Xeon Gold 6148, Double Precision, <strong>NOTE: currently we only use AVX2 intrinsics.</strong></td>
+<tr><td>Skylake, Intel Xeon Gold 6148, Double Precision</td>
  </tr>
  <tr><td><img alt="perf_skylakesp2_dp" src="images/benchmark-skylakesp2-dp.png" style="width: 1000.0px; height: 250.0px;" /></td>
  </tr>
-<tr><td>Skylake, Intel Xeon Gold 6148, Single Precision, <strong>NOTE: currently we only use AVX2 intrinsics.</strong></td>
+<tr><td>Skylake, Intel Xeon Gold 6148, Single Precision</td>
  </tr>
  <tr><td><img alt="perf_skylakesp2_sp" src="images/benchmark-skylakesp2-sp.png" style="width: 1000.0px; height: 250.0px;" /></td>
  </tr>
@@ -1211,6 +1274,21 @@ which mimics the kernels memory access pattern and the kernel's loop balance
  </div>
  <div class="section" id="acknowledgements">
  <h1><a class="toc-backref" href="#id27">8&nbsp;&nbsp;&nbsp;Acknowledgements</a></h1>
+<p>If you use the benchmark kernels, you can cite us:</p>
+<p>M. Wittmann, V. Haag, T. Zeiser, H. Köstler, and G. Wellein: Lattice Boltzmann
+Benchmark Kernels as a Testbed for Performance Analysis. Computers &amp; Fluids,
+Special Issue DSFD2017 (2018). doi:10.1016/j.compfluid.2018.03.030.</p>
+<p>Bibtex entry:</p>
+<pre class="literal-block">
+&#64;article{wittmann-2018,
+    author  = {M. Wittmann and V. Haag and T. Zeiser and H. K\&quot;ostler and G. Wellein},
+    title   = {Lattice {B}oltzmann benchmark kernels as a testbed for performance analysis},
+    journal = {Computers \&amp; Fluids},
+    year    = {2018},
+    issn    = {0045-7930},
+    doi     = {10.1016/j.compfluid.2018.03.030},
+}
+</pre>
  <p>This work was funded by BMBF, grant no. 01IH15003A (project SKAMPY).</p>
  <p>This work was funded by KONWHIR project OMI4PAPS.</p>
  </div>
@@ -1232,7 +1310,7 @@ Roofline: an insightful visual performance model for multicore architectures.
  Commun. ACM, 52(4):65-76, Apr 2009. doi:10.1145/1498765.1498785</td></tr>
  </tbody>
  </table>
-<p>Document was generated at 2018-01-09 11:54.</p>
+<p>Document was generated at 2018-06-06 10:38.</p>
  </div>
  </div>
  </body>