X-Git-Url: http://git.rrze.uni-erlangen.de/gitweb/?p=LbmBenchmarkKernelsPublic.git;a=blobdiff_plain;f=doc%2Fhtml%2Fmain.html;fp=doc%2Fhtml%2Fmain.html;h=511f6b2dcf0bfd7cd2be208fc20e3d3f83c391e3;hp=36579ccbf7719cfdebdce12ad9b4cd4732c548ba;hb=e3f82424829ebb623343ce0092238f83b4a1b8c2;hpb=ecf590ae9bb13ba2b2f01c3bf7a53056a8b1467b diff --git a/doc/html/main.html b/doc/html/main.html index 36579cc..511f6b2 100644 --- a/doc/html/main.html +++ b/doc/html/main.html @@ -335,6 +335,56 @@ h4 tt.docutils, h5 tt.docutils, h6 tt.docutils { ul.auto-toc { list-style-type: none } + + @@ -375,15 +425,23 @@ ul.auto-toc {
  • 1.2   Benchmarking
  • 1.3   Release and Verification
  • 1.4   Compilers
  • -
  • 1.5   Options Summary
  • +
  • 1.5   Cleaning
  • +
  • 1.6   Options Summary
  • -
  • 2   Invocation
    @@ -399,15 +457,15 @@ depending on compiler and build configuration.

    1.1   Debug and Verification

    -make
    +make BUILD=debug BENCHMARK=off
     
    -

    Running make without any arguments builds the debug version (BUILD=debug) of +

    Running make with BUILD=debug builds the debug version of the benchmark kernels: no optimizations are performed, line numbers and debug symbols are included, and DEBUG is defined. The resulting binary is placed in the bin subdirectory and named lbmbenchk-linux-<compiler>-debug.

    -

    Without any further specification the binary includes verification -(VERIFICATION=on), statistics (STATISTICS), and VTK output +

    Specifying BENCHMARK=off turns on verification +(VERIFICATION=on), statistics (STATISTICS=on), and VTK output (VTK_OUTPUT=on).

    Please note that the generated binary will therefore exhibit poor performance.

    @@ -416,9 +474,10 @@ exhibit a poor performance.

    1.2   Benchmarking

    To generate a binary for benchmarking run make with

    -make BENCHMARK=on BUILD=release
    +make
     
    -

    Here BUILD=release turns optimizations on and BENCHMARK=on disables +

    By default BENCHMARK=on and BUILD=release are set, where +BUILD=release turns optimizations on and BENCHMARK=on disables verification, statistics, and VTK output.

    @@ -426,7 +485,7 @@ verfification, statistics, and VTK output.

    Verification with the debug builds can be extremely slow. Hence verification capabilities can also be built into release builds:

    -make BUILD=release
    +make BENCHMARK=off
     
    @@ -435,8 +494,23 @@ make BUILD=release both configuration can be chosen via CONFIG=linux-gcc or CONFIG=linux-intel.

    +
    +

    1.5   Cleaning

    +

    For each configuration and build (debug/release) a subdirectory under the +src/obj directory is created where the dependency and object files are +stored. +With

    +
    +make CONFIG=... BUILD=... clean
    +
    +

    a specific combination is selected and cleaned, whereas with

    +
    +make clean-all
    +
    +

    all object and dependency files are deleted.

    +
    -

    1.5   Options Summary

    +

    1.6   Options Summary

    Options that can be specified when building the framework with make:

    @@ -451,19 +525,14 @@ both configuration can be chosen via
    kernel name               prop. step  data layout  addr.  parallel  blocking  B_l [B/FLUP]
    push-soa                  OS          SoA          D      x         --        456
    push-aos                  OS          AoS          D      x         --        456
    pull-soa                  OS          SoA          D      x         --        456
    pull-aos                  OS          AoS          D      x         --        456
    blk-push-soa              OS          SoA          D      x         x         456
    blk-push-aos              OS          AoS          D      x         x         456
    blk-pull-soa              OS          SoA          D      x         x         456
    blk-pull-aos              OS          AoS          D      x         x         456
    list-push-soa             OS          SoA          I      x         x         528
    list-push-aos             OS          AoS          I      x         x         528
    list-pull-soa             OS          SoA          I      x         x         528
    list-pull-aos             OS          AoS          I      x         x         528
    list-pull-split-nt-1s     OS          SoA          I      x         x         376
    list-pull-split-nt-2s     OS          SoA          I      x         x         376
    list-aa-soa               AA          SoA          I      x         x         340
    list-aa-aos               AA          AoS          I      x         x         340
    list-aa-ria-soa           AA          SoA          I      x         x         304-342
    list-aa-pv-soa            AA          SoA          I      x         x         304-342
    +
    -

    3   Benchmarking

    +

    3   Benchmarking

    Correct benchmarking is a nontrivial task. Whenever benchmark results should be created make sure the binary was compiled with:

      -
    • BENCHMARK=on and
    • -
    • BUILD=release and
    • +
    • BENCHMARK=on (default if not overridden) and
    • +
    • BUILD=release (default if not overridden) and
    • the correct ISA for macros is used, selected via ISA and
    • use TARCH to specify the architecture the compiler generates code for.
    @@ -594,7 +881,13 @@ $ bin/lbmbenchk-linux-intel-release ... -t 10 -pin $(seq -s , 0 9)
    • transparent huge pages: when allocating memory small 4 KiB pages might be replaced with larger ones. This is in general a good thing, but if this is -really the case, depends on the system settings.
    • +really the case, depends on the system settings (check e.g. the status of +/sys/kernel/mm/transparent_hugepage/enabled). +Currently madvise(MADV_HUGEPAGE) is used for allocations which are aligned to +a 4 KiB page, which should be the case for the lattices. +This should result in huge pages unless THP is disabled on the machine. +(NOTE: madvise() is used if HAVE_HUGE_PAGES is defined, which is currently +hard-coded in Memory.c).
    • CPU/core frequency: For reproducible results the frequency of all cores should be fixed.
    • NUMA placement policy: The benchmark assumes a first touch policy, which @@ -604,19 +897,63 @@ used is already full memory might be allocated in a remote domain. Accesses to remote domains typically have a higher latency and lower bandwidth.
    • System load: interference with other applications, especially on desktop systems, should be avoided.
    • -
    • Padding: most kernels do not care about padding against cache or TLB -thrashing. Even if the number of (fluid) nodes suggest everything is fine, -through parallelization still problems might occur.
    • +
    • Padding: For SoA based kernels the number of (fluid) nodes is automatically +adjusted so that no cache or TLB thrashing should occur. The parameters are +optimized for current Intel based systems. For more details look into the +padding section.
    • CPU dispatcher function: the compiler might add different versions of a function for different ISA extensions. Make sure the code you think is executed is actually the code that is executed.
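The transparent huge page setting mentioned in the list above can be checked before a benchmark run. The following is a minimal sketch, not part of the benchmark itself; the helper names `parse_thp_mode` and `thp_mode` are hypothetical. It relies on the Linux convention that the status file lists all modes and marks the active one with square brackets, e.g. `always [madvise] never`.

```python
import re
from pathlib import Path

# Status file named in the text above; only present on Linux with THP support.
THP_ENABLED = Path("/sys/kernel/mm/transparent_hugepage/enabled")


def parse_thp_mode(text):
    """Return the active THP mode, i.e. the word enclosed in square brackets."""
    match = re.search(r"\[(\w+)\]", text)
    if match is None:
        raise ValueError(f"unexpected THP status: {text!r}")
    return match.group(1)


def thp_mode(path=THP_ENABLED):
    """Read the current THP mode, or return None if the file cannot be read."""
    try:
        return parse_thp_mode(path.read_text())
    except OSError:
        return None


if __name__ == "__main__":
    # "always" and "madvise" allow madvise(MADV_HUGEPAGE) to yield huge pages,
    # "never" disables THP entirely.
    print("THP mode:", thp_mode())
```

If the reported mode is `never`, the madvise() call described above cannot produce huge pages, and results may differ from those of a machine where THP is enabled.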
    +
    +

    3.1   Padding

    +

    With correct padding cache and TLB thrashing can be avoided. Therefore the +number of (fluid) nodes used in the data layout is artificially increased.

    +

    Currently automatic padding is active for kernels which support it. It can be +controlled via the kernel parameter (i.e. parameter after the --) +-pad. Supported values are auto (default), no (to disable padding), +or a manual padding.

    +

    Automatic padding tries to avoid cache and TLB thrashing and pads for a 32 +entry (huge pages) TLB with 8 sets and a 512 set (L2) cache. This reflects the +parameters of current Intel based processors.

    +

    Manual padding is done via a padding string and has the format +mod_1+offset_1(,mod_n+offset_n), which specifies numbers of bytes. +SoA data layouts can exhibit TLB thrashing. Therefore we want to distribute the +19 pages with one lattice (36 with two lattices) we are concurrently accessing +over as many sets in the TLB as possible. +This is controlled by the distance between the accessed pages, which is the +number of (fluid) nodes in between them and can be adjusted by adding further +(fluid) nodes. +We want the distance d (in bytes) between two accessed pages to be e.g. +d % (PAGE_SIZE * TLB_SETS) = PAGE_SIZE. +This would distribute the pages evenly over the sets. Here PAGE_SIZE * TLB_SETS +would be our mod_1 and PAGE_SIZE (after the =) our offset_1. +Measurements show that with only a quarter or half of a page size as offset +higher performance is achieved, which is done by automatic padding. +On top of this padding more paddings can be added. They are just added to the +padding string and are separated by commas.
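The condition d % mod_1 = offset_1 can be illustrated with a small calculation. This is a sketch under stated assumptions, not the benchmark's actual implementation: the function name `padded_nodes` is hypothetical, the huge page size is assumed to be 2 MiB, and the distance d is assumed to be the number of nodes times 8 bytes per node.

```python
PAGE_SIZE = 2 * 1024 * 1024  # assumed huge page size in bytes
TLB_SETS = 8                 # number of TLB sets named in the text
NODE_BYTES = 8               # one node = 8 bytes


def padded_nodes(n_nodes, mod, offset, node_bytes=NODE_BYTES):
    """Smallest node count >= n_nodes with (count * node_bytes) % mod == offset."""
    if mod <= 0 or not 0 <= offset < mod:
        raise ValueError("need mod > 0 and 0 <= offset < mod")
    if mod % node_bytes or offset % node_bytes:
        raise ValueError("mod and offset must be multiples of the node size")
    distance = n_nodes * node_bytes
    # Bytes missing until the distance hits the requested residue.
    extra_bytes = (offset - distance) % mod
    return n_nodes + extra_bytes // node_bytes


if __name__ == "__main__":
    n = padded_nodes(1_000_000, PAGE_SIZE * TLB_SETS, PAGE_SIZE)
    print(n, (n * NODE_BYTES) % (PAGE_SIZE * TLB_SETS) == PAGE_SIZE)
```

With these assumed values, one million nodes are padded to 2359296 nodes, so that consecutive arrays start exactly one page past a multiple of PAGE_SIZE * TLB_SETS bytes.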

    +

    A zero modulus in the padding string has a special meaning. Here the +corresponding offset is just added to the number of nodes. A padding string +like -pad 0+16 would add a static padding of two nodes (one node = 8 bytes).
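The padding string format and the zero-modulus rule can be sketched as follows. The function names `parse_padding` and `static_padding_nodes` are hypothetical and the code is only an illustration of the format described above, not the parser used by the benchmark. How several non-zero moduli are combined is not specified here and is therefore left out.

```python
NODE_BYTES = 8  # one node = 8 bytes


def parse_padding(spec):
    """Split 'mod_1+offset_1,mod_2+offset_2,...' into (mod, offset) byte pairs."""
    entries = []
    for item in spec.split(","):
        mod_text, sep, offset_text = item.partition("+")
        if not sep:
            raise ValueError(f"expected 'mod+offset', got {item!r}")
        entries.append((int(mod_text), int(offset_text)))
    return entries


def static_padding_nodes(entries, node_bytes=NODE_BYTES):
    """Nodes added by zero-modulus entries, whose offsets are added directly."""
    static_bytes = sum(offset for mod, offset in entries if mod == 0)
    return static_bytes // node_bytes


if __name__ == "__main__":
    entries = parse_padding("0+16")
    print(entries, static_padding_nodes(entries))
```

For the string `0+16` this yields a single entry with modulus 0 and offset 16 bytes, i.e. two additional nodes.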

    +
    +
    +
    +

    4   Geometries

    +

    TODO: supported geometries: channel, pipe, blocks

    +
    +
    +

    5   Results

    +

    TODO

    +
    +
    +

    6   Licence

    +

    The Lattice Boltzmann Benchmark Kernels are licensed under GPLv3.

    -

    4   Acknowledgements

    +

    7   Acknowledgements

    This work was funded by BMBF, grant no. 01IH15003A (project SKAMPY).

    This work was funded by KONWHIR project OMI4PAPS.

    -

    Document was generated at 2017-10-26 09:43.

    +

    Document was generated at 2017-11-02 15:33.