From: Markus Wittmann Date: Wed, 10 Jan 2018 13:25:18 +0000 (+0100) Subject: add single precision, add aa-vec-sl-soa kernel, updated doc X-Git-Url: http://git.rrze.uni-erlangen.de/gitweb/?p=LbmBenchmarkKernelsPublic.git;a=commitdiff_plain;h=0fde6e45e9be83893afae896cf49a799777f6d7c add single precision, add aa-vec-sl-soa kernel, updated doc - Binaries have now a -dp or -sp suffix, depending on whether they have been compiled for double or single precision. - New kernel for full array aa-vec-sl-soa added. Only one loop over the lattice used. - Documentation has been updated, including how to build single precision binaries and performance graphs on various architectures. --- diff --git a/README b/README index e97abb0..d1118ca 100644 --- a/README +++ b/README @@ -11,7 +11,7 @@ EXPERIMENTS. The benchmark suite was created by the HPC group of Erlangen Regional Computing Center (RRZE/HPC) [1] and the Chair for System Simulation of FAU (LSS) [2]. -See doc/main.rst or doc/html/main.html subdirectories for rudimentary +See doc/main.rst or doc/main.html subdirectories for rudimentary documentation. diff --git a/doc/Makefile b/doc/Makefile index 6417a6e..b059a99 100644 --- a/doc/Makefile +++ b/doc/Makefile @@ -1,7 +1,7 @@ # -------------------------------------------------------------------------- # # Copyright -# Markus Wittmann, 2016-2017 +# Markus Wittmann, 2016-2018 # RRZE, University of Erlangen-Nuremberg, Germany # markus.wittmann -at- fau.de or hpc -at- rrze.fau.de # @@ -33,6 +33,5 @@ all: main main: main.rst #main.css [ -d html ] || mkdir -p html -# rst2html --stylesheet=html4css1.css,main.css $< html/$@.html - rst2html --stylesheet=html4css1.css,main.css $< html/$@.html + rst2html --stylesheet=html4css1.css,main.css $< $@.html diff --git a/doc/html/main.html b/doc/html/main.html deleted file mode 100644 index 99f4cb8..0000000 --- a/doc/html/main.html +++ /dev/null @@ -1,1910 +0,0 @@ - - - - - - -LBM Benchmark Kernels Documentation - - - - -
-

LBM Benchmark Kernels Documentation

- - - -
-

1   Introduction

-

The lattice Boltzmann (LBM) benchmark kernels are a collection of LBM kernel -implementations.

-

AS SUCH THE LBM BENCHMARK KERNELS ARE NO FULLY EQUIPPED CFD SOLVER AND SOLELY -SERVES THE PURPOSE OF STUDYING POSSIBLE PERFORMANCE OPTIMIZATIONS AND/OR -EXPERIMENTS.

-

Currently all kernels utilize a D3Q19 discretization and the -two-relaxation-time (TRT) collision operator [ginzburg-2008]. -All operations are carried out in double precision arithmetic.

-
-
-

2   Compilation

-

The benchmark framework currently supports only Linux systems and the GCC and -Intel compilers. Every other configuration probably requires adjustment inside -the code and the makefiles. Furthermore some code might be platform or at least -POSIX specific.

-

The benchmark can be built via make from the src subdirectory. This will -generate one binary which hosts all implemented benchmark kernels.

-

Binaries are located under the bin subdirectory and will have different names -depending on compiler and build configuration.

-

Compilation can target debug or release builds. Combined with both build types -verification can be enabled, which increases the runtime and hence is not -suited for benchmarking.

-
-

2.1   Debug and Verification

-
-make BUILD=debug BENCHMARK=off
-
-

Running make with BUILD=debug builds the debug version of -the benchmark kernels, where no optimizations are performed, line numbers and -debug symbols are included, and DEBUG is defined. The resulting -binary will be found in the bin subdirectory and named -lbmbenchk-linux-<compiler>-debug.

-

Specifying BENCHMARK=off turns on verification -(VERIFICATION=on), statistics (STATISTICS=on), and VTK output -(VTK_OUTPUT=on).

-

Please note that the generated binary will therefore -exhibit poor performance.

-
-
-

2.2   Release and Verification

-

Verification with the debug builds can be extremely slow. Hence verification -capabilities can also be built into release builds:

-
-make BENCHMARK=off
-
-
-
-

2.3   Benchmarking

-

To generate a binary for benchmarking run make with

-
-make
-
-

By default BENCHMARK=on and BUILD=release are set, where -BUILD=release turns optimizations on and BENCHMARK=on disables -verification, statistics, and VTK output.

-

See Options Summary below for further description of options which can be -applied, e.g. TARCH as well as the Benchmarking section.

-
-
-

2.4   Compilers

-

Currently only the GCC and Intel compilers under Linux are supported. The -configuration is chosen via CONFIG=linux-gcc or -CONFIG=linux-intel.

-
-
-

2.5   Cleaning

-

For each configuration and build (debug/release) a subdirectory under the -src/obj directory is created where the dependency and object files are -stored. -With

-
-make CONFIG=... BUILD=... clean
-
-

a specific combination is selected and cleaned, whereas with

-
-make clean-all
-
-

all object and dependency files are deleted.

-
-
-

2.6   Options Summary

-

Options that can be specified when building the suite with make:

- ------ - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
name | values | default | description
BENCHMARK | on, off | on | If enabled, disables VERIFICATION, STATISTICS, VTK_OUTPUT. If disabled enables the three former options.
BUILD | debug, release | release | debug: no optimization, debug symbols, DEBUG defined. release: optimizations enabled.
CONFIG | linux-gcc, linux-intel | linux-intel | Select GCC or Intel compiler.
ISA | avx, sse | avx | Determines which ISA extension is used for macro definitions of the intrinsics. This is not the architecture the compiler generates code for.
OPENMP | on, off | on | OpenMP, i.e. threading support.
STATISTICS | on, off | off | View statistics, like density etc., during simulation.
TARCH | -- | -- | Via TARCH the architecture the compiler generates code for can be overridden. The value depends on the chosen compiler.
VERIFICATION | on, off | off | Turn verification on/off.
VTK_OUTPUT | on, off | off | Enable/Disable VTK file output.
-
-
-
-

3   Invocation

-

Running the binary will print, along with the GPL licence header, a line like the following:

-
-LBM Benchmark Kernels 0.1, compiled Jul  5 2017 21:59:22, type: verification
-
-

if verification was enabled during compilation or

-
-LBM Benchmark Kernels 0.1, compiled Jul  5 2017 21:59:22, type: benchmark
-
-

if verification was disabled during compilation.

-
-

3.1   Command Line Parameters

-

Running the binary with -h lists all available parameters:

-
-Usage:
-./lbmbenchk -list
-./lbmbenchk
-    [-dims XxYyZ] [-geometry box|channel|pipe|blocks[-<block size>]] [-iterations <iterations>] [-lattice-dump-ascii]
-    [-rho-in <density>] [-rho-out <density] [-omega <omega>] [-kernel <kernel>]
-    [-periodic-x]
-    [-t <number of threads>]
-    [-pin core{,core}*]
-    [-verify]
-    -- <kernel specific parameters>
-
--list           List available kernels.
-
--dims XxYxZ     Specify geometry dimensions.
-
--geometry blocks-<block size>
-                Geometetry with blocks of size <block size> regularily layout out.
-
-

If an option is specified multiple times the last one overrides previous ones. -This also holds true for -verify, which sets geometry dimensions, -iterations, etc., which can afterward be overridden, e.g.:

-
-$ bin/lbmbenchk-linux-intel-release -verify -dims 32x32x32
-
-

Kernel specific parameters can be obtained via selecting the specific kernel -and passing -h as parameter:

-
-$ bin/lbmbenchk-linux-intel-release -kernel kernel-name -- -h
-...
-Kernel parameters:
-[-blk <n>] [-blk-[xyz] <n>]
-
-

A list of all available kernels can be obtained via -list:

-
-$ ../bin/lbmbenchk-linux-gcc-debug -list
-Lattice Boltzmann Benchmark Kernels (LbmBenchKernels) Copyright (C) 2016, 2017 LSS, RRZE
-This program comes with ABSOLUTELY NO WARRANTY; for details see LICENSE.
-This is free software, and you are welcome to redistribute it under certain conditions.
-
-LBM Benchmark Kernels 0.1, compiled Jul  5 2017 21:59:22, type: verification
-Available kernels to benchmark:
-   list-aa-pv-soa
-   list-aa-ria-soa
-   list-aa-soa
-   list-aa-aos
-   list-pull-split-nt-1s-soa
-   list-pull-split-nt-2s-soa
-   list-push-soa
-   list-push-aos
-   list-pull-soa
-   list-pull-aos
-   push-soa
-   push-aos
-   pull-soa
-   pull-aos
-   blk-push-soa
-   blk-push-aos
-   blk-pull-soa
-   blk-pull-aos
-
-
-
-

3.2   Kernels

-

The following list briefly describes the available kernels:

-
    -
  • push-soa/push-aos/pull-soa/pull-aos: -Unoptimized kernels (but stream/collide are already fused) using two grids as -source and destination. Implement push/pull semantics as well as structure of -arrays (soa) or array of structures (aos) layout.
  • -
  • blk-push-soa/blk-push-aos/blk-pull-soa/blk-pull-aos: -The same as the unoptimized kernels without the blk prefix, except that they support -spatial blocking, i.e. loop blocking of the three loops used to iterate over -the lattice. Here manual work sharing for OpenMP is used.
  • -
  • list-push-soa/list-push-aos/list-pull-soa/list-pull-aos: -The same as the unoptimized kernels without the list prefix, but for indirect addressing. -Here only a 1D vector is used to store the fluid nodes, omitting the -obstacles. An adjacency list is used to recover the neighborhood associations.
  • -
  • list-pull-split-nt-1s-soa/list-pull-split-nt-2s-soa: -Optimized variant of list-pull-soa. Chunks of the lattice are processed at -once. Postcollision values are written back via nontemporal stores in 18 (1s) -or 9 (2s) loops.
  • -
  • list-aa-aos/list-aa-soa: -Unoptimized implementation of the AA pattern for the 1D vector with adjacency -list. Both array of structures (aos) and structure of arrays (soa) -data layouts are supported.
  • -
  • list-aa-ria-soa: -Implementation of AA pattern with intrinsics for the 1D vector with adjacency -list. Furthermore it contains a vectorized even time step and run length -coding to reduce the loop balance of the odd time step.
  • -
  • list-aa-pv-soa: -All optimizations of list-aa-ria-soa, with additional partial vectorization -of the odd time step.
  • -
-

Note that all array of structures (aos) kernels might require blocking -(depending on the domain size) to reach the performance of their structure of -arrays (soa) counterparts.

-

The following table summarizes the properties of the kernels. Here D means -direct addressing, i.e. full array, I means indirect addressing, i.e. 1D -vector with adjacency list, x means supported, whereas -- means unsupported. -The loop balance B_l is computed for D3Q19 model with double precision floating -point for PDFs (8 byte) and 4 byte integers for the index (adjacency list). -As list-aa-ria-soa and list-aa-pv-soa support run length coding their effective -loop balance depends on the geometry. The effective loop balance is printed -during each run.

- --------- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
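kernel name | prop. step | data layout | addr. | parallel | blocking | B_l [B/FLUP]
push-soa | OS | SoA | D | x | -- | 456
push-aos | OS | AoS | D | x | -- | 456
pull-soa | OS | SoA | D | x | -- | 456
pull-aos | OS | AoS | D | x | -- | 456
blk-push-soa | OS | SoA | D | x | x | 456
blk-push-aos | OS | AoS | D | x | x | 456
blk-pull-soa | OS | SoA | D | x | x | 456
blk-pull-aos | OS | AoS | D | x | x | 456
list-push-soa | OS | SoA | I | x | x | 528
list-push-aos | OS | AoS | I | x | x | 528
list-pull-soa | OS | SoA | I | x | x | 528
list-pull-aos | OS | AoS | I | x | x | 528
list-pull-split-nt-1s | OS | SoA | I | x | x | 376
list-pull-split-nt-2s | OS | SoA | I | x | x | 376
list-aa-soa | AA | SoA | I | x | x | 340
list-aa-aos | AA | AoS | I | x | x | 340
list-aa-ria-soa | AA | SoA | I | x | x | 304-342
list-aa-pv-soa | AA | SoA | I | x | x | 304-342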
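
The balance values in the table can be reproduced with a little arithmetic. The sketch below is not taken from the benchmark sources; it assumes the usual accounting of one load and one store per PDF, an additional write-allocate transfer per PDF for ordinary stores into a second grid, and 18 adjacency list entries per fluid node for indirect addressing, which the AA pattern needs only in every second time step. All variable names are illustrative.

```sh
#!/bin/sh
# Loop balance in bytes per fluid lattice node update (B/FLUP) for D3Q19.
# Assumed accounting (not from the benchmark sources): load + store per PDF,
# write-allocate for ordinary stores, 18 index entries per node for lists.
Q=19          # PDFs per node (D3Q19)
PDF_BYTES=8   # double precision PDF
IDX_BYTES=4   # 4 byte integer index in the adjacency list

# Two grids, full array: load + write-allocate + store.
B_FULL=$(( 3 * Q * PDF_BYTES ))
# Two grids, indirect addressing: as above plus the adjacency list.
B_LIST=$(( 3 * Q * PDF_BYTES + (Q - 1) * IDX_BYTES ))
# Nontemporal stores avoid the write-allocate transfer.
B_LIST_NT=$(( 2 * Q * PDF_BYTES + (Q - 1) * IDX_BYTES ))
# AA pattern updates in place; the list is read only every second step.
B_LIST_AA=$(( 2 * Q * PDF_BYTES + (Q - 1) * IDX_BYTES / 2 ))

echo "push/pull, blk-push/blk-pull: $B_FULL"
echo "list-push/list-pull:          $B_LIST"
echo "list-pull-split-nt:           $B_LIST_NT"
echo "list-aa:                      $B_LIST_AA"
```

Run length coding in list-aa-ria-soa and list-aa-pv-soa changes the index traffic depending on the geometry, which is why the table gives a range for these two kernels.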
-
-
-
-

4   Benchmarking

-

Correct benchmarking is a nontrivial task. Whenever benchmark results should be -created make sure the binary was compiled with:

-
    -
  • BENCHMARK=on (default if not overridden) and
  • -
  • BUILD=release (default if not overridden) and
  • -
  • the correct ISA for macros is used, selected via ISA and
  • -
  • use TARCH to specify the architecture the compiler generates code for.
  • -
-
-

4.1   Intel Compiler

-

For the Intel compiler one can specify depending on the target ISA extension:

-
    -
  • AVX: TARCH=-xAVX
  • -
  • AVX2 and FMA: TARCH=-xCORE-AVX2,-fma
  • -
  • AVX512: TARCH=-xCORE-AVX512
  • -
  • KNL: TARCH=-xMIC-AVX512
  • -
-

Compiling for an architecture supporting AVX (Sandy Bridge, Ivy Bridge):

-
-make ISA=avx TARCH=-xAVX
-
-

Compiling for an architecture supporting AVX2 (Haswell, Broadwell):

-
-make ISA=avx TARCH=-xCORE-AVX2,-fma
-
-

WARNING: ISA is still set to avx here, as the FMA intrinsics are currently not -implemented. This might change in the future.

-

Compiling for an architecture supporting AVX-512 (Skylake):

-
-make ISA=avx TARCH=-xCORE-AVX512
-
-

WARNING: ISA is here still set to avx as currently we have no implementation for the -AVX512 intrinsics. This might change in the future.

-
-
-

4.2   Pinning

-

During benchmarking pinning should be used via the -pin parameter. Running -a benchmark with 10 threads pinned to the first 10 cores works like

-
-$ bin/lbmbenchk-linux-intel-release ... -t 10 -pin $(seq -s , 0 9)
-
-
-
-

4.3   General Remarks

-

Things the binary does not check or control:

-
    -
  • transparent huge pages: when allocating memory small 4 KiB pages might be -replaced with larger ones. This is in general a good thing, but whether this is -really the case depends on the system settings (check e.g. the status of -/sys/kernel/mm/transparent_hugepage/enabled). -Currently madvise(MADV_HUGEPAGE) is used for allocations which are aligned to -a 4 KiB page, which should be the case for the lattices. -This should result in huge pages unless THP is disabled on the machine. -(NOTE: madvise() is used if HAVE_HUGE_PAGES is defined, which is currently -hard coded in Memory.c).
  • -
  • CPU/core frequency: For reproducible results the frequency of all cores -should be fixed.
  • -
  • NUMA placement policy: The benchmark assumes a first touch policy, which -means the memory will be placed at the NUMA domain the touching core is -associated with. If a different policy is in place or the NUMA domain to be -used is already full memory might be allocated in a remote domain. Accesses -to remote domains typically have a higher latency and lower bandwidth.
  • -
  • System load: interference with other applications, especially on desktop -systems should be avoided.
  • -
  • Padding: For SoA based kernels the number of (fluid) nodes is automatically -adjusted so that no cache or TLB thrashing should occur. The parameters are -optimized for current Intel based systems. For more details look into the -padding section.
  • -
  • CPU dispatcher function: the compiler might add different versions of a -function for different ISA extensions. Make sure the code you think is -executed is actually the code which is executed.
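
The THP status mentioned in the first item can be checked before a run. A minimal sketch, assuming the sysfs file uses the usual format in which the active mode is enclosed in square brackets (e.g. always [madvise] never); the helper name thp_mode is made up for illustration.

```sh
#!/bin/sh
# Extract the active mode from a THP status line; the kernel marks the
# active entry with square brackets, e.g. "always [madvise] never".
thp_mode() {
    sed -n 's/.*\[\([a-z]*\)].*/\1/p'
}

THP_FILE=/sys/kernel/mm/transparent_hugepage/enabled
if [ -r "$THP_FILE" ]; then
    echo "THP mode: $(thp_mode < "$THP_FILE")"
else
    # Assumption: a missing file means THP support is not available here.
    echo "THP status file not readable: $THP_FILE"
fi
```

Since the benchmark requests huge pages via madvise(), either always or madvise should allow them, whereas never disables them.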
  • -
-
-
-

4.4   Padding

-

With correct padding cache and TLB thrashing can be avoided. Therefore the -number of (fluid) nodes used in the data layout is artificially increased.

-

Currently automatic padding is active for kernels which support it. It can be -controlled via the kernel parameter (i.e. parameter after the --) --pad. Supported values are auto (default), no (to disable padding), -or a manual padding.

-

Automatic padding tries to avoid cache and TLB thrashing and pads for a 32 -entry (huge pages) TLB with 8 sets and a 512 set (L2) cache. This reflects the -parameters of current Intel based processors.

-

Manual padding is done via a padding string and has the format -mod_1+offset_1(,mod_n+offset_n), which specifies numbers of bytes. -SoA data layouts can exhibit TLB thrashing. Therefore we want to distribute the -19 pages with one lattice (36 with two lattices) we are concurrently accessing -over as many sets in the TLB as possible. -This is controlled by the distance between the accessed pages, which is the -number of (fluid) nodes in between them and can be adjusted by adding further -(fluid) nodes. -We want the distance d (in bytes) between two accessed pages to be e.g. -d % (PAGE_SIZE * TLB_SETS) = PAGE_SIZE. -This would distribute the pages evenly over the sets. Here PAGE_SIZE * TLB_SETS -would be our mod_1 and PAGE_SIZE (after the =) our offset_1. -Measurements show that with only a quarter of half of a page size as offset -higher performance is achieved, which is done by automatic padding. -On top of this padding more paddings can be added. They are just added to the -padding string and are separated by commas.

-

A zero modulus in the padding string has a special meaning. Here the -corresponding offset is just added to the number of nodes. A padding string -like -pad 0+16 would add a static padding of two nodes (one node = 8 bytes).
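
As an illustration of the relation d % (PAGE_SIZE * TLB_SETS) = PAGE_SIZE, the following sketch assembles a manual padding string. It assumes 2 MiB huge pages and takes the 8 TLB sets mentioned for automatic padding; both values are assumptions about the target machine, and the variable names are made up.

```sh
#!/bin/sh
# Build a manual padding string of the form mod_1+offset_1 (values in bytes).
# Assumed: 2 MiB huge pages and 8 TLB sets; adjust for the target machine.
PAGE_SIZE=$(( 2 * 1024 * 1024 ))
TLB_SETS=8

MOD=$(( PAGE_SIZE * TLB_SETS ))   # mod_1: the set mapping repeats after this many bytes
OFFSET=$PAGE_SIZE                 # offset_1: desired remainder, one page

PAD="${MOD}+${OFFSET}"
echo "-pad $PAD"
```

The resulting string is passed as a kernel parameter, i.e. after the --.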

-
-
-
-

5   Geometries

-

TODO: supported geometries: channel, pipe, blocks, fluid

-
-
-

6   Performance Results

-

This section lists performance values measured on several machines for -different kernels and geometries. -The RFM column denotes the expected performance as predicted by the -Roofline performance model [williams-2008]. -For performance prediction of each kernel a memory bandwidth benchmark is used -which mimics the kernel's memory access pattern and the kernel's loop balance -(see [kernels] for details).
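
The RFM values can be cross-checked by dividing a measured memory bandwidth by a kernel's loop balance. The sketch below assumes that performance is reported in MFLUP/s (million fluid lattice node updates per second), a unit the tables do not state, and that the copy-19 bandwidth belongs to the two-grid full array kernels and update-19 to the AA kernels; this pairing is inferred from the numbers, not documented. The helper name roofline is made up for illustration.

```sh
#!/bin/sh
# Roofline estimate: performance = memory bandwidth / loop balance.
# $1: bandwidth in GB/s, $2: loop balance in B/FLUP; prints MFLUP/s (assumed unit).
roofline() {
    awk -v bw="$1" -v bl="$2" 'BEGIN { printf "%.0f\n", bw * 1000 / bl }'
}

roofline 47.3 456   # copy-19 bandwidth, two-grid full array kernels
roofline 44.0 304   # update-19 bandwidth, AA pattern on the full array
```

With the Haswell bandwidths of section 6.1 (47.3 GB/s and 44.0 GB/s) this prints 104 and 145, matching the RFM column there.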

-
-

6.1   Haswell, Intel Xeon E5-2695 v3

-
    -
  • Haswell architecture, AVX2, FMA
  • -
  • 14 cores, 2.3 GHz
  • -
  • 2 x 7 cores in cluster-on-die (CoD) mode enabled
  • -
  • SMT enabled
  • -
-

memory bandwidth:

-
    -
  • copy-19 47.3 GB/s
  • -
  • copy-19-nt-sl 47.1 GB/s
  • -
  • update-19 44.0 GB/s
  • -
-

geometry dimensions: 500x100x100

- --------------- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
kernel | pipe | blocks-2 | blocks-4 | blocks-6 | blocks-8 | blocks-10 | blocks-15 | blocks-16 | blocks-20 | blocks-25 | blocks-32 | RFM
blk-push-aos | 58.82 | 49.85 | 57.34 | 59.90 | 61.37 | 62.17 | 65.30 | 64.00 | 67.54 | 64.46 | 69.69 | 104
blk-push-soa | 32.32 | 33.46 | 34.02 | 34.64 | 35.06 | 35.04 | 36.31 | 35.44 | 37.20 | 35.14 | 37.95 | 104
blk-pull-aos | 56.97 | 51.41 | 56.09 | 57.92 | 59.98 | 59.83 | 63.37 | 61.55 | 65.50 | 63.11 | 67.02 | 104
blk-pull-soa | 49.29 | 46.23 | 47.50 | 51.97 | 51.27 | 49.52 | 55.23 | 53.13 | 54.50 | 49.79 | 57.90 | 104
aa-aos | 91.35 | 66.14 | 76.80 | 84.76 | 83.63 | 91.36 | 93.46 | 92.62 | 93.91 | 92.25 | 92.93 | 145
aa-soa | 75.51 | 65.68 | 70.94 | 71.36 | 73.83 | 75.46 | 74.84 | 79.48 | 83.28 | 77.70 | 82.72 | 145
aa-vec-soa | 93.85 | 83.44 | 91.58 | 93.96 | 94.35 | 96.62 | 101.76 | 96.72 | 106.37 | 102.60 | 110.28 | 145
list-push-aos | 80.29 | 80.97 | 80.95 | 81.10 | 81.37 | 82.44 | 81.77 | 81.49 | 80.72 | 81.93 | 80.93 | 83
list-push-soa | 47.52 | 42.65 | 45.28 | 46.64 | 43.46 | 40.59 | 44.94 | 46.55 | 41.53 | 45.98 | 44.86 | 83
list-pull-aos | 85.30 | 82.97 | 86.43 | 83.42 | 86.33 | 83.70 | 86.43 | 83.77 | 83.10 | 85.89 | 84.44 | 83
list-pull-soa | 62.12 | 63.61 | 63.28 | 61.32 | 66.72 | 62.65 | 64.82 | 60.49 | 58.01 | 64.46 | 62.52 | 83
list-pull-split-nt-1s-soa | 121.35 | 113.77 | 115.29 | 113.54 | 117.00 | 116.46 | 114.78 | 114.54 | 110.83 | 112.67 | 117.85 | 125
list-pull-split-nt-2s-soa | 118.09 | 110.48 | 112.55 | 113.18 | 113.44 | 111.85 | 109.27 | 114.41 | 110.28 | 111.78 | 113.74 | 125
list-aa-aos | 121.28 | 118.63 | 119.00 | 118.50 | 121.99 | 119.11 | 118.83 | 121.47 | 121.62 | 126.18 | 120.12 | 129
list-aa-soa | 126.34 | 116.90 | 129.45 | 127.12 | 129.41 | 121.42 | 126.19 | 126.76 | 126.70 | 124.40 | 125.22 | 129
list-aa-ria-soa | 133.68 | 121.82 | 126.04 | 128.46 | 131.15 | 132.25 | 128.78 | 133.50 | 126.69 | 124.40 | 130.37 | 145
list-aa-pv-soa | 146.22 | 124.39 | 130.73 | 136.29 | 137.61 | 131.21 | 138.65 | 138.78 | 127.02 | 132.40 | 138.37 | 145
-
-
-

6.2   Broadwell, Intel Xeon E5-2630 v4

-
    -
  • Broadwell architecture, AVX2, FMA
  • -
  • 10 cores, 2.2 GHz
  • -
  • SMT disabled
  • -
-

memory bandwidth:

-
    -
  • copy-19 48.0 GB/s
  • -
  • copy-nt-sl-19 48.2 GB/s
  • -
  • update-19 51.1 GB/s
  • -
-

geometry dimensions: 500x100x100

- --------------- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
kernel | pipe | blocks-2 | blocks-4 | blocks-6 | blocks-8 | blocks-10 | blocks-15 | blocks-16 | blocks-20 | blocks-25 | blocks-32 | RFM
blk-push-aos | 55.75 | 47.62 | 54.57 | 57.10 | 58.49 | 59.00 | 61.72 | 60.56 | 64.05 | 61.10 | 66.03 | 105
blk-push-soa | 30.06 | 31.09 | 32.13 | 32.54 | 32.74 | 32.72 | 33.81 | 33.19 | 34.90 | 33.21 | 35.75 | 105
blk-pull-aos | 53.80 | 48.61 | 53.08 | 54.99 | 56.08 | 56.68 | 59.20 | 58.12 | 61.49 | 58.71 | 63.45 | 105
blk-pull-soa | 46.96 | 46.61 | 48.84 | 49.70 | 50.33 | 50.46 | 52.36 | 51.39 | 54.20 | 51.61 | 55.71 | 105
aa-aos | 91.40 | 66.99 | 78.47 | 83.38 | 86.62 | 88.62 | 92.98 | 91.54 | 97.08 | 94.93 | 98.90 | 168
aa-soa | 83.01 | 69.96 | 75.85 | 77.72 | 79.01 | 79.29 | 82.38 | 80.11 | 85.70 | 83.91 | 87.69 | 168
aa-vec-soa | 112.03 | 96.52 | 105.32 | 109.76 | 112.55 | 113.82 | 120.55 | 118.37 | 126.30 | 121.37 | 131.94 | 168
list-push-aos | 75.13 | 74.18 | 75.20 | 75.42 | 75.24 | 75.99 | 75.80 | 75.80 | 75.54 | 76.22 | 76.21 | 97
list-push-soa | 40.99 | 38.14 | 39.00 | 38.89 | 38.89 | 39.67 | 39.87 | 39.28 | 39.35 | 40.08 | 40.13 | 97
list-pull-aos | 82.07 | 82.88 | 83.29 | 83.09 | 83.32 | 83.49 | 82.82 | 82.88 | 83.32 | 82.60 | 82.93 | 97
list-pull-soa | 62.07 | 60.40 | 61.89 | 61.39 | 62.43 | 60.90 | 60.48 | 62.80 | 62.50 | 61.10 | 60.38 | 97
list-pull-split-nt-1s-soa | 125.81 | 120.60 | 121.96 | 122.34 | 122.86 | 123.53 | 123.64 | 123.67 | 125.94 | 124.09 | 123.69 | 128
list-pull-split-nt-2s-soa | 122.79 | 117.16 | 118.86 | 119.16 | 119.56 | 119.99 | 120.01 | 120.03 | 122.64 | 120.57 | 120.39 | 128
list-aa-aos | 128.13 | 127.41 | 129.31 | 129.07 | 129.79 | 129.63 | 129.67 | 129.94 | 129.12 | 128.41 | 129.72 | 150
list-aa-soa | 141.60 | 139.78 | 141.58 | 142.16 | 141.94 | 141.31 | 142.37 | 142.25 | 142.43 | 141.40 | 142.26 | 150
list-aa-ria-soa | 141.82 | 134.88 | 140.15 | 140.72 | 141.67 | 140.51 | 141.18 | 141.29 | 142.97 | 141.94 | 143.25 | 168
list-aa-pv-soa | 164.79 | 140.95 | 159.24 | 161.78 | 162.40 | 163.04 | 164.69 | 164.38 | 165.11 | 165.75 | 166.09 | 168
-
-
-

6.3   Skylake, Intel Xeon Gold 6148

-
    -
  • Skylake architecture, AVX2, FMA, AVX512
  • -
  • 20 cores, 2.4 GHz
  • -
  • SMT enabled
  • -
-

memory bandwidth:

-
    -
  • copy-19 89.7 GB/s
  • -
  • copy-19-nt-sl 92.4 GB/s
  • -
  • update-19 93.6 GB/s
  • -
-

geometry dimensions: 500x100x100

- --------------- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
kernel | pipe | blocks-2 | blocks-4 | blocks-6 | blocks-8 | blocks-10 | blocks-15 | blocks-16 | blocks-20 | blocks-25 | blocks-32 | RFM
blk-push-aos | 113.01 | 93.99 | 108.98 | 114.65 | 117.87 | 119.47 | 124.95 | 122.46 | 129.29 | 123.87 | 133.01 | 197
blk-push-soa | 100.21 | 98.87 | 103.63 | 105.56 | 107.02 | 107.27 | 111.61 | 109.83 | 116.16 | 110.51 | 110.29 | 197
blk-pull-aos | 118.45 | 102.54 | 114.12 | 117.82 | 122.69 | 124.31 | 130.58 | 127.85 | 135.72 | 129.65 | 139.94 | 197
blk-pull-soa | 82.60 | 83.36 | 87.13 | 88.39 | 88.84 | 88.96 | 92.48 | 90.93 | 95.79 | 91.92 | 98.64 | 197
aa-aos | 171.32 | 125.43 | 147.73 | 157.70 | 163.35 | 167.25 | 175.39 | 174.20 | 182.54 | 173.67 | 187.76 | 308
aa-soa | 180.85 | 152.39 | 165.84 | 152.59 | 171.90 | 175.76 | 184.94 | 182.34 | 189.43 | 180.30 | 193.54 | 308
aa-vec-soa | 208.03 | 181.51 | 195.86 | 203.41 | 209.08 | 212.34 | 224.05 | 219.49 | 234.31 | 225.92 | 245.22 | 308
list-push-aos | 158.81 | 164.67 | 162.93 | 163.05 | 165.22 | 164.31 | 164.66 | 160.78 | 164.07 | 165.19 | 164.06 | 177
list-push-soa | 134.60 | 110.44 | 110.17 | 132.01 | 132.95 | 133.46 | 134.37 | 134.33 | 135.12 | 134.91 | 137.87 | 177
list-pull-aos | 169.61 | 170.03 | 170.89 | 170.90 | 171.20 | 171.60 | 172.09 | 171.95 | 169.48 | 172.08 | 171.02 | 177
list-pull-soa | 120.50 | 116.73 | 118.62 | 118.00 | 120.99 | 118.15 | 117.17 | 121.41 | 120.83 | 120.00 | 118.74 | 177
list-pull-split-nt-1s-soa | 225.59 | 224.18 | 225.10 | 226.34 | 226.01 | 230.37 | 227.50 | 228.42 | 227.39 | 231.65 | 227.35 | 246
list-pull-split-nt-2s-soa | 219.20 | 214.63 | 217.61 | 218.13 | 219.07 | 221.01 | 219.88 | 220.09 | 220.62 | 221.68 | 220.58 | 246
list-aa-aos | 241.39 | 239.27 | 239.53 | 242.56 | 242.46 | 243.00 | 242.91 | 242.46 | 241.24 | 242.96 | 241.52 | 275
list-aa-soa | 273.73 | 268.49 | 268.48 | 271.79 | 275.29 | 274.56 | 277.18 | 272.67 | 274.21 | 275.24 | 278.21 | 275
list-aa-ria-soa | 288.42 | 261.89 | 273.26 | 284.84 | 283.88 | 288.29 | 290.72 | 289.81 | 293.36 | 290.75 | 292.93 | 308
list-aa-pv-soa | 303.35 | 267.21 | 289.18 | 294.96 | 294.36 | 298.16 | 300.45 | 301.71 | 302.37 | 302.88 | 304.46 | 308
-
-
-
-

7   Licence

-

The Lattice Boltzmann Benchmark Kernels are licensed under GPLv3.

-
-
-

8   Acknowledgements

-

This work was funded by BMBF, grant no. 01IH15003A (project SKAMPY).

-

This work was funded by KONWHIR project OMI4PAPS.

-
-
-

9   Bibliography

- - - - - -
[ginzburg-2008]I. Ginzburg, F. Verhaeghe, and D. d'Humières. -Two-relaxation-time lattice Boltzmann scheme: About parametrization, velocity, pressure and mixed boundary conditions. -Commun. Comput. Phys., 3(2):427-478, 2008.
- - - - - -
[williams-2008]S. Williams, A. Waterman, and D. Patterson. -Roofline: an insightful visual performance model for multicore architectures. -Commun. ACM, 52(4):65-76, Apr 2009. doi:10.1145/1498765.1498785
-

Document was generated at 2017-11-21 15:43.

-
-
- - diff --git a/doc/images/benchmark-emmy-dp.png b/doc/images/benchmark-emmy-dp.png new file mode 100644 index 0000000..4c907f7 Binary files /dev/null and b/doc/images/benchmark-emmy-dp.png differ diff --git a/doc/images/benchmark-emmy-sp.png b/doc/images/benchmark-emmy-sp.png new file mode 100644 index 0000000..560f750 Binary files /dev/null and b/doc/images/benchmark-emmy-sp.png differ diff --git a/doc/images/benchmark-hasep1-dp.png b/doc/images/benchmark-hasep1-dp.png new file mode 100644 index 0000000..0f0a258 Binary files /dev/null and b/doc/images/benchmark-hasep1-dp.png differ diff --git a/doc/images/benchmark-hasep1-sp.png b/doc/images/benchmark-hasep1-sp.png new file mode 100644 index 0000000..4d3139d Binary files /dev/null and b/doc/images/benchmark-hasep1-sp.png differ diff --git a/doc/images/benchmark-meggie-dp.png b/doc/images/benchmark-meggie-dp.png new file mode 100644 index 0000000..dc98dfa Binary files /dev/null and b/doc/images/benchmark-meggie-dp.png differ diff --git a/doc/images/benchmark-meggie-sp.png b/doc/images/benchmark-meggie-sp.png new file mode 100644 index 0000000..de3644c Binary files /dev/null and b/doc/images/benchmark-meggie-sp.png differ diff --git a/doc/images/benchmark-naples1-dp.png b/doc/images/benchmark-naples1-dp.png new file mode 100644 index 0000000..736d65d Binary files /dev/null and b/doc/images/benchmark-naples1-dp.png differ diff --git a/doc/images/benchmark-naples1-sp.png b/doc/images/benchmark-naples1-sp.png new file mode 100644 index 0000000..96aa9b5 Binary files /dev/null and b/doc/images/benchmark-naples1-sp.png differ diff --git a/doc/images/benchmark-skylakesp2-dp.png b/doc/images/benchmark-skylakesp2-dp.png new file mode 100644 index 0000000..974cbcb Binary files /dev/null and b/doc/images/benchmark-skylakesp2-dp.png differ diff --git a/doc/images/benchmark-skylakesp2-sp.png b/doc/images/benchmark-skylakesp2-sp.png new file mode 100644 index 0000000..583b053 Binary files /dev/null and 
b/doc/images/benchmark-skylakesp2-sp.png differ diff --git a/doc/images/benchmark-summitridge1-dp.png b/doc/images/benchmark-summitridge1-dp.png new file mode 100644 index 0000000..a3d922f Binary files /dev/null and b/doc/images/benchmark-summitridge1-dp.png differ diff --git a/doc/images/benchmark-summitridge1-sp.png b/doc/images/benchmark-summitridge1-sp.png new file mode 100644 index 0000000..c5524cb Binary files /dev/null and b/doc/images/benchmark-summitridge1-sp.png differ diff --git a/doc/main.html b/doc/main.html new file mode 100644 index 0000000..9f11866 --- /dev/null +++ b/doc/main.html @@ -0,0 +1,1239 @@ + + + + + + +LBM Benchmark Kernels Documentation + + + + +
+ + +
+
Copyright
+
+
Markus Wittmann, 2016-2018
+
RRZE, University of Erlangen-Nuremberg, Germany
+
markus.wittmann -at- fau.de or hpc -at- rrze.fau.de
+

+
Viktor Haag, 2016
+
LSS, University of Erlangen-Nuremberg, Germany
+

+
+
This file is part of the Lattice Boltzmann Benchmark Kernels (LbmBenchKernels).
+

+
LbmBenchKernels is free software: you can redistribute it and/or modify
+
it under the terms of the GNU General Public License as published by
+
the Free Software Foundation, either version 3 of the License, or
+
(at your option) any later version.
+

+
LbmBenchKernels is distributed in the hope that it will be useful,
+
but WITHOUT ANY WARRANTY; without even the implied warranty of
+
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+
GNU General Public License for more details.
+

+
You should have received a copy of the GNU General Public License
+
along with LbmBenchKernels. If not, see <http://www.gnu.org/licenses/>.
+
+

LBM Benchmark Kernels Documentation

+ +
+

1   Introduction

+

The lattice Boltzmann (LBM) benchmark kernels are a collection of LBM kernel +implementations.

+

AS SUCH THE LBM BENCHMARK KERNELS ARE NO FULLY EQUIPPED CFD SOLVER AND SOLELY +SERVES THE PURPOSE OF STUDYING POSSIBLE PERFORMANCE OPTIMIZATIONS AND/OR +EXPERIMENTS.

+

Currently all kernels utilize a D3Q19 discretization and the +two-relaxation-time (TRT) collision operator [ginzburg-2008]. +All operations are carried out in double or single precision arithmetic.

+
+
+

2   Compilation

+

The benchmark framework currently supports only Linux systems and the GCC and +Intel compilers. Every other configuration probably requires adjustment inside +the code and the makefiles. Furthermore some code might be platform or at least +POSIX specific.

+

The benchmark can be built via make from the src subdirectory. This will +generate one binary which hosts all implemented benchmark kernels.

+

Binaries are located under the bin subdirectory and will have different names +depending on compiler and build configuration.

+

Compilation can target debug or release builds. Combined with both build types +verification can be enabled, which increases the runtime and hence is not +suited for benchmarking.

+
+

2.1   Debug and Verification

+
+make BUILD=debug BENCHMARK=off
+
+

Running make with BUILD=debug builds the debug version of +the benchmark kernels, where no optimizations are performed, line numbers and +debug symbols are included, and DEBUG is defined. The resulting +binary will be found in the bin subdirectory and named +lbmbenchk-linux-<compiler>-debug.

+

Specifying BENCHMARK=off turns on verification +(VERIFICATION=on), statistics (STATISTICS=on), and VTK output +(VTK_OUTPUT=on).

+

Please note that the generated binary will therefore +exhibit poor performance.

+
+
+

2.2   Release and Verification

+

Verification with the debug builds can be extremely slow. Hence verification +capabilities can also be built into release builds:

+
+make BENCHMARK=off
+
+
+
+

2.3   Benchmarking

+

To generate a binary for benchmarking run make with

+
+make
+
+

By default BENCHMARK=on and BUILD=release are set, where +BUILD=release turns optimizations on and BENCHMARK=on disables +verification, statistics, and VTK output.

+

See Options Summary below for further description of options which can be +applied, e.g. TARCH as well as the Benchmarking section.

+
+
+

2.4   Compilers

+

Currently only the GCC and Intel compilers under Linux are supported. The +configuration is chosen via CONFIG=linux-gcc or +CONFIG=linux-intel.

+
+
+

2.5   Floating Point Precision

+

By default double precision data types are used for storing PDFs and floating +point constants. Furthermore, this is the default for the intrinsics kernels. +With the PRECISION=sp variable this can be changed to single precision.

+
+make PRECISION=sp   # build for single precision kernels
+
+make PRECISION=dp   # build for double precision kernels (default)
+
+
+
+

2.6   Cleaning

+

For each configuration and build (debug/release) a subdirectory under the +src/obj directory is created where the dependency and object files are +stored. +With

+
+make CONFIG=... BUILD=... clean
+
+

a specific combination is selected and cleaned, whereas with

+
+make clean-all
+
+

all object and dependency files are deleted.

+
+
+

2.7   Options Summary

+

Options that can be specified when building the suite with make:

+ ++++++ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
============  =======================  ===========  ===========
name          values                   default      description
============  =======================  ===========  ===========
BENCHMARK     on, off                  on           If enabled, disables VERIFICATION, STATISTICS, VTK_OUTPUT. If disabled enables the three former options.
BUILD         debug, release           release      debug: no optimization, debug symbols, DEBUG defined. release: optimizations enabled.
CONFIG        linux-gcc, linux-intel   linux-intel  Select GCC or Intel compiler.
ISA           avx, sse                 avx          Determines which ISA extension is used for macro definitions of the intrinsics. This is not the architecture the compiler generates code for.
OPENMP        on, off                  on           OpenMP, i.e. threading support.
PRECISION     dp, sp                   dp           Floating point precision used for data type, arithmetic, and intrinsics.
STATISTICS    on, off                  off          View statistics, like density etc, during simulation.
TARCH         --                       --           Via TARCH the architecture the compiler generates code for can be overridden. The value depends on the chosen compiler.
VERIFICATION  on, off                  off          Turn verification on/off.
VTK_OUTPUT    on, off                  off          Enable/Disable VTK file output.
============  =======================  ===========  ===========
+
+
+
+

3   Invocation

+

Running the binary prints, along with the GPL licence header, a line like the following:

+
+LBM Benchmark Kernels 0.1, compiled Jul  5 2017 21:59:22, type: verification
+
+

if verification was enabled during compilation or

+
+LBM Benchmark Kernels 0.1, compiled Jul  5 2017 21:59:22, type: benchmark
+
+

if verification was disabled during compilation.

+
+

3.1   Command Line Parameters

+

Running the binary with -h lists all available parameters:

+
+Usage:
+./lbmbenchk -list
+./lbmbenchk
+    [-dims XxYyZ] [-geometry box|channel|pipe|blocks[-<block size>]] [-iterations <iterations>] [-lattice-dump-ascii]
+    [-rho-in <density>] [-rho-out <density] [-omega <omega>] [-kernel <kernel>]
+    [-periodic-x]
+    [-t <number of threads>]
+    [-pin core{,core}*]
+    [-verify]
+    -- <kernel specific parameters>
+
+-list           List available kernels.
+
+-dims XxYxZ     Specify geometry dimensions.
+
+-geometry blocks-<block size>
+                Geometetry with blocks of size <block size> regularily layout out.
+
+

If an option is specified multiple times the last one overrides previous ones. +This also holds for -verify, which sets geometry dimensions, +iterations, etc., which can afterwards be overridden, e.g.:

+
+$ bin/lbmbenchk-linux-intel-release-dp -verify -dims 32x32x32
+
+

Kernel specific parameters can be obtained via selecting the specific kernel +and passing -h as parameter:

+
+$ bin/lbmbenchk-linux-intel-release-dp -kernel kernel-name -- -h
+...
+Kernel parameters:
+[-blk <n>] [-blk-[xyz] <n>]
+
+

A list of all available kernels can be obtained via -list:

+
+$ ../bin/lbmbenchk-linux-gcc-debug-dp -list
+Lattice Boltzmann Benchmark Kernels (LbmBenchKernels) Copyright (C) 2016, 2017 LSS, RRZE
+This program comes with ABSOLUTELY NO WARRANTY; for details see LICENSE.
+This is free software, and you are welcome to redistribute it under certain conditions.
+
+LBM Benchmark Kernels 0.1, compiled Jul  5 2017 21:59:22, type: verification
+Available kernels to benchmark:
+   list-aa-pv-soa
+   list-aa-ria-soa
+   list-aa-soa
+   list-aa-aos
+   list-pull-split-nt-1s-soa
+   list-pull-split-nt-2s-soa
+   list-push-soa
+   list-push-aos
+   list-pull-soa
+   list-pull-aos
+   push-soa
+   push-aos
+   pull-soa
+   pull-aos
+   blk-push-soa
+   blk-push-aos
+   blk-pull-soa
+   blk-pull-aos
+
+
+
+

3.2   Kernels

+

The following list briefly describes the available kernels:

+
    +
  • push-soa/push-aos/pull-soa/pull-aos: +Unoptimized kernels (but stream/collide are already fused) using two grids as +source and destination. Implement push/pull semantics as well as structure of +arrays (soa) or array of structures (aos) layout.
  • +
  • blk-push-soa/blk-push-aos/blk-pull-soa/blk-pull-aos: +The same as the unoptimized kernels without the blk prefix, except that they support +spatial blocking, i.e. loop blocking of the three loops used to iterate over +the lattice. Here manual work sharing for OpenMP is used.
  • +
  • aa-aos/aa-soa: +Straightforward implementation of the AA pattern on the full array with blocking support. +Manual work sharing for OpenMP is used. The domain is partitioned only along the x dimension.
  • +
  • aa-vec-soa/aa-vec-sl-soa: +Optimized AA kernel with intrinsics on full array. aa-vec-sl-soa uses only +one loop for iterating over the lattice instead of three nested ones.
  • +
  • list-push-soa/list-push-aos/list-pull-soa/list-pull-aos: +The same as the unoptimized kernels without the list prefix, but for indirect addressing. +Here only a 1D vector is used to store the fluid nodes, omitting the +obstacles. An adjacency list is used to recover the neighborhood associations.
  • +
  • list-pull-split-nt-1s-soa/list-pull-split-nt-2s-soa: +Optimized variant of list-pull-soa. Chunks of the lattice are processed at +once. Postcollision values are written back via nontemporal stores in 18 (1s) +or 9 (2s) loops.
  • +
  • list-aa-aos/list-aa-soa: +Unoptimized implementation of the AA pattern for the 1D vector with adjacency +list. Array of structures (aos) and structure of arrays (soa) +data layouts are supported.
  • +
  • list-aa-ria-soa: +Implementation of AA pattern with intrinsics for the 1D vector with adjacency +list. Furthermore it contains a vectorized even time step and run length +coding to reduce the loop balance of the odd time step.
  • +
  • list-aa-pv-soa: +All optimizations of list-aa-ria-soa. Additionally, the odd time step is +partially vectorized.
  • +
+

Note that all array of structures (aos) kernels might require blocking +(depending on the domain size) to reach the performance of their structure of +arrays (soa) counterparts.

+

The following table summarizes the properties of the kernels. Here D means +direct addressing, i.e. full array, I means indirect addressing, i.e. 1D +vector with adjacency list, x means supported, whereas -- means unsupported. +The loop balance B_l is computed for the D3Q19 model with double precision floating +point for PDFs (8 bytes) and 4 byte integers for the index (adjacency list). +As list-aa-ria-soa and list-aa-pv-soa support run length coding, their effective +loop balance depends on the geometry. The effective loop balance is printed +during each run.

+ +++++++++ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
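=========================  ==========  ===========  =====  ========  ========  ============
kernel name                prop. step  data layout  addr.  parallel  blocking  B_l [B/FLUP]
=========================  ==========  ===========  =====  ========  ========  ============
push-soa                   OS          SoA          D      x         --        456
push-aos                   OS          AoS          D      x         --        456
pull-soa                   OS          SoA          D      x         --        456
pull-aos                   OS          AoS          D      x         --        456
blk-push-soa               OS          SoA          D      x         x         456
blk-push-aos               OS          AoS          D      x         x         456
blk-pull-soa               OS          SoA          D      x         x         456
blk-pull-aos               OS          AoS          D      x         x         456
aa-soa                     AA          SoA          D      x         x         304
aa-aos                     AA          AoS          D      x         x         304
aa-vec-soa                 AA          SoA          D      x         x         304
aa-vec-sl-soa              AA          SoA          D      x         x         304
list-push-soa              OS          SoA          I      x         x         528
list-push-aos              OS          AoS          I      x         x         528
list-pull-soa              OS          SoA          I      x         x         528
list-pull-aos              OS          AoS          I      x         x         528
list-pull-split-nt-1s-soa  OS          SoA          I      x         x         376
list-pull-split-nt-2s-soa  OS          SoA          I      x         x         376
list-aa-soa                AA          SoA          I      x         x         340
list-aa-aos                AA          AoS          I      x         x         340
list-aa-ria-soa            AA          SoA          I      x         x         304-342
list-aa-pv-soa             AA          SoA          I      x         x         304-342
=========================  ==========  ===========  =====  ========  ========  ============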
+
+
+
+
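The fixed loop balance values in the kernel table can be reproduced by simple byte counting. The following C sketch shows one plausible accounting and is not taken from the benchmark sources. It assumes that kernels writing to a separate destination lattice with regular stores pay an extra write-allocate transfer per PDF, that 18 adjacency list entries are read per node with indirect addressing, and that the list-aa kernels read the adjacency list only in every second time step. The names `pdf_traffic` and `index_traffic` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Number of PDFs per lattice node in the D3Q19 model. */
#define N_PDFS 19

/*
 * PDF traffic per fluid lattice node update in bytes.
 * pdf_bytes:      size of one PDF (8 for double, 4 for single precision).
 * write_allocate: nonzero if every store also triggers a load of the
 *                 target cache line (assumption: two lattices, no
 *                 nontemporal stores), giving 3 instead of 2 transfers.
 */
static size_t pdf_traffic(size_t pdf_bytes, int write_allocate)
{
    assert(pdf_bytes > 0);
    size_t transfers = write_allocate ? 3 : 2;
    return N_PDFS * pdf_bytes * transfers;
}

/*
 * Adjacency list traffic per fluid lattice node update in bytes.
 * Assumption: one index per neighbor direction, none for the rest PDF.
 */
static size_t index_traffic(size_t index_bytes)
{
    return (N_PDFS - 1) * index_bytes;
}
```

Under these assumptions `pdf_traffic(8, 1)` gives 456 B/FLUP (push/pull), `pdf_traffic(8, 0)` gives 304 B/FLUP (AA), adding `index_traffic(4)` gives 528 and 376 B/FLUP for the list push/pull and split-nt kernels, and adding half of it to 304 gives 340 B/FLUP for list-aa. The geometry dependent range 304-342 of the run length coded kernels is not covered by this sketch.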

4   Benchmarking

+

Correct benchmarking is a nontrivial task. Whenever benchmark results are to be +produced, make sure the binary was compiled with:

+
    +
  • BENCHMARK=on (default if not overridden) and
  • +
  • BUILD=release (default if not overridden) and
  • +
  • the correct ISA for the macros, selected via ISA, and
  • +
  • TARCH set to the architecture the compiler generates code for.
  • +
+
+

4.1   Intel Compiler

+

For the Intel compiler, TARCH can be set depending on the target ISA extension:

+
    +
  • AVX: TARCH=-xAVX
  • +
  • AVX2 and FMA: TARCH=-xCORE-AVX2,-fma
  • +
  • AVX512: TARCH=-xCORE-AVX512
  • +
  • KNL: TARCH=-xMIC-AVX512
  • +
+

Compiling for an architecture supporting AVX (Sandy Bridge, Ivy Bridge):

+
+make ISA=avx TARCH=-xAVX
+
+

Compiling for an architecture supporting AVX2 (Haswell, Broadwell):

+
+make ISA=avx TARCH=-xCORE-AVX2,-fma
+
+

WARNING: ISA is still set to avx here, as the FMA intrinsics are currently not +implemented. This might change in the future.

+

Compiling for an architecture supporting AVX-512 (Skylake):

+
+make ISA=avx TARCH=-xCORE-AVX512
+
+

WARNING: ISA is still set to avx here, as there is currently no implementation for the +AVX512 intrinsics. This might change in the future.

+
+
+

4.2   Pinning

+

During benchmarking pinning should be used via the -pin parameter. Running +a benchmark with 10 threads pinned to the first 10 cores works like

+
+$ bin/lbmbenchk-linux-intel-release-dp ... -t 10 -pin $(seq -s , 0 9)
+
+
+
+

4.3   General Remarks

+

Things the binary does not check or control:

+
    +
  • transparent huge pages: when allocating memory, small 4 KiB pages might be +replaced with larger ones. This is in general a good thing, but whether it +actually happens depends on the system settings (check e.g. the status of +/sys/kernel/mm/transparent_hugepage/enabled). +Currently madvise(MADV_HUGEPAGE) is used for allocations which are aligned to +a 4 KiB page, which should be the case for the lattices. +This should result in huge pages unless THP is disabled on the machine. +(NOTE: madvise() is used if HAVE_HUGE_PAGES is defined, which is currently +hard coded in Memory.c).
  • +
  • CPU/core frequency: For reproducible results the frequency of all cores +should be fixed.
  • +
  • NUMA placement policy: The benchmark assumes a first touch policy, which +means the memory will be placed at the NUMA domain the touching core is +associated with. If a different policy is in place or the NUMA domain to be +used is already full memory might be allocated in a remote domain. Accesses +to remote domains typically have a higher latency and lower bandwidth.
  • +
  • System load: interference with other applications, especially on desktop +systems, should be avoided.
  • +
  • Padding: For SoA based kernels the number of (fluid) nodes is automatically +adjusted so that no cache or TLB thrashing should occur. The parameters are +optimized for current Intel based systems. For more details look into the +padding section.
  • +
  • CPU dispatcher function: the compiler might add different versions of a +function for different ISA extensions. Make sure the code you think is +executed is actually the code that is executed.
  • +
+
+
+

4.4   Padding

+

With correct padding cache and TLB thrashing can be avoided. Therefore the +number of (fluid) nodes used in the data layout is artificially increased.

+

Currently automatic padding is active for kernels which support it. It can be +controlled via the kernel parameter (i.e. parameter after the --) +-pad. Supported values are auto (default), no (to disable padding), +or a manual padding.

+

Automatic padding tries to avoid cache and TLB thrashing and pads for a 32 +entry (huge pages) TLB with 8 sets and a 512 set (L2) cache. This reflects the +parameters of current Intel based processors.

+

Manual padding is done via a padding string and has the format +mod_1+offset_1(,mod_n+offset_n), which specifies numbers of bytes. +SoA data layouts can exhibit TLB thrashing. Therefore we want to distribute the +19 pages with one lattice (36 with two lattices) we are concurrently accessing +over as many sets in the TLB as possible. +This is controlled by the distance between the accessed pages, which is the +number of (fluid) nodes in between them and can be adjusted by adding further +(fluid) nodes. +We want the distance d (in bytes) between two accessed pages to be e.g. +d % (PAGE_SIZE * TLB_SETS) = PAGE_SIZE. +This would distribute the pages evenly over the sets. Here PAGE_SIZE * TLB_SETS +would be our mod_1 and PAGE_SIZE (after the =) our offset_1. +Measurements show that with only a quarter of half of a page size as offset +higher performance is achieved, which is done by automatic padding. +On top of this padding more paddings can be added. They are just added to the +padding string and are separated by commas.

+

A zero modulus in the padding string has a special meaning. Here the +corresponding offset is just added to the number of nodes. A padding string +like -pad 0+16 would add a static padding of two nodes (one node = 8 bytes).
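How a single mod+offset pair of the padding string could translate into additional nodes is sketched below in C. This is an illustration of the rule d % mod = offset described above, not the benchmark's actual implementation. The function name `pad_nodes` and its parameters are hypothetical, and the sketch assumes that modulus and offset are multiples of the node size.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Returns the smallest node count >= n_nodes for which the distance
 * d = nodes * node_bytes satisfies d % mod == offset.
 * A zero modulus only adds the offset (converted to nodes).
 * Hypothetical helper; mod and offset are given in bytes.
 */
static size_t pad_nodes(size_t n_nodes, size_t node_bytes,
                        size_t mod, size_t offset)
{
    assert(node_bytes > 0);

    if (mod == 0) {
        /* Static padding, e.g. 0+16 with 8 byte nodes adds two nodes. */
        return n_nodes + offset / node_bytes;
    }

    /* Assumption: both values are node aligned and offset < mod. */
    assert(mod % node_bytes == 0);
    assert(offset % node_bytes == 0);
    assert(offset < mod);

    size_t rem = (n_nodes * node_bytes) % mod;    /* current d % mod   */
    size_t add = (offset + mod - rem) % mod;      /* bytes still missing */

    return n_nodes + add / node_bytes;
}
```

With 8 byte nodes, `pad_nodes(100, 8, 0, 16)` yields 102 nodes, matching the -pad 0+16 example, and `pad_nodes(11, 8, 64, 16)` yields 18 nodes, since 18 * 8 = 144 and 144 % 64 = 16.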

+
+
+
+

5   Geometries

+

TODO: supported geometries: channel, pipe, blocks, fluid

+
+
+

6   Performance Results

+

This section lists performance values measured on several machines for +different kernels and geometries and double precision floating point data/arithmetic. +The RFM column denotes the expected performance as predicted by the +Roofline performance model [williams-2008]. +For performance prediction of each kernel a memory bandwidth benchmark is used +which mimics the kernel's memory access pattern and the kernel's loop balance +(see [kernels] for details).
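For a memory bound kernel the Roofline prediction reduces to the attainable memory bandwidth divided by the loop balance. The C sketch below illustrates this relation only. It ignores the compute ceiling of the model, assumes 1 GB/s = 10^9 byte/s, and the function name `roofline_mflups` is hypothetical. Which bandwidth benchmark belongs to which kernel is also an assumption here.

```c
#include <assert.h>

/*
 * Predicted performance in million fluid lattice node updates per
 * second (MFLUP/s) for a purely memory bound kernel.
 * bandwidth_gbs: measured memory bandwidth in GB/s (10^9 byte/s).
 * loop_balance:  bytes transferred per fluid lattice node update.
 */
static double roofline_mflups(double bandwidth_gbs, double loop_balance)
{
    assert(bandwidth_gbs > 0.0);
    assert(loop_balance > 0.0);

    double flups = bandwidth_gbs * 1.0e9 / loop_balance;  /* FLUP/s */
    return flups / 1.0e6;                                 /* MFLUP/s */
}
```

Using the Haswell bandwidth figures from the machine specifications, `roofline_mflups(44.0, 304.0)` gives roughly 145 MFLUP/s for a kernel with a loop balance of 304 B/FLUP, and `roofline_mflups(47.3, 456.0)` gives roughly 104 MFLUP/s for a loop balance of 456 B/FLUP.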

+
+

6.1   Machine Specifications

+

Ivy Bridge, Intel Xeon E5-2660 v2

+
    +
  • Ivy Bridge architecture, AVX
  • +
  • 10 cores, 2.2 GHz
  • +
  • SMT enabled
  • +
  • memory bandwidth:
      +
    • copy-19 32.7 GB/s
    • +
    • copy-19-nt-sl 35.6 GB/s
    • +
    • update-19 37.4 GB/s
    • +
    +
  • +
+

Haswell, Intel Xeon E5-2695 v3

+
    +
  • Haswell architecture, AVX2, FMA
  • +
  • 14 cores, 2.3 GHz
  • +
  • 2 x 7 cores in cluster-on-die (CoD) mode enabled
  • +
  • SMT enabled
  • +
  • memory bandwidth:
      +
    • copy-19 47.3 GB/s
    • +
    • copy-19-nt-sl 47.1 GB/s
    • +
    • update-19 44.0 GB/s
    • +
    +
  • +
+

Broadwell, Intel Xeon E5-2630 v4

+
    +
  • Broadwell architecture, AVX2, FMA
  • +
  • 10 cores, 2.2 GHz
  • +
  • SMT disabled
  • +
  • memory bandwidth:
      +
    • copy-19 48.0 GB/s
    • +
    • copy-19-nt-sl 48.2 GB/s
    • +
    • update-19 51.1 GB/s
    • +
    +
  • +
+

Skylake, Intel Xeon Gold 6148

+

NOTE: currently we only use AVX2 intrinsics.

+
    +
  • Skylake server architecture, AVX2, AVX512, 2 FMA units
  • +
  • 20 cores, 2.4 GHz
  • +
  • SMT enabled
  • +
  • memory bandwidth:
      +
    • copy-19 89.7 GB/s
    • +
    • copy-19-nt-sl 92.4 GB/s
    • +
    • update-19 93.6 GB/s
    • +
    +
  • +
+

Zen, AMD EPYC 7451

+
    +
  • Zen architecture, AVX2, FMA
  • +
  • 24 cores, 2.3 GHz
  • +
  • SMT enabled
  • +
  • memory bandwidth:
      +
    • copy-19 111.9 GB/s
    • +
    • copy-19-nt-sl 111.7 GB/s
    • +
    • update-19 109.2 GB/s
    • +
    +
  • +
+

Zen, AMD Ryzen 7 1700X

+
    +
  • Zen architecture, AVX2, FMA
  • +
  • 8 cores, 3.4 GHz
  • +
  • SMT enabled
  • +
  • memory bandwidth:
      +
    • copy-19 27.2 GB/s
    • +
    • copy-19-nt-sl 27.1 GB/s
    • +
    • update-19 26.1 GB/s
    • +
    +
  • +
+
+
+

6.2   Single Socket Results

+
    +
  • Geometry dimensions are 500x100x100 nodes for all measurements.
  • +
  • Note the different scaling on the y axis of the plots!
  • +
+ +++ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Ivy Bridge, Intel Xeon E5-2660 v2, Double Precision
perf_emmy_dp
Ivy Bridge, Intel Xeon E5-2660 v2, Single Precision
perf_emmy_sp
Haswell, Intel Xeon E5-2695 v3, Double Precision
perf_hasep1_dp
Haswell, Intel Xeon E5-2695 v3, Single Precision
perf_hasep1_sp
Broadwell, Intel Xeon E5-2630 v4, Double Precision
perf_meggie_dp
Broadwell, Intel Xeon E5-2630 v4, Single Precision
perf_meggie_sp
Skylake, Intel Xeon Gold 6148, Double Precision, NOTE: currently we only use AVX2 intrinsics.
perf_skylakesp2_dp
Skylake, Intel Xeon Gold 6148, Single Precision, NOTE: currently we only use AVX2 intrinsics.
perf_skylakesp2_sp
Zen, AMD Ryzen 7 1700X, Double Precision
perf_summitridge1_dp
Zen, AMD Ryzen 7 1700X, Single Precision
perf_summitridge1_sp
Zen, AMD EPYC 7451, Double Precision
perf_naples1_dp
Zen, AMD EPYC 7451, Single Precision
perf_naples1_sp
+
+
+
+

7   Licence

+

The Lattice Boltzmann Benchmark Kernels are licensed under GPLv3.

+
+
+

8   Acknowledgements

+

This work was funded by BMBF, grant no. 01IH15003A (project SKAMPY).

+

This work was funded by KONWIHR project OMI4PAPS.

+
+
+

9   Bibliography

+ + + + + +
[ginzburg-2008]I. Ginzburg, F. Verhaeghe, and D. d'Humières. +Two-relaxation-time lattice Boltzmann scheme: About parametrization, velocity, pressure and mixed boundary conditions. +Commun. Comput. Phys., 3(2):427-478, 2008.
+ + + + + +
[williams-2008]S. Williams, A. Waterman, and D. Patterson. +Roofline: an insightful visual performance model for multicore architectures. +Commun. ACM, 52(4):65-76, Apr 2009. doi:10.1145/1498765.1498785
+

Document was generated at 2018-01-09 11:54.

+
+
+ + diff --git a/doc/main.rst b/doc/main.rst index 0eaa5ed..3d9ca9f 100644 --- a/doc/main.rst +++ b/doc/main.rst @@ -1,36 +1,31 @@ -.. # -------------------------------------------------------------------------- - # - # Copyright - # Markus Wittmann, 2016-2017 - # RRZE, University of Erlangen-Nuremberg, Germany - # markus.wittmann -at- fau.de or hpc -at- rrze.fau.de - # - # Viktor Haag, 2016 - # LSS, University of Erlangen-Nuremberg, Germany - # - # This file is part of the Lattice Boltzmann Benchmark Kernels (LbmBenchKernels). - # - # LbmBenchKernels is free software: you can redistribute it and/or modify - # it under the terms of the GNU General Public License as published by - # the Free Software Foundation, either version 3 of the License, or - # (at your option) any later version. - # - # LbmBenchKernels is distributed in the hope that it will be useful, - # but WITHOUT ANY WARRANTY; without even the implied warranty of - # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the - # GNU General Public License for more details. - # - # You should have received a copy of the GNU General Public License - # along with LbmBenchKernels. If not, see . - # - # -------------------------------------------------------------------------- + +| Copyright +| Markus Wittmann, 2016-2018 +| RRZE, University of Erlangen-Nuremberg, Germany +| markus.wittmann -at- fau.de or hpc -at- rrze.fau.de +| +| Viktor Haag, 2016 +| LSS, University of Erlangen-Nuremberg, Germany +| +| This file is part of the Lattice Boltzmann Benchmark Kernels (LbmBenchKernels). +| +| LbmBenchKernels is free software: you can redistribute it and/or modify +| it under the terms of the GNU General Public License as published by +| the Free Software Foundation, either version 3 of the License, or +| (at your option) any later version. 
+| +| LbmBenchKernels is distributed in the hope that it will be useful, +| but WITHOUT ANY WARRANTY; without even the implied warranty of +| MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +| GNU General Public License for more details. +| +| You should have received a copy of the GNU General Public License +| along with LbmBenchKernels. If not, see . .. title:: LBM Benchmark Kernels Documentation -=================================== -LBM Benchmark Kernels Documentation -=================================== +**LBM Benchmark Kernels Documentation** .. sectnum:: .. contents:: @@ -47,7 +42,7 @@ EXPERIMENTS.** Currently all kernels utilize a D3Q19 discretization and the two-relaxation-time (TRT) collision operator [ginzburg-2008]_. -All operations are carried out in double precision arithmetic. +All operations are carried out in double or single precision arithmetic. Compilation =========== @@ -120,6 +115,18 @@ both configuration can be chosen via ``CONFIG=linux-gcc`` or ``CONFIG=linux-intel``. +Floating Point Precision +------------------------ + +As default double precision data types are used for storing PDFs and floating +point constants. Furthermore, this is the default for the intrincis kernels. +With the ``PRECISION=sp`` variable this can be changed to single precision. :: + + make PRECISION=sp # build for single precision kernels + + make PRECISION=dp # build for double precision kernels (defalt) + + Cleaning -------- @@ -150,6 +157,7 @@ BUILD debug, release release debug: no optimization, debug CONFIG linux-gcc, linux-intel linux-intel Select GCC or Intel compiler. ISA avx, sse avx Determines which ISA extension is used for macro definitions of the intrinsics. This is *not* the architecture the compiler generates code for. OPENMP on, off on OpenMP, i.\,e.\. threading support. +PRECISION dp, sp dp Floating point precision used for data type, arithmetic, and intrincics. STATISTICS on, off off View statistics, like density etc, during simulation. 
TARCH -- -- Via TARCH the architecture the compiler generates code for can be overridden. The value depends on the chosen compiler. VERIFICATION on, off off Turn verification on/off. @@ -197,12 +205,12 @@ If an option is specified multiple times the last one overrides previous ones. This holds also true for ``-verify`` which sets geometry dimensions, iterations, etc, which can afterward be override, e.g.: :: - $ bin/lbmbenchk-linux-intel-release -verfiy -dims 32x32x32 + $ bin/lbmbenchk-linux-intel-release-dp -verfiy -dims 32x32x32 Kernel specific parameters can be obtained via selecting the specific kernel and passing ``-h`` as parameter: :: - $ bin/lbmbenchk-linux-intel-release -kernel kernel-name -- -h + $ bin/lbmbenchk-linux-intel-release-dp -kernel kernel-name -- -h ... Kernel parameters: [-blk ] [-blk-[xyz] ] @@ -210,7 +218,7 @@ and passing ``-h`` as parameter: :: A list of all available kernels can be obtained via ``-list``: :: - $ ../bin/lbmbenchk-linux-gcc-debug -list + $ ../bin/lbmbenchk-linux-gcc-debug-dp -list Lattice Boltzmann Benchmark Kernels (LbmBenchKernels) Copyright (C) 2016, 2017 LSS, RRZE This program comes with ABSOLUTELY NO WARRANTY; for details see LICENSE. This is free software, and you are welcome to redistribute it under certain conditions. @@ -241,37 +249,45 @@ Kernels The following list shortly describes available kernels: -- push-soa/push-aos/pull-soa/pull-aos: +- **push-soa/push-aos/pull-soa/pull-aos**: Unoptimized kernels (but stream/collide are already fused) using two grids as source and destination. Implement push/pull semantics as well structure of arrays (soa) or array of structures (aos) layout. -- blk-push-soa/blk-push-aos/blk-pull-soa/blk-pull-aos: +- **blk-push-soa/blk-push-aos/blk-pull-soa/blk-pull-aos**: The same as the unoptimized kernels without the blk prefix, except that they support spatial blocking, i.e. loop blocking of the three loops used to iterate over the lattice. Here manual work sharing for OpenMP is used. 
-- list-push-soa/list-push-aos/list-pull-soa/list-pull-aos: +- **aa-aos/aa-soa**: + Straight forward implementation of AA pattern on full array with blocking support. + Manual work sharing for OpenMP is used. Domain is partitioned only along the x dimension. + +- **aa-vec-soa/aa-vec-sl-soa**: + Optimized AA kernel with intrinsics on full array. aa-vec-sl-soa uses only + one loop for iterating over the lattice instead of three nested ones. + +- **list-push-soa/list-push-aos/list-pull-soa/list-pull-aos**: The same as the unoptimized kernels without the list prefix, but for indirect addressing. Here only a 1D vector of is used to store the fluid nodes, omitting the obstacles. An adjacency list is used to recover the neighborhood associations. -- list-pull-split-nt-1s-soa/list-pull-split-nt-2s-soa: +- **list-pull-split-nt-1s-soa/list-pull-split-nt-2s-soa**: Optimized variant of list-pull-soa. Chunks of the lattice are processed as once. Postcollision values are written back via nontemporal stores in 18 (1s) or 9 (2s) loops. -- list-aa-aos/list-aa-soa: +- **list-aa-aos/list-aa-soa**: Unoptimized implementation of the AA pattern for the 1D vector with adjacency list. Supported are array of structures (aos) and structure of arrays (soa) data layout is supported. -- list-aa-ria-soa: +- **list-aa-ria-soa**: Implementation of AA pattern with intrinsics for the 1D vector with adjacency list. Furthermore it contains a vectorized even time step and run length coding to reduce the loop balance of the odd time step. -- list-aa-pv-soa: +- **list-aa-pv-soa**: All optimizations of list-aa-ria-soa. Additional with partial vectorization of the odd time step. @@ -283,7 +299,7 @@ arrays (soa) counter parts. The following table summarizes the properties of the kernels. Here **D** means direct addressing, i.e. full array, **I** means indirect addressing, i.e. 1D vector with adjacency list, **x** means supported, whereas **--** means unsupported. 
-The loop balance B_l is computed for D3Q19 model with double precision floating +The loop balance B_l is computed for D3Q19 model with **double precision** floating point for PDFs (8 byte) and 4 byte integers for the index (adjacency list). As list-aa-ria-soa and list-aa-pv-soa support run length coding their effective loop balance depends on the geometry. The effective loop balance is printed @@ -301,6 +317,10 @@ blk-push-soa OS SoA D x x 456 blk-push-aos OS AoS D x x 456 blk-pull-soa OS SoA D x x 456 blk-pull-aos OS AoS D x x 456 +aa-soa AA SoA D x x 304 +aa-aos AA AoS D x x 304 +aa-vec-soa AA SoA D x x 304 +aa-vec-sl-soa AA SoA D x x 304 list-push-soa OS SoA I x x 528 list-push-aos OS AoS I x x 528 list-pull-soa OS SoA I x x 528 @@ -361,7 +381,7 @@ Pinning During benchmarking pinning should be used via the ``-pin`` parameter. Running a benchmark with 10 threads and pin them to the first 10 cores works like :: - $ bin/lbmbenchk-linux-intel-release ... -t 10 -pin $(seq -s , 0 9) + $ bin/lbmbenchk-linux-intel-release-dp ... -t 10 -pin $(seq -s , 0 9) General Remarks @@ -447,127 +467,144 @@ Performance Results =================== The sections lists performance values measured on several machines for -different kernels and geometries. +different kernels and geometries and **double precision** floating point data/arithmetic. The **RFM** column denotes the expected performance as predicted by the Roofline performance model [williams-2008]_. For performance prediction of each kernel a memory bandwidth benchmark is used which mimics the kernels memory access pattern and the kernel's loop balance (see [kernels]_ for details). 
-Haswell, Intel Xeon E5-2695 v3 ------------------------------- +Machine Specifications +---------------------- + +**Ivy Bridge, Intel Xeon E5-2660 v2** + +- Ivy Bridge architecture, AVX +- 10 cores, 2.2 GHz +- SMT enabled +- memoy bandwidth: + + - copy-19 32.7 GB/s + - copy-19-nt-sl 35.6 GB/s + - update-19 37.4 GB/s + +**Haswell, Intel Xeon E5-2695 v3** - Haswell architecture, AVX2, FMA -- 14 cores, 2,3 GHz +- 14 cores, 2.3 GHz - 2 x 7 cores in cluster-on-die (CoD) mode enabled - SMT enabled +- memory bandwidth: -memory bandwidth: - -- copy-19 47.3 GB/s -- copy-19-nt-sl 47.1 GB/s -- update-19 44.0 GB/s - -geometry dimensions: 500x100x100 - -========================= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ===== -kernel pipe blocks-2 blocks-4 blocks-6 blocks-8 blocks-10 blocks-15 blocks-16 blocks-20 blocks-25 blocks-32 RFM -========================= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ===== -blk-push-aos 58.82 49.85 57.34 59.90 61.37 62.17 65.30 64.00 67.54 64.46 69.69 104 -blk-push-soa 32.32 33.46 34.02 34.64 35.06 35.04 36.31 35.44 37.20 35.14 37.95 104 -blk-pull-aos 56.97 51.41 56.09 57.92 59.98 59.83 63.37 61.55 65.50 63.11 67.02 104 -blk-pull-soa 49.29 46.23 47.50 51.97 51.27 49.52 55.23 53.13 54.50 49.79 57.90 104 -aa-aos 91.35 66.14 76.80 84.76 83.63 91.36 93.46 92.62 93.91 92.25 92.93 145 -aa-soa 75.51 65.68 70.94 71.36 73.83 75.46 74.84 79.48 83.28 77.70 82.72 145 -aa-vec-soa 93.85 83.44 91.58 93.96 94.35 96.62 101.76 96.72 106.37 102.60 110.28 145 -list-push-aos 80.29 80.97 80.95 81.10 81.37 82.44 81.77 81.49 80.72 81.93 80.93 83 -list-push-soa 47.52 42.65 45.28 46.64 43.46 40.59 44.94 46.55 41.53 45.98 44.86 83 -list-pull-aos 85.30 82.97 86.43 83.42 86.33 83.70 86.43 83.77 83.10 85.89 84.44 83 -list-pull-soa 62.12 63.61 63.28 61.32 66.72 62.65 64.82 60.49 58.01 64.46 62.52 83 -list-pull-split-nt-1s-soa 121.35 
113.77 115.29 113.54 117.00 116.46 114.78 114.54 110.83 112.67 117.85 125 -list-pull-split-nt-2s-soa 118.09 110.48 112.55 113.18 113.44 111.85 109.27 114.41 110.28 111.78 113.74 125 -list-aa-aos 121.28 118.63 119.00 118.50 121.99 119.11 118.83 121.47 121.62 126.18 120.12 129 -list-aa-soa 126.34 116.90 129.45 127.12 129.41 121.42 126.19 126.76 126.70 124.40 125.22 129 -list-aa-ria-soa 133.68 121.82 126.04 128.46 131.15 132.25 128.78 133.50 126.69 124.40 130.37 145 -list-aa-pv-soa 146.22 124.39 130.73 136.29 137.61 131.21 138.65 138.78 127.02 132.40 138.37 145 -========================= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ===== - - -Broadwell, Intel Xeon E5-2630 v4 --------------------------------- + - copy-19 47.3 GB/s + - copy-19-nt-sl 47.1 GB/s + - update-19 44.0 GB/s + + +**Broadwell, Intel Xeon E5-2630 v4** - Broadwell architecture, AVX2, FMA - 10 cores, 2.2 GHz - SMT disabled +- memory bandwidth: + + - copy-19 48.0 GB/s + - copy-nt-sl-19 48.2 GB/s + - update-19 51.1 GB/s + +**Skylake, Intel Xeon Gold 6148** -memory bandwidth: - -- copy-19 48.0 GB/s -- copy-nt-sl-19 48.2 GB/s -- update-19 51.1 GB/s - -geometry dimensions: 500x100x100 - -========================= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ======= -kernel pipe blocks-2 blocks-4 blocks-6 blocks-8 blocks-10 blocks-15 blocks-16 blocks-20 blocks-25 blocks-32 RFM -========================= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ======= -blk-push-aos 55.75 47.62 54.57 57.10 58.49 59.00 61.72 60.56 64.05 61.10 66.03 105 -blk-push-soa 30.06 31.09 32.13 32.54 32.74 32.72 33.81 33.19 34.90 33.21 35.75 105 -blk-pull-aos 53.80 48.61 53.08 54.99 56.08 56.68 59.20 58.12 61.49 58.71 63.45 105 -blk-pull-soa 46.96 46.61 48.84 49.70 50.33 50.46 52.36 51.39 54.20 51.61 55.71 105 -aa-aos 91.40 66.99 78.47 
83.38 86.62 88.62 92.98 91.54 97.08 94.93 98.90 168 -aa-soa 83.01 69.96 75.85 77.72 79.01 79.29 82.38 80.11 85.70 83.91 87.69 168 -aa-vec-soa 112.03 96.52 105.32 109.76 112.55 113.82 120.55 118.37 126.30 121.37 131.94 168 -list-push-aos 75.13 74.18 75.20 75.42 75.24 75.99 75.80 75.80 75.54 76.22 76.21 97 -list-push-soa 40.99 38.14 39.00 38.89 38.89 39.67 39.87 39.28 39.35 40.08 40.13 97 -list-pull-aos 82.07 82.88 83.29 83.09 83.32 83.49 82.82 82.88 83.32 82.60 82.93 97 -list-pull-soa 62.07 60.40 61.89 61.39 62.43 60.90 60.48 62.80 62.50 61.10 60.38 97 -list-pull-split-nt-1s-soa 125.81 120.60 121.96 122.34 122.86 123.53 123.64 123.67 125.94 124.09 123.69 128 -list-pull-split-nt-2s-soa 122.79 117.16 118.86 119.16 119.56 119.99 120.01 120.03 122.64 120.57 120.39 128 -list-aa-aos 128.13 127.41 129.31 129.07 129.79 129.63 129.67 129.94 129.12 128.41 129.72 150 -list-aa-soa 141.60 139.78 141.58 142.16 141.94 141.31 142.37 142.25 142.43 141.40 142.26 150 -list-aa-ria-soa 141.82 134.88 140.15 140.72 141.67 140.51 141.18 141.29 142.97 141.94 143.25 168 -list-aa-pv-soa 164.79 140.95 159.24 161.78 162.40 163.04 164.69 164.38 165.11 165.75 166.09 168 -========================= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ======= - - -Skylake, Intel Xeon Gold 6148 ------------------------------ - -- Skylake architecture, AVX2, FMA, AVX512 +NOTE: currently we only use AVX2 intrinsics. 
+ +- Skylake server architecture, AVX2, AVX512, 2 FMA units - 20 cores, 2.4 GHz - SMT enabled +- memory bandwidth: + + - copy-19 89.7 GB/s + - copy-19-nt-sl 92.4 GB/s + - update-19 93.6 GB/s + +**Zen, AMD EPYC 7451** + +- Zen architecture, AVX2, FMA +- 24 cores, 2.3 GHz +- SMT enabled +- memory bandwidth: + + - copy-19 111.9 GB/s + - copy-19-nt-sl 111.7 GB/s + - update-19 109.2 GB/s + +**Zen, AMD Ryzen 7 1700X** + +- Zen architecture, AVX2, FMA +- 8 cores, 3.4 GHz +- SMT enabled +- memory bandwidth: + + - copy-19 27.2 GB/s + - copy-19-nt-sl 27.1 GB/s + - update-19 26.1 GB/s + +Single Socket Results +--------------------- + +- Geometry dimensions are for all measurements 500x100x100 nodes. +- Note the **different scaling on the y axis** of the plots! + +.. |perf_emmy_dp| image:: images/benchmark-emmy-dp.png + :scale: 50 % +.. |perf_emmy_sp| image:: images/benchmark-emmy-sp.png + :scale: 50 % +.. |perf_hasep1_dp| image:: images/benchmark-hasep1-dp.png + :scale: 50 % +.. |perf_hasep1_sp| image:: images/benchmark-hasep1-sp.png + :scale: 50 % +.. |perf_meggie_dp| image:: images/benchmark-meggie-dp.png + :scale: 50 % +.. |perf_meggie_sp| image:: images/benchmark-meggie-sp.png + :scale: 50 % +.. |perf_skylakesp2_dp| image:: images/benchmark-skylakesp2-dp.png + :scale: 50 % +.. |perf_skylakesp2_sp| image:: images/benchmark-skylakesp2-sp.png + :scale: 50 % +.. |perf_summitridge1_dp| image:: images/benchmark-summitridge1-dp.png + :scale: 50 % +.. |perf_summitridge1_sp| image:: images/benchmark-summitridge1-sp.png + :scale: 50 % +.. |perf_naples1_dp| image:: images/benchmark-naples1-dp.png + :scale: 50 % +.. |perf_naples1_sp| image:: images/benchmark-naples1-sp.png + :scale: 50 % + +.. 
list-table:: + + * - Ivy Bridge, Intel Xeon E5-2660 v2, Double Precision + * - |perf_emmy_dp| + * - Ivy Bridge, Intel Xeon E5-2660 v2, Single Precision + * - |perf_emmy_sp| + * - Haswell, Intel Xeon E5-2695 v3, Double Precision + * - |perf_hasep1_dp| + * - Haswell, Intel Xeon E5-2695 v3, Single Precision + * - |perf_hasep1_sp| + * - Broadwell, Intel Xeon E5-2630 v4, Double Precision + * - |perf_meggie_dp| + * - Broadwell, Intel Xeon E5-2630 v4, Single Precision + * - |perf_meggie_sp| + * - Skylake, Intel Xeon Gold 6148, Double Precision, **NOTE: currently we only use AVX2 intrinsics.** + * - |perf_skylakesp2_dp| + * - Skylake, Intel Xeon Gold 6148, Single Precision, **NOTE: currently we only use AVX2 intrinsics.** + * - |perf_skylakesp2_sp| + * - Zen, AMD Ryzen 7 1700X, Double Precision + * - |perf_summitridge1_dp| + * - Zen, AMD Ryzen 7 1700X, Single Precision + * - |perf_summitridge1_sp| + * - Zen, AMD EPYC 7451, Double Precision + * - |perf_naples1_dp| + * - Zen, AMD EPYC 7451, Single Precision + * - |perf_naples1_sp| -memory bandwidth: - -- copy-19 89.7 GB/s -- copy-19-nt-sl 92.4 GB/s -- update-19 93.6 GB/s - -geometry dimensions: 500x100x100 - - -========================= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= === -kernel pipe blocks-2 blocks-4 blocks-6 blocks-8 blocks-10 blocks-15 blocks-16 blocks-20 blocks-25 blocks-32 RFM -========================= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= === -blk-push-aos 113.01 93.99 108.98 114.65 117.87 119.47 124.95 122.46 129.29 123.87 133.01 197 -blk-push-soa 100.21 98.87 103.63 105.56 107.02 107.27 111.61 109.83 116.16 110.51 110.29 197 -blk-pull-aos 118.45 102.54 114.12 117.82 122.69 124.31 130.58 127.85 135.72 129.65 139.94 197 -blk-pull-soa 82.60 83.36 87.13 88.39 88.84 88.96 92.48 90.93 95.79 91.92 98.64 197 -aa-aos 171.32 125.43 147.73 157.70 163.35 167.25 175.39 174.20 
182.54 173.67 187.76 308 -aa-soa 180.85 152.39 165.84 152.59 171.90 175.76 184.94 182.34 189.43 180.30 193.54 308 -aa-vec-soa 208.03 181.51 195.86 203.41 209.08 212.34 224.05 219.49 234.31 225.92 245.22 308 -list-push-aos 158.81 164.67 162.93 163.05 165.22 164.31 164.66 160.78 164.07 165.19 164.06 177 -list-push-soa 134.60 110.44 110.17 132.01 132.95 133.46 134.37 134.33 135.12 134.91 137.87 177 -list-pull-aos 169.61 170.03 170.89 170.90 171.20 171.60 172.09 171.95 169.48 172.08 171.02 177 -list-pull-soa 120.50 116.73 118.62 118.00 120.99 118.15 117.17 121.41 120.83 120.00 118.74 177 -list-pull-split-nt-1s-soa 225.59 224.18 225.10 226.34 226.01 230.37 227.50 228.42 227.39 231.65 227.35 246 -list-pull-split-nt-2s-soa 219.20 214.63 217.61 218.13 219.07 221.01 219.88 220.09 220.62 221.68 220.58 246 -list-aa-aos 241.39 239.27 239.53 242.56 242.46 243.00 242.91 242.46 241.24 242.96 241.52 275 -list-aa-soa 273.73 268.49 268.48 271.79 275.29 274.56 277.18 272.67 274.21 275.24 278.21 275 -list-aa-ria-soa 288.42 261.89 273.26 284.84 283.88 288.29 290.72 289.81 293.36 290.75 292.93 308 -list-aa-pv-soa 303.35 267.21 289.18 294.96 294.36 298.16 300.45 301.71 302.37 302.88 304.46 308 -========================= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= ========= === Licence ======= diff --git a/src/Base.h b/src/Base.h index ae61082..a848cf8 100644 --- a/src/Base.h +++ b/src/Base.h @@ -27,6 +27,8 @@ #ifndef __BASE_H__ #define __BASE_H__ +#include "Config.h" + #include #include #include diff --git a/src/BenchKernelD3Q19.c b/src/BenchKernelD3Q19.c index bd6d43c..4852948 100644 --- a/src/BenchKernelD3Q19.c +++ b/src/BenchKernelD3Q19.c @@ -43,8 +43,8 @@ void FNAME(D3Q19Kernel)(LatticeDesc * ld, KernelData * kernelData, CaseData * cd Assert(kernelData != NULL); Assert(cd != NULL); - Assert(cd->Omega > 0.0); - Assert(cd->Omega < 2.0); + Assert(cd->Omega > F(0.0)); + Assert(cd->Omega < F(2.0)); KernelData * kd = (KernelData 
*)kernelData; @@ -61,20 +61,24 @@ void FNAME(D3Q19Kernel)(LatticeDesc * ld, KernelData * kernelData, CaseData * cd PdfT omega = cd->Omega; PdfT omegaEven = omega; -// PdfT omegaOdd = 8.0*((2.0-omegaEven)/(8.0-omegaEven)); //"standard" trt odd relaxation parameter - PdfT magicParam = 1.0/12.0; // 1/4: best stability; 1/12: removes third-order advection error (best advection); 1/6: removes fourth-order diffusion error (best diffusion); 3/16: exact location of bounce back for poiseuille flow - PdfT omegaOdd = 1.0/( 0.5 + magicParam/(1.0/omega - 0.5) ); + // PdfT omegaOdd = 8.0*((F(2.0)-omegaEven)/(8.0-omegaEven)); //"standard" trt odd relaxation parameter + PdfT magicParam = F(1.0) / F(12.0); + // 1/ 4: best stability; + // 1/12: removes third-order advection error (best advection); + // 1/ 6: removes fourth-order diffusion error (best diffusion); + // 3/16: exact location of bounce back for poiseuille flow + PdfT omegaOdd = F(1.0) / (F(0.5) + magicParam / (F(1.0) / omega - F(0.5))); - PdfT evenPart = 0.0; - PdfT oddPart = 0.0; - PdfT dir_indep_trm = 0.0; + PdfT evenPart = F(0.0); + PdfT oddPart = F(0.0); + PdfT dir_indep_trm = F(0.0); - PdfT w_0 = 1.0 / 3.0; - PdfT w_1 = 1.0 / 18.0; - PdfT w_2 = 1.0 / 36.0; + PdfT w_0 = F(1.0) / F( 3.0); + PdfT w_1 = F(1.0) / F(18.0); + PdfT w_2 = F(1.0) / F(36.0); - PdfT w_1_x3 = w_1 * 3.0; PdfT w_1_nine_half = w_1 * 9.0/2.0; PdfT w_1_indep = 0.0; - PdfT w_2_x3 = w_2 * 3.0; PdfT w_2_nine_half = w_2 * 9.0/2.0; PdfT w_2_indep = 0.0; + PdfT w_1_x3 = w_1 * F(3.0); PdfT w_1_nine_half = w_1 * F(9.0)/F(2.0); PdfT w_1_indep = F(0.0); + PdfT w_2_x3 = w_2 * F(3.0); PdfT w_2_nine_half = w_2 * F(9.0)/F(2.0); PdfT w_2_indep = F(0.0); PdfT ux, uy, uz, ui; PdfT dens; @@ -102,7 +106,7 @@ void FNAME(D3Q19Kernel)(LatticeDesc * ld, KernelData * kernelData, CaseData * cd X_LIKWID_START("os"); #ifdef _OPENMP - #pragma omp parallel for collapse(3) default(none) \ + #pragma omp parallel for collapse(2) default(none) \ shared(gDims,src, dst, w_0, w_1, w_2, 
omegaEven, omegaOdd, \ w_1_x3, w_2_x3, w_1_nine_half, w_2_nine_half, cd, \ oX, oY, oZ, nX, nY, nZ) \ @@ -114,9 +118,13 @@ void FNAME(D3Q19Kernel)(LatticeDesc * ld, KernelData * kernelData, CaseData * cd pdf_B, pdf_BN, pdf_BE, pdf_BS, pdf_BW, \ evenPart, oddPart, w_1_indep, w_2_indep) #endif - for (int x = oX; x < nX + oX; ++x) { for (int y = oY; y < nY + oY; ++y) { + #ifdef INTEL_OPT_DIRECTIVES + #pragma ivdep + #pragma vector always + #pragma simd + #endif for (int z = oZ; z < nZ + oZ; ++z) { #define I(x, y, z, dir) P_INDEX_5(gDims, (x), (y), (z), (dir)) @@ -145,9 +153,9 @@ void FNAME(D3Q19Kernel)(LatticeDesc * ld, KernelData * kernelData, CaseData * cd #ifdef LID_DRIVEN_CAVITY if (z == nZ - 4 + oZ && x > 3 + oX && x < (nX - 4 + oX) && y > 3 + oY && y < (nY - 4 + oY)) { - ux = 0.1 * 0.577; - uy = 0.0; - uz = 0.0; + ux = F(0.1 * 0.577); + uy = F(0.0); + uz = F(0.0); } else { #endif @@ -168,7 +176,7 @@ void FNAME(D3Q19Kernel)(LatticeDesc * ld, KernelData * kernelData, CaseData * cd pdf_T + pdf_TN + pdf_TE + pdf_TS + pdf_TW + pdf_B + pdf_BN + pdf_BE + pdf_BS + pdf_BW; - dir_indep_trm = dens - (ux * ux + uy * uy + uz * uz)*3.0/2.0; + dir_indep_trm = dens - (ux * ux + uy * uy + uz * uz) * F(3.0) / F(2.0); #ifdef PROP_MODEL_PUSH @@ -179,20 +187,20 @@ void FNAME(D3Q19Kernel)(LatticeDesc * ld, KernelData * kernelData, CaseData * cd w_1_indep = w_1*dir_indep_trm; ui = uy; - evenPart = omegaEven*( 0.5*(pdf_N + pdf_S) - ui*ui*w_1_nine_half - w_1_indep ); - oddPart = omegaOdd*(0.5*(pdf_N - pdf_S) - ui*w_1_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_N + pdf_S) - ui*ui*w_1_nine_half - w_1_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_N - pdf_S) - ui*w_1_x3 ); dst[I(x, y + 1, z, D3Q19_N)] = pdf_N - evenPart - oddPart; dst[I(x, y - 1, z, D3Q19_S)] = pdf_S - evenPart + oddPart; ui = ux; - evenPart = omegaEven*( 0.5*(pdf_E + pdf_W) - ui*ui*w_1_nine_half - w_1_indep ); - oddPart = omegaOdd*(0.5*(pdf_E - pdf_W) - ui*w_1_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_E + pdf_W) - 
ui*ui*w_1_nine_half - w_1_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_E - pdf_W) - ui*w_1_x3 ); dst[I(x + 1, y, z, D3Q19_E)] = pdf_E - evenPart - oddPart; dst[I(x - 1, y, z, D3Q19_W)] = pdf_W - evenPart + oddPart; ui = uz; - evenPart = omegaEven*( 0.5*(pdf_T + pdf_B) - ui*ui*w_1_nine_half - w_1_indep ); - oddPart = omegaOdd*(0.5*(pdf_T - pdf_B) - ui*w_1_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_T + pdf_B) - ui*ui*w_1_nine_half - w_1_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_T - pdf_B) - ui*w_1_x3 ); dst[I(x, y, z + 1, D3Q19_T)] = pdf_T - evenPart - oddPart; dst[I(x, y, z - 1, D3Q19_B)] = pdf_B - evenPart + oddPart; @@ -200,38 +208,38 @@ void FNAME(D3Q19Kernel)(LatticeDesc * ld, KernelData * kernelData, CaseData * cd w_2_indep = w_2*dir_indep_trm; ui = -ux + uy; - evenPart = omegaEven*( 0.5*(pdf_NW + pdf_SE) - ui*ui*w_2_nine_half - w_2_indep ); - oddPart = omegaOdd*(0.5*(pdf_NW - pdf_SE) - ui*w_2_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_NW + pdf_SE) - ui*ui*w_2_nine_half - w_2_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_NW - pdf_SE) - ui*w_2_x3 ); dst[I(x - 1, y + 1, z, D3Q19_NW)] = pdf_NW - evenPart - oddPart; dst[I(x + 1, y - 1, z, D3Q19_SE)] = pdf_SE - evenPart + oddPart; ui = ux + uy; - evenPart = omegaEven*( 0.5*(pdf_NE + pdf_SW) - ui*ui*w_2_nine_half - w_2_indep ); - oddPart = omegaOdd*(0.5*(pdf_NE - pdf_SW) - ui*w_2_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_NE + pdf_SW) - ui*ui*w_2_nine_half - w_2_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_NE - pdf_SW) - ui*w_2_x3 ); dst[I(x + 1, y + 1, z, D3Q19_NE)] = pdf_NE - evenPart - oddPart; dst[I(x - 1, y - 1, z, D3Q19_SW)] = pdf_SW - evenPart + oddPart; ui = -ux + uz; - evenPart = omegaEven*( 0.5*(pdf_TW + pdf_BE) - ui*ui*w_2_nine_half - w_2_indep ); - oddPart = omegaOdd*(0.5*(pdf_TW - pdf_BE) - ui*w_2_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_TW + pdf_BE) - ui*ui*w_2_nine_half - w_2_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_TW - pdf_BE) - ui*w_2_x3 ); dst[I(x - 1, y, z + 1, D3Q19_TW)] = pdf_TW - evenPart - 
oddPart; dst[I(x + 1, y, z - 1, D3Q19_BE)] = pdf_BE - evenPart + oddPart; ui = ux + uz; - evenPart = omegaEven*( 0.5*(pdf_TE + pdf_BW) - ui*ui*w_2_nine_half - w_2_indep ); - oddPart = omegaOdd*(0.5*(pdf_TE - pdf_BW) - ui*w_2_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_TE + pdf_BW) - ui*ui*w_2_nine_half - w_2_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_TE - pdf_BW) - ui*w_2_x3 ); dst[I(x + 1, y, z + 1, D3Q19_TE)] = pdf_TE - evenPart - oddPart; dst[I(x - 1, y, z - 1, D3Q19_BW)] = pdf_BW - evenPart + oddPart; ui = -uy + uz; - evenPart = omegaEven*( 0.5*(pdf_TS + pdf_BN) - ui*ui*w_2_nine_half - w_2_indep ); - oddPart = omegaOdd*(0.5*(pdf_TS - pdf_BN) - ui*w_2_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_TS + pdf_BN) - ui*ui*w_2_nine_half - w_2_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_TS - pdf_BN) - ui*w_2_x3 ); dst[I(x, y - 1, z + 1, D3Q19_TS)] = pdf_TS - evenPart - oddPart; dst[I(x, y + 1, z - 1, D3Q19_BN)] = pdf_BN - evenPart + oddPart; ui = uy + uz; - evenPart = omegaEven*( 0.5*(pdf_TN + pdf_BS) - ui*ui*w_2_nine_half - w_2_indep ); - oddPart = omegaOdd*(0.5*(pdf_TN - pdf_BS) - ui*w_2_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_TN + pdf_BS) - ui*ui*w_2_nine_half - w_2_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_TN - pdf_BS) - ui*w_2_x3 ); dst[I(x, y + 1, z + 1, D3Q19_TN)] = pdf_TN - evenPart - oddPart; dst[I(x, y - 1, z - 1, D3Q19_BS)] = pdf_BS - evenPart + oddPart; @@ -244,20 +252,20 @@ void FNAME(D3Q19Kernel)(LatticeDesc * ld, KernelData * kernelData, CaseData * cd w_1_indep = w_1*dir_indep_trm; ui = uy; - evenPart = omegaEven*( 0.5*(pdf_N + pdf_S) - ui*ui*w_1_nine_half - w_1_indep ); - oddPart = omegaOdd*(0.5*(pdf_N - pdf_S) - ui*w_1_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_N + pdf_S) - ui*ui*w_1_nine_half - w_1_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_N - pdf_S) - ui*w_1_x3 ); dst[I(x, y, z, D3Q19_N)] = pdf_N - evenPart - oddPart; dst[I(x, y, z, D3Q19_S)] = pdf_S - evenPart + oddPart; ui = ux; - evenPart = omegaEven*( 0.5*(pdf_E + pdf_W) - ui*ui*w_1_nine_half - 
w_1_indep ); - oddPart = omegaOdd*(0.5*(pdf_E - pdf_W) - ui*w_1_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_E + pdf_W) - ui*ui*w_1_nine_half - w_1_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_E - pdf_W) - ui*w_1_x3 ); dst[I(x, y, z, D3Q19_E)] = pdf_E - evenPart - oddPart; dst[I(x, y, z, D3Q19_W)] = pdf_W - evenPart + oddPart; ui = uz; - evenPart = omegaEven*( 0.5*(pdf_T + pdf_B) - ui*ui*w_1_nine_half - w_1_indep ); - oddPart = omegaOdd*(0.5*(pdf_T - pdf_B) - ui*w_1_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_T + pdf_B) - ui*ui*w_1_nine_half - w_1_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_T - pdf_B) - ui*w_1_x3 ); dst[I(x, y, z, D3Q19_T)] = pdf_T - evenPart - oddPart; dst[I(x, y, z, D3Q19_B)] = pdf_B - evenPart + oddPart; @@ -265,38 +273,38 @@ void FNAME(D3Q19Kernel)(LatticeDesc * ld, KernelData * kernelData, CaseData * cd w_2_indep = w_2*dir_indep_trm; ui = -ux + uy; - evenPart = omegaEven*( 0.5*(pdf_NW + pdf_SE) - ui*ui*w_2_nine_half - w_2_indep ); - oddPart = omegaOdd*(0.5*(pdf_NW - pdf_SE) - ui*w_2_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_NW + pdf_SE) - ui*ui*w_2_nine_half - w_2_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_NW - pdf_SE) - ui*w_2_x3 ); dst[I(x, y, z, D3Q19_NW)] = pdf_NW - evenPart - oddPart; dst[I(x, y, z, D3Q19_SE)] = pdf_SE - evenPart + oddPart; ui = ux + uy; - evenPart = omegaEven*( 0.5*(pdf_NE + pdf_SW) - ui*ui*w_2_nine_half - w_2_indep ); - oddPart = omegaOdd*(0.5*(pdf_NE - pdf_SW) - ui*w_2_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_NE + pdf_SW) - ui*ui*w_2_nine_half - w_2_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_NE - pdf_SW) - ui*w_2_x3 ); dst[I(x, y, z, D3Q19_NE)] = pdf_NE - evenPart - oddPart; dst[I(x, y, z, D3Q19_SW)] = pdf_SW - evenPart + oddPart; ui = -ux + uz; - evenPart = omegaEven*( 0.5*(pdf_TW + pdf_BE) - ui*ui*w_2_nine_half - w_2_indep ); - oddPart = omegaOdd*(0.5*(pdf_TW - pdf_BE) - ui*w_2_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_TW + pdf_BE) - ui*ui*w_2_nine_half - w_2_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_TW - pdf_BE) - 
ui*w_2_x3 ); dst[I(x, y, z, D3Q19_TW)] = pdf_TW - evenPart - oddPart; dst[I(x, y, z, D3Q19_BE)] = pdf_BE - evenPart + oddPart; ui = ux + uz; - evenPart = omegaEven*( 0.5*(pdf_TE + pdf_BW) - ui*ui*w_2_nine_half - w_2_indep ); - oddPart = omegaOdd*(0.5*(pdf_TE - pdf_BW) - ui*w_2_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_TE + pdf_BW) - ui*ui*w_2_nine_half - w_2_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_TE - pdf_BW) - ui*w_2_x3 ); dst[I(x, y, z, D3Q19_TE)] = pdf_TE - evenPart - oddPart; dst[I(x, y, z, D3Q19_BW)] = pdf_BW - evenPart + oddPart; ui = -uy + uz; - evenPart = omegaEven*( 0.5*(pdf_TS + pdf_BN) - ui*ui*w_2_nine_half - w_2_indep ); - oddPart = omegaOdd*(0.5*(pdf_TS - pdf_BN) - ui*w_2_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_TS + pdf_BN) - ui*ui*w_2_nine_half - w_2_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_TS - pdf_BN) - ui*w_2_x3 ); dst[I(x, y, z, D3Q19_TS)] = pdf_TS - evenPart - oddPart; dst[I(x, y, z, D3Q19_BN)] = pdf_BN - evenPart + oddPart; ui = uy + uz; - evenPart = omegaEven*( 0.5*(pdf_TN + pdf_BS) - ui*ui*w_2_nine_half - w_2_indep ); - oddPart = omegaOdd*(0.5*(pdf_TN - pdf_BS) - ui*w_2_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_TN + pdf_BS) - ui*ui*w_2_nine_half - w_2_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_TN - pdf_BS) - ui*w_2_x3 ); dst[I(x, y, z, D3Q19_TN)] = pdf_TN - evenPart - oddPart; dst[I(x, y, z, D3Q19_BS)] = pdf_BS - evenPart + oddPart; @@ -366,8 +374,8 @@ void FNAME(D3Q19BlkKernel)(LatticeDesc * ld, KernelData * kernelData, CaseData * Assert(kernelData != NULL); Assert(cd != NULL); - Assert(cd->Omega > 0.0); - Assert(cd->Omega < 2.0); + Assert(cd->Omega > F(0.0)); + Assert(cd->Omega < F(2.0)); KernelData * kd = (KernelData *)kernelData; @@ -391,20 +399,24 @@ void FNAME(D3Q19BlkKernel)(LatticeDesc * ld, KernelData * kernelData, CaseData * PdfT omega = cd->Omega; PdfT omegaEven = omega; -// PdfT omegaOdd = 8.0*((2.0-omegaEven)/(8.0-omegaEven)); //"standard" trt odd relaxation parameter - PdfT magicParam = 1.0/12.0; // 1/4: best 
stability; 1/12: removes third-order advection error (best advection); 1/6: removes fourth-order diffusion error (best diffusion); 3/16: exact location of bounce back for poiseuille flow - PdfT omegaOdd = 1.0/( 0.5 + magicParam/(1.0/omega - 0.5) ); + // PdfT omegaOdd = 8.0*((F(2.0)-omegaEven)/(8.0-omegaEven)); //"standard" trt odd relaxation parameter + PdfT magicParam = F(1.0)/F(12.0); + // 1/ 4: best stability; + // 1/12: removes third-order advection error (best advection); + // 1/ 6: removes fourth-order diffusion error (best diffusion); + // 3/16: exact location of bounce back for poiseuille flow + PdfT omegaOdd = F(1.0) / (F(0.5) + magicParam / (F(1.0) / omega - F(0.5))); - PdfT evenPart = 0.0; - PdfT oddPart = 0.0; - PdfT dir_indep_trm = 0.0; + PdfT evenPart = F(0.0); + PdfT oddPart = F(0.0); + PdfT dir_indep_trm = F(0.0); - PdfT w_0 = 1.0 / 3.0; - PdfT w_1 = 1.0 / 18.0; - PdfT w_2 = 1.0 / 36.0; + PdfT w_0 = F(1.0) / F( 3.0); + PdfT w_1 = F(1.0) / F(18.0); + PdfT w_2 = F(1.0) / F(36.0); - PdfT w_1_x3 = w_1 * 3.0; PdfT w_1_nine_half = w_1 * 9.0/2.0; PdfT w_1_indep = 0.0; - PdfT w_2_x3 = w_2 * 3.0; PdfT w_2_nine_half = w_2 * 9.0/2.0; PdfT w_2_indep = 0.0; + PdfT w_1_x3 = w_1 * F(3.0); PdfT w_1_nine_half = w_1 * F(9.0)/F(2.0); PdfT w_1_indep = F(0.0); + PdfT w_2_x3 = w_2 * F(3.0); PdfT w_2_nine_half = w_2 * F(9.0)/F(2.0); PdfT w_2_indep = F(0.0); PdfT ux, uy, uz, ui; PdfT dens; @@ -478,6 +490,11 @@ void FNAME(D3Q19BlkKernel)(LatticeDesc * ld, KernelData * kernelData, CaseData * for (int x = bX; x < eX; ++x) { for (int y = bY; y < eY; ++y) { + #ifdef INTEL_OPT_DIRECTIVES + #pragma ivdep + #pragma vector always + #pragma simd + #endif for (int z = bZ; z < eZ; ++z) { #define I(x, y, z, dir) P_INDEX_5(gDims, (x), (y), (z), (dir)) @@ -505,7 +522,7 @@ void FNAME(D3Q19BlkKernel)(LatticeDesc * ld, KernelData * kernelData, CaseData * #ifdef LID_DRIVEN_CAVITY if (z == nZ - 4 + oZ && x > 3 + oX && x < (nX - 4 + oX) && y > 3 + oY && y < (nY - 4 + oY)) { - ux = 0.1 * 0.577; 
+ ux = 0.1 * F(0.577); uy = 0.0; uz = 0.0; @@ -528,7 +545,7 @@ void FNAME(D3Q19BlkKernel)(LatticeDesc * ld, KernelData * kernelData, CaseData * pdf_T + pdf_TN + pdf_TE + pdf_TS + pdf_TW + pdf_B + pdf_BN + pdf_BE + pdf_BS + pdf_BW; - dir_indep_trm = dens - (ux * ux + uy * uy + uz * uz)*3.0/2.0; + dir_indep_trm = dens - (ux * ux + uy * uy + uz * uz) * F(3.0) / F(2.0); #ifdef PROP_MODEL_PUSH @@ -539,20 +556,20 @@ void FNAME(D3Q19BlkKernel)(LatticeDesc * ld, KernelData * kernelData, CaseData * w_1_indep = w_1*dir_indep_trm; ui = uy; - evenPart = omegaEven*( 0.5*(pdf_N + pdf_S) - ui*ui*w_1_nine_half - w_1_indep ); - oddPart = omegaOdd*(0.5*(pdf_N - pdf_S) - ui*w_1_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_N + pdf_S) - ui*ui*w_1_nine_half - w_1_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_N - pdf_S) - ui*w_1_x3 ); dst[I(x, y + 1, z, D3Q19_N)] = pdf_N - evenPart - oddPart; dst[I(x, y - 1, z, D3Q19_S)] = pdf_S - evenPart + oddPart; ui = ux; - evenPart = omegaEven*( 0.5*(pdf_E + pdf_W) - ui*ui*w_1_nine_half - w_1_indep ); - oddPart = omegaOdd*(0.5*(pdf_E - pdf_W) - ui*w_1_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_E + pdf_W) - ui*ui*w_1_nine_half - w_1_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_E - pdf_W) - ui*w_1_x3 ); dst[I(x + 1, y, z, D3Q19_E)] = pdf_E - evenPart - oddPart; dst[I(x - 1, y, z, D3Q19_W)] = pdf_W - evenPart + oddPart; ui = uz; - evenPart = omegaEven*( 0.5*(pdf_T + pdf_B) - ui*ui*w_1_nine_half - w_1_indep ); - oddPart = omegaOdd*(0.5*(pdf_T - pdf_B) - ui*w_1_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_T + pdf_B) - ui*ui*w_1_nine_half - w_1_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_T - pdf_B) - ui*w_1_x3 ); dst[I(x, y, z + 1, D3Q19_T)] = pdf_T - evenPart - oddPart; dst[I(x, y, z - 1, D3Q19_B)] = pdf_B - evenPart + oddPart; @@ -560,38 +577,38 @@ void FNAME(D3Q19BlkKernel)(LatticeDesc * ld, KernelData * kernelData, CaseData * w_2_indep = w_2*dir_indep_trm; ui = -ux + uy; - evenPart = omegaEven*( 0.5*(pdf_NW + pdf_SE) - ui*ui*w_2_nine_half - w_2_indep ); -
oddPart = omegaOdd*(0.5*(pdf_NW - pdf_SE) - ui*w_2_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_NW + pdf_SE) - ui*ui*w_2_nine_half - w_2_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_NW - pdf_SE) - ui*w_2_x3 ); dst[I(x - 1, y + 1, z, D3Q19_NW)] = pdf_NW - evenPart - oddPart; dst[I(x + 1, y - 1, z, D3Q19_SE)] = pdf_SE - evenPart + oddPart; ui = ux + uy; - evenPart = omegaEven*( 0.5*(pdf_NE + pdf_SW) - ui*ui*w_2_nine_half - w_2_indep ); - oddPart = omegaOdd*(0.5*(pdf_NE - pdf_SW) - ui*w_2_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_NE + pdf_SW) - ui*ui*w_2_nine_half - w_2_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_NE - pdf_SW) - ui*w_2_x3 ); dst[I(x + 1, y + 1, z, D3Q19_NE)] = pdf_NE - evenPart - oddPart; dst[I(x - 1, y - 1, z, D3Q19_SW)] = pdf_SW - evenPart + oddPart; ui = -ux + uz; - evenPart = omegaEven*( 0.5*(pdf_TW + pdf_BE) - ui*ui*w_2_nine_half - w_2_indep ); - oddPart = omegaOdd*(0.5*(pdf_TW - pdf_BE) - ui*w_2_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_TW + pdf_BE) - ui*ui*w_2_nine_half - w_2_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_TW - pdf_BE) - ui*w_2_x3 ); dst[I(x - 1, y, z + 1, D3Q19_TW)] = pdf_TW - evenPart - oddPart; dst[I(x + 1, y, z - 1, D3Q19_BE)] = pdf_BE - evenPart + oddPart; ui = ux + uz; - evenPart = omegaEven*( 0.5*(pdf_TE + pdf_BW) - ui*ui*w_2_nine_half - w_2_indep ); - oddPart = omegaOdd*(0.5*(pdf_TE - pdf_BW) - ui*w_2_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_TE + pdf_BW) - ui*ui*w_2_nine_half - w_2_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_TE - pdf_BW) - ui*w_2_x3 ); dst[I(x + 1, y, z + 1, D3Q19_TE)] = pdf_TE - evenPart - oddPart; dst[I(x - 1, y, z - 1, D3Q19_BW)] = pdf_BW - evenPart + oddPart; ui = -uy + uz; - evenPart = omegaEven*( 0.5*(pdf_TS + pdf_BN) - ui*ui*w_2_nine_half - w_2_indep ); - oddPart = omegaOdd*(0.5*(pdf_TS - pdf_BN) - ui*w_2_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_TS + pdf_BN) - ui*ui*w_2_nine_half - w_2_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_TS - pdf_BN) - ui*w_2_x3 ); dst[I(x, y - 1, z + 1, D3Q19_TS)] = pdf_TS - 
evenPart - oddPart; dst[I(x, y + 1, z - 1, D3Q19_BN)] = pdf_BN - evenPart + oddPart; ui = uy + uz; - evenPart = omegaEven*( 0.5*(pdf_TN + pdf_BS) - ui*ui*w_2_nine_half - w_2_indep ); - oddPart = omegaOdd*(0.5*(pdf_TN - pdf_BS) - ui*w_2_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_TN + pdf_BS) - ui*ui*w_2_nine_half - w_2_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_TN - pdf_BS) - ui*w_2_x3 ); dst[I(x, y + 1, z + 1, D3Q19_TN)] = pdf_TN - evenPart - oddPart; dst[I(x, y - 1, z - 1, D3Q19_BS)] = pdf_BS - evenPart + oddPart; @@ -604,20 +621,20 @@ void FNAME(D3Q19BlkKernel)(LatticeDesc * ld, KernelData * kernelData, CaseData * w_1_indep = w_1*dir_indep_trm; ui = uy; - evenPart = omegaEven*( 0.5*(pdf_N + pdf_S) - ui*ui*w_1_nine_half - w_1_indep ); - oddPart = omegaOdd*(0.5*(pdf_N - pdf_S) - ui*w_1_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_N + pdf_S) - ui*ui*w_1_nine_half - w_1_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_N - pdf_S) - ui*w_1_x3 ); dst[I(x, y, z, D3Q19_N)] = pdf_N - evenPart - oddPart; dst[I(x, y, z, D3Q19_S)] = pdf_S - evenPart + oddPart; ui = ux; - evenPart = omegaEven*( 0.5*(pdf_E + pdf_W) - ui*ui*w_1_nine_half - w_1_indep ); - oddPart = omegaOdd*(0.5*(pdf_E - pdf_W) - ui*w_1_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_E + pdf_W) - ui*ui*w_1_nine_half - w_1_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_E - pdf_W) - ui*w_1_x3 ); dst[I(x, y, z, D3Q19_E)] = pdf_E - evenPart - oddPart; dst[I(x, y, z, D3Q19_W)] = pdf_W - evenPart + oddPart; ui = uz; - evenPart = omegaEven*( 0.5*(pdf_T + pdf_B) - ui*ui*w_1_nine_half - w_1_indep ); - oddPart = omegaOdd*(0.5*(pdf_T - pdf_B) - ui*w_1_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_T + pdf_B) - ui*ui*w_1_nine_half - w_1_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_T - pdf_B) - ui*w_1_x3 ); dst[I(x, y, z, D3Q19_T)] = pdf_T - evenPart - oddPart; dst[I(x, y, z, D3Q19_B)] = pdf_B - evenPart + oddPart; @@ -625,38 +642,38 @@ void FNAME(D3Q19BlkKernel)(LatticeDesc * ld, KernelData * kernelData, CaseData * w_2_indep = w_2*dir_indep_trm; 
ui = -ux + uy; - evenPart = omegaEven*( 0.5*(pdf_NW + pdf_SE) - ui*ui*w_2_nine_half - w_2_indep ); - oddPart = omegaOdd*(0.5*(pdf_NW - pdf_SE) - ui*w_2_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_NW + pdf_SE) - ui*ui*w_2_nine_half - w_2_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_NW - pdf_SE) - ui*w_2_x3 ); dst[I(x, y, z, D3Q19_NW)] = pdf_NW - evenPart - oddPart; dst[I(x, y, z, D3Q19_SE)] = pdf_SE - evenPart + oddPart; ui = ux + uy; - evenPart = omegaEven*( 0.5*(pdf_NE + pdf_SW) - ui*ui*w_2_nine_half - w_2_indep ); - oddPart = omegaOdd*(0.5*(pdf_NE - pdf_SW) - ui*w_2_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_NE + pdf_SW) - ui*ui*w_2_nine_half - w_2_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_NE - pdf_SW) - ui*w_2_x3 ); dst[I(x, y, z, D3Q19_NE)] = pdf_NE - evenPart - oddPart; dst[I(x, y, z, D3Q19_SW)] = pdf_SW - evenPart + oddPart; ui = -ux + uz; - evenPart = omegaEven*( 0.5*(pdf_TW + pdf_BE) - ui*ui*w_2_nine_half - w_2_indep ); - oddPart = omegaOdd*(0.5*(pdf_TW - pdf_BE) - ui*w_2_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_TW + pdf_BE) - ui*ui*w_2_nine_half - w_2_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_TW - pdf_BE) - ui*w_2_x3 ); dst[I(x, y, z, D3Q19_TW)] = pdf_TW - evenPart - oddPart; dst[I(x, y, z, D3Q19_BE)] = pdf_BE - evenPart + oddPart; ui = ux + uz; - evenPart = omegaEven*( 0.5*(pdf_TE + pdf_BW) - ui*ui*w_2_nine_half - w_2_indep ); - oddPart = omegaOdd*(0.5*(pdf_TE - pdf_BW) - ui*w_2_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_TE + pdf_BW) - ui*ui*w_2_nine_half - w_2_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_TE - pdf_BW) - ui*w_2_x3 ); dst[I(x, y, z, D3Q19_TE)] = pdf_TE - evenPart - oddPart; dst[I(x, y, z, D3Q19_BW)] = pdf_BW - evenPart + oddPart; ui = -uy + uz; - evenPart = omegaEven*( 0.5*(pdf_TS + pdf_BN) - ui*ui*w_2_nine_half - w_2_indep ); - oddPart = omegaOdd*(0.5*(pdf_TS - pdf_BN) - ui*w_2_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_TS + pdf_BN) - ui*ui*w_2_nine_half - w_2_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_TS - pdf_BN) - ui*w_2_x3 ); dst[I(x, 
y, z, D3Q19_TS)] = pdf_TS - evenPart - oddPart; dst[I(x, y, z, D3Q19_BN)] = pdf_BN - evenPart + oddPart; ui = uy + uz; - evenPart = omegaEven*( 0.5*(pdf_TN + pdf_BS) - ui*ui*w_2_nine_half - w_2_indep ); - oddPart = omegaOdd*(0.5*(pdf_TN - pdf_BS) - ui*w_2_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_TN + pdf_BS) - ui*ui*w_2_nine_half - w_2_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_TN - pdf_BS) - ui*w_2_x3 ); dst[I(x, y, z, D3Q19_TN)] = pdf_TN - evenPart - oddPart; dst[I(x, y, z, D3Q19_BS)] = pdf_BS - evenPart + oddPart; diff --git a/src/BenchKernelD3Q19Aa.c b/src/BenchKernelD3Q19Aa.c index 87a9e33..a6f73fc 100644 --- a/src/BenchKernelD3Q19Aa.c +++ b/src/BenchKernelD3Q19Aa.c @@ -43,8 +43,8 @@ void FNAME(D3Q19AaKernel)(LatticeDesc * ld, KernelData * kernelData, CaseData * Assert(kernelData != NULL); Assert(cd != NULL); - Assert(cd->Omega > 0.0); - Assert(cd->Omega < 2.0); + Assert(cd->Omega > F(0.0)); + Assert(cd->Omega < F(2.0)); KernelData * kd = (KernelData *)kernelData; @@ -68,24 +68,24 @@ void FNAME(D3Q19AaKernel)(LatticeDesc * ld, KernelData * kernelData, CaseData * PdfT omega = cd->Omega; PdfT omegaEven = omega; - PdfT magicParam = 1.0 / 12.0; + PdfT magicParam = F(1.0) / F(12.0); // 1/4: best stability; // 1/12: removes third-order advection error (best advection); // 1/6: removes fourth-order diffusion error (best diffusion); // 3/16: exact location of bounce back for poiseuille flow - PdfT omegaOdd = 1.0/( 0.5 + magicParam/(1.0/omega - 0.5) ); + PdfT omegaOdd = F(1.0)/( F(0.5) + magicParam/(F(1.0)/omega - F(0.5)) ); - PdfT evenPart = 0.0; - PdfT oddPart = 0.0; - PdfT dir_indep_trm = 0.0; + PdfT evenPart = F(0.0); + PdfT oddPart = F(0.0); + PdfT dir_indep_trm = F(0.0); - PdfT w_0 = 1.0 / 3.0; - PdfT w_1 = 1.0 / 18.0; - PdfT w_2 = 1.0 / 36.0; + PdfT w_0 = F(1.0) / F(3.0); + PdfT w_1 = F(1.0) / F(18.0); + PdfT w_2 = F(1.0) / F(36.0); - PdfT w_1_x3 = w_1 * 3.0; PdfT w_1_nine_half = w_1 * 9.0/2.0; PdfT w_1_indep = 0.0; - PdfT w_2_x3 = w_2 * 3.0; PdfT 
w_2_nine_half = w_2 * 9.0/2.0; PdfT w_2_indep = 0.0; + PdfT w_1_x3 = w_1 * F(3.0); PdfT w_1_nine_half = w_1 * F(9.0)/F(2.0); PdfT w_1_indep = F(0.0); + PdfT w_2_x3 = w_2 * F(3.0); PdfT w_2_nine_half = w_2 * F(9.0)/F(2.0); PdfT w_2_indep = F(0.0); PdfT ux, uy, uz, ui; PdfT dens; @@ -169,6 +169,11 @@ void FNAME(D3Q19AaKernel)(LatticeDesc * ld, KernelData * kernelData, CaseData * for (int x = bX; x < eX; ++x) { for (int y = bY; y < eY; ++y) { + #ifdef INTEL_OPT_DIRECTIVES + #pragma ivdep + #pragma vector always + #pragma simd + #endif for (int z = bZ; z < eZ; ++z) { @@ -189,9 +194,9 @@ void FNAME(D3Q19AaKernel)(LatticeDesc * ld, KernelData * kernelData, CaseData * #ifdef LID_DRIVEN_CAVITY if (z == nZ - 4 + oZ && x > 3 + oX && x < (nX - 4 + oX) && y > 3 + oY && y < (nY - 4 + oY)) { - ux = 0.1 * 0.577; - uy = 0.0; - uz = 0.0; + ux = F(0.1) * F(0.577); + uy = F(0.0); + uz = F(0.0); } else { #endif @@ -211,7 +216,7 @@ void FNAME(D3Q19AaKernel)(LatticeDesc * ld, KernelData * kernelData, CaseData * pdf_T + pdf_TN + pdf_TE + pdf_TS + pdf_TW + pdf_B + pdf_BN + pdf_BE + pdf_BS + pdf_BW; - dir_indep_trm = dens - (ux * ux + uy * uy + uz * uz)*3.0/2.0; + dir_indep_trm = dens - (ux * ux + uy * uy + uz * uz)*F(3.0)/F(2.0); // direction: w_0 src[I(x, y, z, D3Q19_C)] = pdf_C - omegaEven*(pdf_C - w_0*dir_indep_trm); @@ -220,20 +225,20 @@ void FNAME(D3Q19AaKernel)(LatticeDesc * ld, KernelData * kernelData, CaseData * w_1_indep = w_1*dir_indep_trm; ui = uy; - evenPart = omegaEven*( 0.5*(pdf_N + pdf_S) - ui*ui*w_1_nine_half - w_1_indep ); - oddPart = omegaOdd*(0.5*(pdf_N - pdf_S) - ui*w_1_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_N + pdf_S) - ui*ui*w_1_nine_half - w_1_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_N - pdf_S) - ui*w_1_x3 ); src[I(x, y, z, D3Q19_S)] = pdf_N - evenPart - oddPart; src[I(x, y, z, D3Q19_N)] = pdf_S - evenPart + oddPart; ui = ux; - evenPart = omegaEven*( 0.5*(pdf_E + pdf_W) - ui*ui*w_1_nine_half - w_1_indep ); - oddPart = omegaOdd*(0.5*(pdf_E - pdf_W) - ui*w_1_x3
); + evenPart = omegaEven*( F(0.5)*(pdf_E + pdf_W) - ui*ui*w_1_nine_half - w_1_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_E - pdf_W) - ui*w_1_x3 ); src[I(x, y, z, D3Q19_W)] = pdf_E - evenPart - oddPart; src[I(x, y, z, D3Q19_E)] = pdf_W - evenPart + oddPart; ui = uz; - evenPart = omegaEven*( 0.5*(pdf_T + pdf_B) - ui*ui*w_1_nine_half - w_1_indep ); - oddPart = omegaOdd*(0.5*(pdf_T - pdf_B) - ui*w_1_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_T + pdf_B) - ui*ui*w_1_nine_half - w_1_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_T - pdf_B) - ui*w_1_x3 ); src[I(x, y, z, D3Q19_B)] = pdf_T - evenPart - oddPart; src[I(x, y, z, D3Q19_T)] = pdf_B - evenPart + oddPart; @@ -241,38 +246,38 @@ void FNAME(D3Q19AaKernel)(LatticeDesc * ld, KernelData * kernelData, CaseData * w_2_indep = w_2*dir_indep_trm; ui = -ux + uy; - evenPart = omegaEven*( 0.5*(pdf_NW + pdf_SE) - ui*ui*w_2_nine_half - w_2_indep ); - oddPart = omegaOdd*(0.5*(pdf_NW - pdf_SE) - ui*w_2_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_NW + pdf_SE) - ui*ui*w_2_nine_half - w_2_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_NW - pdf_SE) - ui*w_2_x3 ); src[I(x, y, z, D3Q19_SE)] = pdf_NW - evenPart - oddPart; src[I(x, y, z, D3Q19_NW)] = pdf_SE - evenPart + oddPart; ui = ux + uy; - evenPart = omegaEven*( 0.5*(pdf_NE + pdf_SW) - ui*ui*w_2_nine_half - w_2_indep ); - oddPart = omegaOdd*(0.5*(pdf_NE - pdf_SW) - ui*w_2_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_NE + pdf_SW) - ui*ui*w_2_nine_half - w_2_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_NE - pdf_SW) - ui*w_2_x3 ); src[I(x, y, z, D3Q19_SW)] = pdf_NE - evenPart - oddPart; src[I(x, y, z, D3Q19_NE)] = pdf_SW - evenPart + oddPart; ui = -ux + uz; - evenPart = omegaEven*( 0.5*(pdf_TW + pdf_BE) - ui*ui*w_2_nine_half - w_2_indep ); - oddPart = omegaOdd*(0.5*(pdf_TW - pdf_BE) - ui*w_2_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_TW + pdf_BE) - ui*ui*w_2_nine_half - w_2_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_TW - pdf_BE) - ui*w_2_x3 ); src[I(x, y, z, D3Q19_BE)] = pdf_TW - evenPart - oddPart; 
src[I(x, y, z, D3Q19_TW)] = pdf_BE - evenPart + oddPart; ui = ux + uz; - evenPart = omegaEven*( 0.5*(pdf_TE + pdf_BW) - ui*ui*w_2_nine_half - w_2_indep ); - oddPart = omegaOdd*(0.5*(pdf_TE - pdf_BW) - ui*w_2_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_TE + pdf_BW) - ui*ui*w_2_nine_half - w_2_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_TE - pdf_BW) - ui*w_2_x3 ); src[I(x, y, z, D3Q19_BW)] = pdf_TE - evenPart - oddPart; src[I(x, y, z, D3Q19_TE)] = pdf_BW - evenPart + oddPart; ui = -uy + uz; - evenPart = omegaEven*( 0.5*(pdf_TS + pdf_BN) - ui*ui*w_2_nine_half - w_2_indep ); - oddPart = omegaOdd*(0.5*(pdf_TS - pdf_BN) - ui*w_2_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_TS + pdf_BN) - ui*ui*w_2_nine_half - w_2_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_TS - pdf_BN) - ui*w_2_x3 ); src[I(x, y, z, D3Q19_BN)] = pdf_TS - evenPart - oddPart; src[I(x, y, z, D3Q19_TS)] = pdf_BN - evenPart + oddPart; ui = uy + uz; - evenPart = omegaEven*( 0.5*(pdf_TN + pdf_BS) - ui*ui*w_2_nine_half - w_2_indep ); - oddPart = omegaOdd*(0.5*(pdf_TN - pdf_BS) - ui*w_2_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_TN + pdf_BS) - ui*ui*w_2_nine_half - w_2_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_TN - pdf_BS) - ui*w_2_x3 ); src[I(x, y, z, D3Q19_BS)] = pdf_TN - evenPart - oddPart; src[I(x, y, z, D3Q19_TN)] = pdf_BS - evenPart + oddPart; @@ -296,6 +301,9 @@ void FNAME(D3Q19AaKernel)(LatticeDesc * ld, KernelData * kernelData, CaseData * #pragma omp parallel for default(none) \ shared(kd, src) #endif + #ifdef INTEL_OPT_DIRECTIVES + #pragma ivdep + #endif for (int i = 0; i < kd->nBounceBackPdfs; ++i) { src[kd->BounceBackPdfsSrc[i]] = src[kd->BounceBackPdfsDst[i]]; } @@ -369,6 +377,11 @@ void FNAME(D3Q19AaKernel)(LatticeDesc * ld, KernelData * kernelData, CaseData * for (int x = bX; x < eX; ++x) { for (int y = bY; y < eY; ++y) { + #ifdef INTEL_OPT_DIRECTIVES + #pragma ivdep + #pragma vector always + #pragma simd + #endif for (int z = bZ; z < eZ; ++z) { #define I(x, y, z, dir) P_INDEX_5(gDims, (x), (y), (z), 
(dir)) @@ -384,9 +397,9 @@ void FNAME(D3Q19AaKernel)(LatticeDesc * ld, KernelData * kernelData, CaseData * #ifdef LID_DRIVEN_CAVITY if (z == nZ - 4 + oZ && x > 3 + oX && x < (nX - 4 + oX) && y > 3 + oY && y < (nY - 4 + oY)) { - ux = 0.1 * 0.577; - uy = 0.0; - uz = 0.0; + ux = F(0.1) * F(0.577); + uy = F(0.0); + uz = F(0.0); } else { #endif @@ -406,7 +419,7 @@ void FNAME(D3Q19AaKernel)(LatticeDesc * ld, KernelData * kernelData, CaseData * pdf_T + pdf_TN + pdf_TE + pdf_TS + pdf_TW + pdf_B + pdf_BN + pdf_BE + pdf_BS + pdf_BW; - dir_indep_trm = dens - (ux * ux + uy * uy + uz * uz)*3.0/2.0; + dir_indep_trm = dens - (ux * ux + uy * uy + uz * uz)*F(3.0)/F(2.0); // direction: w_0 src[I(x, y, z, D3Q19_C)] = pdf_C - omegaEven*(pdf_C - w_0*dir_indep_trm); @@ -415,20 +428,20 @@ void FNAME(D3Q19AaKernel)(LatticeDesc * ld, KernelData * kernelData, CaseData * w_1_indep = w_1*dir_indep_trm; ui = uy; - evenPart = omegaEven*( 0.5*(pdf_N + pdf_S) - ui*ui*w_1_nine_half - w_1_indep ); - oddPart = omegaOdd*(0.5*(pdf_N - pdf_S) - ui*w_1_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_N + pdf_S) - ui*ui*w_1_nine_half - w_1_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_N - pdf_S) - ui*w_1_x3 ); src[I(x, y + 1, z, D3Q19_N)] = pdf_N - evenPart - oddPart; src[I(x, y - 1, z, D3Q19_S)] = pdf_S - evenPart + oddPart; ui = ux; - evenPart = omegaEven*( 0.5*(pdf_E + pdf_W) - ui*ui*w_1_nine_half - w_1_indep ); - oddPart = omegaOdd*(0.5*(pdf_E - pdf_W) - ui*w_1_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_E + pdf_W) - ui*ui*w_1_nine_half - w_1_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_E - pdf_W) - ui*w_1_x3 ); src[I(x + 1, y, z, D3Q19_E)] = pdf_E - evenPart - oddPart; src[I(x - 1, y, z, D3Q19_W)] = pdf_W - evenPart + oddPart; ui = uz; - evenPart = omegaEven*( 0.5*(pdf_T + pdf_B) - ui*ui*w_1_nine_half - w_1_indep ); - oddPart = omegaOdd*(0.5*(pdf_T - pdf_B) - ui*w_1_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_T + pdf_B) - ui*ui*w_1_nine_half - w_1_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_T - pdf_B) - ui*w_1_x3 
); src[I(x, y, z + 1, D3Q19_T)] = pdf_T - evenPart - oddPart; src[I(x, y, z - 1, D3Q19_B)] = pdf_B - evenPart + oddPart; @@ -436,38 +449,38 @@ void FNAME(D3Q19AaKernel)(LatticeDesc * ld, KernelData * kernelData, CaseData * w_2_indep = w_2*dir_indep_trm; ui = -ux + uy; - evenPart = omegaEven*( 0.5*(pdf_NW + pdf_SE) - ui*ui*w_2_nine_half - w_2_indep ); - oddPart = omegaOdd*(0.5*(pdf_NW - pdf_SE) - ui*w_2_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_NW + pdf_SE) - ui*ui*w_2_nine_half - w_2_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_NW - pdf_SE) - ui*w_2_x3 ); src[I(x - 1, y + 1, z, D3Q19_NW)] = pdf_NW - evenPart - oddPart; src[I(x + 1, y - 1, z, D3Q19_SE)] = pdf_SE - evenPart + oddPart; ui = ux + uy; - evenPart = omegaEven*( 0.5*(pdf_NE + pdf_SW) - ui*ui*w_2_nine_half - w_2_indep ); - oddPart = omegaOdd*(0.5*(pdf_NE - pdf_SW) - ui*w_2_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_NE + pdf_SW) - ui*ui*w_2_nine_half - w_2_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_NE - pdf_SW) - ui*w_2_x3 ); src[I(x + 1, y + 1, z, D3Q19_NE)] = pdf_NE - evenPart - oddPart; src[I(x - 1, y - 1, z, D3Q19_SW)] = pdf_SW - evenPart + oddPart; ui = -ux + uz; - evenPart = omegaEven*( 0.5*(pdf_TW + pdf_BE) - ui*ui*w_2_nine_half - w_2_indep ); - oddPart = omegaOdd*(0.5*(pdf_TW - pdf_BE) - ui*w_2_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_TW + pdf_BE) - ui*ui*w_2_nine_half - w_2_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_TW - pdf_BE) - ui*w_2_x3 ); src[I(x - 1, y, z + 1, D3Q19_TW)] = pdf_TW - evenPart - oddPart; src[I(x + 1, y, z - 1, D3Q19_BE)] = pdf_BE - evenPart + oddPart; ui = ux + uz; - evenPart = omegaEven*( 0.5*(pdf_TE + pdf_BW) - ui*ui*w_2_nine_half - w_2_indep ); - oddPart = omegaOdd*(0.5*(pdf_TE - pdf_BW) - ui*w_2_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_TE + pdf_BW) - ui*ui*w_2_nine_half - w_2_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_TE - pdf_BW) - ui*w_2_x3 ); src[I(x + 1, y, z + 1, D3Q19_TE)] = pdf_TE - evenPart - oddPart; src[I(x - 1, y, z - 1, D3Q19_BW)] = pdf_BW - evenPart + oddPart; 
ui = -uy + uz; - evenPart = omegaEven*( 0.5*(pdf_TS + pdf_BN) - ui*ui*w_2_nine_half - w_2_indep ); - oddPart = omegaOdd*(0.5*(pdf_TS - pdf_BN) - ui*w_2_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_TS + pdf_BN) - ui*ui*w_2_nine_half - w_2_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_TS - pdf_BN) - ui*w_2_x3 ); src[I(x, y - 1, z + 1, D3Q19_TS)] = pdf_TS - evenPart - oddPart; src[I(x, y + 1, z - 1, D3Q19_BN)] = pdf_BN - evenPart + oddPart; ui = uy + uz; - evenPart = omegaEven*( 0.5*(pdf_TN + pdf_BS) - ui*ui*w_2_nine_half - w_2_indep ); - oddPart = omegaOdd*(0.5*(pdf_TN - pdf_BS) - ui*w_2_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_TN + pdf_BS) - ui*ui*w_2_nine_half - w_2_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_TN - pdf_BS) - ui*w_2_x3 ); src[I(x, y + 1, z + 1, D3Q19_TN)] = pdf_TN - evenPart - oddPart; src[I(x, y - 1, z - 1, D3Q19_BS)] = pdf_BS - evenPart + oddPart; @@ -488,6 +501,9 @@ void FNAME(D3Q19AaKernel)(LatticeDesc * ld, KernelData * kernelData, CaseData * #pragma omp parallel for default(none) \ shared(kd, src) #endif + #ifdef INTEL_OPT_DIRECTIVES + #pragma ivdep + #endif for (int i = 0; i < kd->nBounceBackPdfs; ++i) { src[kd->BounceBackPdfsDst[i]] = src[kd->BounceBackPdfsSrc[i]]; } diff --git a/src/BenchKernelD3Q19AaVec.c b/src/BenchKernelD3Q19AaVec.c index 2642c1c..f79e6cd 100644 --- a/src/BenchKernelD3Q19AaVec.c +++ b/src/BenchKernelD3Q19AaVec.c @@ -81,7 +81,7 @@ void DumpPdfs(LatticeDesc * ld, KernelData * kd, int zStart, int zStop, int iter // kd->GetNode(kd, x, y, z, pdfs); } else { - pdfs[dir] = -1.0; + pdfs[dir] = -F(1.0); } printf("%.16e ", pdfs[dir]); @@ -100,8 +100,8 @@ void FNAME(D3Q19AaVecKernel)(LatticeDesc * ld, KernelData * kd, CaseData * cd) Assert(kd != NULL); Assert(cd != NULL); - Assert(cd->Omega > 0.0); - Assert(cd->Omega < 2.0); + Assert(cd->Omega > F(0.0)); + Assert(cd->Omega < F(2.0)); KernelDataAa * kda = KDA(kd); @@ -233,8 +233,8 @@ static void KernelEven(LatticeDesc * ld, KernelData * kd, CaseData * cd) // {{{ Assert(kd != NULL); 
Assert(cd != NULL); - Assert(cd->Omega > 0.0); - Assert(cd->Omega < 2.0); + Assert(cd->Omega > F(0.0)); + Assert(cd->Omega < F(2.0)); KernelDataAa * kda = KDA(kd); @@ -256,19 +256,19 @@ static void KernelEven(LatticeDesc * ld, KernelData * kd, CaseData * cd) // {{{ PdfT omega = cd->Omega; PdfT omegaEven = omega; - PdfT magicParam = 1.0 / 12.0; - PdfT omegaOdd = 1.0 / (0.5 + magicParam / (1.0 / omega - 0.5)); + PdfT magicParam = F(1.0) / F(12.0); + PdfT omegaOdd = F(1.0) / (F(0.5) + magicParam / (F(1.0) / omega - F(0.5))); - const PdfT w_0 = 1.0 / 3.0; - const PdfT w_1 = 1.0 / 18.0; - const PdfT w_2 = 1.0 / 36.0; + const PdfT w_0 = F(1.0) / F(3.0); + const PdfT w_1 = F(1.0) / F(18.0); + const PdfT w_2 = F(1.0) / F(36.0); - const PdfT w_1_x3 = w_1 * 3.0; const PdfT w_1_nine_half = w_1 * 9.0 / 2.0; - const PdfT w_2_x3 = w_2 * 3.0; const PdfT w_2_nine_half = w_2 * 9.0 / 2.0; + const PdfT w_1_x3 = w_1 * F(3.0); const PdfT w_1_nine_half = w_1 * F(9.0) / F(2.0); + const PdfT w_2_x3 = w_2 * F(3.0); const PdfT w_2_nine_half = w_2 * F(9.0) / F(2.0); - VPDFT VONE_HALF = VSET(0.5); - VPDFT VTHREE_HALF = VSET(3.0 / 2.0); + VPDFT VONE_HALF = VSET(F(0.5)); + VPDFT VTHREE_HALF = VSET(F(3.0) / F(2.0)); VPDFT vw_1_indep, vw_2_indep; VPDFT vw_0 = VSET(w_0); @@ -427,8 +427,8 @@ static void KernelOdd(LatticeDesc * ld, KernelData * kd, CaseData * cd) // {{{ Assert(kd != NULL); Assert(cd != NULL); - Assert(cd->Omega > 0.0); - Assert(cd->Omega < 2.0); + Assert(cd->Omega > F(0.0)); + Assert(cd->Omega < F(2.0)); KernelDataAa * kda = KDA(kd); @@ -450,18 +450,18 @@ static void KernelOdd(LatticeDesc * ld, KernelData * kd, CaseData * cd) // {{{ PdfT omega = cd->Omega; PdfT omegaEven = omega; - PdfT magicParam = 1.0 / 12.0; - PdfT omegaOdd = 1.0 / (0.5 + magicParam / (1.0 / omega - 0.5)); + PdfT magicParam = F(1.0) / F(12.0); + PdfT omegaOdd = F(1.0) / (F(0.5) + magicParam / (F(1.0) / omega - F(0.5))); - const PdfT w_0 = 1.0 / 3.0; - const PdfT w_1 = 1.0 / 18.0; - const PdfT w_2 = 1.0 / 36.0; + 
const PdfT w_0 = F(1.0) / F(3.0); + const PdfT w_1 = F(1.0) / F(18.0); + const PdfT w_2 = F(1.0) / F(36.0); - const PdfT w_1_x3 = w_1 * 3.0; const PdfT w_1_nine_half = w_1 * 9.0 / 2.0; - const PdfT w_2_x3 = w_2 * 3.0; const PdfT w_2_nine_half = w_2 * 9.0 / 2.0; + const PdfT w_1_x3 = w_1 * F(3.0); const PdfT w_1_nine_half = w_1 * F(9.0) / F(2.0); + const PdfT w_2_x3 = w_2 * F(3.0); const PdfT w_2_nine_half = w_2 * F(9.0) / F(2.0); - VPDFT VONE_HALF = VSET(0.5); - VPDFT VTHREE_HALF = VSET(3.0 / 2.0); + VPDFT VONE_HALF = VSET(F(0.5)); + VPDFT VTHREE_HALF = VSET(F(3.0) / F(2.0)); VPDFT vw_1_indep, vw_2_indep; VPDFT vw_0 = VSET(w_0); diff --git a/src/BenchKernelD3Q19AaVecCommon.c b/src/BenchKernelD3Q19AaVecCommon.c index 57a9fae..483cbe7 100644 --- a/src/BenchKernelD3Q19AaVecCommon.c +++ b/src/BenchKernelD3Q19AaVecCommon.c @@ -332,7 +332,7 @@ void FNAME(D3Q19AaVecInit)(LatticeDesc * ld, KernelData ** kernelData, Parameter gDims[0] = lDims[0] + 2; gDims[1] = lDims[1] + 2; // TODO: fix this for aa-vec2-soa - gDims[2] = lDims[2] + 4; // one ghost cell in front, one in the back, plus at most two at the back for VSIZE = 4 + gDims[2] = lDims[2] + 2 + VSIZE - 2; // one ghost cell in front, one in the back, plus at most two at the back for VSIZE = 4 kd->Offsets[0] = 1; kd->Offsets[1] = 1; diff --git a/src/BenchKernelD3Q19AaVecSl.c b/src/BenchKernelD3Q19AaVecSl.c new file mode 100644 index 0000000..885a065 --- /dev/null +++ b/src/BenchKernelD3Q19AaVecSl.c @@ -0,0 +1,682 @@ +// -------------------------------------------------------------------------- +// +// Copyright +// Markus Wittmann, 2016-2017 +// RRZE, University of Erlangen-Nuremberg, Germany +// markus.wittmann -at- fau.de or hpc -at- rrze.fau.de +// +// Viktor Haag, 2016 +// LSS, University of Erlangen-Nuremberg, Germany +// +// This file is part of the Lattice Boltzmann Benchmark Kernels (LbmBenchKernels). 
+// +// LbmBenchKernels is free software: you can redistribute it and/or modify +// it under the terms of the GNU General Public License as published by +// the Free Software Foundation, either version 3 of the License, or +// (at your option) any later version. +// +// LbmBenchKernels is distributed in the hope that it will be useful, +// but WITHOUT ANY WARRANTY; without even the implied warranty of +// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +// GNU General Public License for more details. +// +// You should have received a copy of the GNU General Public License +// along with LbmBenchKernels. If not, see <http://www.gnu.org/licenses/>. +// +// -------------------------------------------------------------------------- +#include "BenchKernelD3Q19AaVecCommon.h" + +#include "Memory.h" +#include "Vtk.h" +#include "LikwidIf.h" +#include "Vector.h" +#include "Vector.h" + +#include +#include + +#ifdef _OPENMP + #include <omp.h> +#endif + +static void KernelEven(LatticeDesc * ld, KernelData * kd, CaseData * cd); +static void KernelOddVecSl(LatticeDesc * ld, KernelData * kd, CaseData * cd); + +#if 1 // {{{ +void DumpPdfs(LatticeDesc * ld, KernelData * kd, int zStart, int zStop, int iter, const char * prefix, int dir) +{ + int * gDims = kd->GlobalDims; + + int nX = gDims[0]; + int nY = gDims[1]; + // int nZ = gDims[2]; + + PdfT pdfs[N_D3Q19]; + + int localZStart = zStart; + int localZStop = zStop; + + if (localZStart == -1) localZStart = 0; + if (localZStop == -1) localZStop = gDims[2] - 1; + + printf("D iter: %d dir: %d %s\n", iter, dir, D3Q19_NAMES[dir]); + +// for (int dir = 0; dir < 19; ++dir) { + for (int z = localZStop; z >= localZStart; --z) { + printf("D [%2d][%2d][%s] plane % 2d\n", iter, dir, prefix, z); + + for(int y = 0; y < nY; ++y) { + // for(int y = 2; y < nY - 2; ++y) { + printf("D [%2d][%2d][%s] %2d ", iter, dir, prefix, y); + + for(int x = 0; x < nX; ++x) { + + if (1) { // ld->Lattice[L_INDEX_4(ld->Dims, x, y, z)] != LAT_CELL_OBSTACLE) { + + #define I(x, y, z, dir) 
P_INDEX_5(gDims, (x), (y), (z), (dir)) + pdfs[dir] = kd->PdfsActive[I(x, y, z, dir)]; + #undef I + } + else { + pdfs[dir] = -1.0; + } + + printf("%.16e ", pdfs[dir]); + // printf("%08.0f ", pdfs[dir]); + } + + printf("\n"); + } + } +// } +} +#endif // }}} + +void FNAME(D3Q19AaVecSlKernel)(LatticeDesc * ld, KernelData * kd, CaseData * cd) +{ + Assert(ld != NULL); + Assert(kd != NULL); + Assert(cd != NULL); + + Assert(cd->Omega > 0.0); + Assert(cd->Omega < 2.0); + + KernelDataAa * kda = KDA(kd); + + PdfT * src = kd->PdfsActive; + + int maxIterations = cd->MaxIterations; + + #ifdef VTK_OUTPUT + if (cd->VtkOutput) { + kd->PdfsActive = src; + VtkWrite(ld, kd, cd, -1); + } + #endif + + #ifdef STATISTICS + kd->PdfsActive = src; + KernelStatistics(kd, ld, cd, 0); + #endif + + Assert((maxIterations % 2) == 0); + + #ifdef _OPENMP + #pragma omp parallel default(none) shared(kda, kd, ld, cd, src, maxIterations) + #endif + { + for (int iter = 0; iter < maxIterations; iter += 2) { + + // -------------------------------------------------------------------- + // even time step + // -------------------------------------------------------------------- + + X_LIKWID_START("aa-vec-even"); + + KernelEven(ld, kd, cd); + #ifdef _OPENMP + #pragma omp barrier + #endif + + X_LIKWID_STOP("aa-vec-even"); + + // Fixup bounce back PDFs. 
+ #ifdef _OPENMP + #pragma omp for + #endif + #ifdef INTEL_OPT_DIRECTIVES + #pragma ivdep + #endif + for (int i = 0; i < kd->nBounceBackPdfs; ++i) { + src[kd->BounceBackPdfsSrc[i]] = src[kd->BounceBackPdfsDst[i]]; + } + + #ifdef _OPENMP + #pragma omp single + #endif + { + // save current iteration + kda->Iteration = iter; + + #ifdef VERIFICATION + kd->PdfsActive = src; + KernelAddBodyForce(kd, ld, cd); + #endif + + #ifdef VTK_OUTPUT + if (cd->VtkOutput && (iter % cd->VtkModulus) == 0) { + kd->PdfsActive = src; + VtkWrite(ld, kd, cd, iter); + } + #endif + + #ifdef STATISTICS + kd->PdfsActive = src; + KernelStatistics(kd, ld, cd, iter); + #endif + } + #ifdef _OPENMP + #pragma omp barrier + #endif + + + // -------------------------------------------------------------------- + // odd time step + // -------------------------------------------------------------------- + + X_LIKWID_START("aa-vec-odd"); + + + KernelOddVecSl(ld, kd, cd); + #ifdef _OPENMP + #pragma omp barrier + #endif + + // Stop counters before bounce back. Else computing loop balance will + // be incorrect. + + X_LIKWID_STOP("aa-vec-odd"); + + // Fixup bounce back PDFs. + #ifdef _OPENMP + #pragma omp for + #endif + #ifdef INTEL_OPT_DIRECTIVES + #pragma ivdep + #endif + for (int i = 0; i < kd->nBounceBackPdfs; ++i) { + src[kd->BounceBackPdfsDst[i]] = src[kd->BounceBackPdfsSrc[i]]; + } + + #ifdef _OPENMP + #pragma omp single + #endif + { + // save current iteration + kda->Iteration = iter + 1; + + #ifdef VERIFICATION + kd->PdfsActive = src; + KernelAddBodyForce(kd, ld, cd); + #endif + + #ifdef VTK_OUTPUT + if (cd->VtkOutput && ((iter + 1) % cd->VtkModulus) == 0) { + kd->PdfsActive = src; + VtkWrite(ld, kd, cd, iter + 1); + } + #endif + + #ifdef STATISTICS + kd->PdfsActive = src; + KernelStatistics(kd, ld, cd, iter + 1); + #endif + } + #ifdef _OPENMP + #pragma omp barrier + #endif + } // for (int iter = 0; ... 
+ } // omp parallel + + #ifdef VTK_OUTPUT + + if (cd->VtkOutput) { + kd->PdfsActive = src; + VtkWrite(ld, kd, cd, maxIterations); + } + + #endif + + return; +} + +static void KernelEven(LatticeDesc * ld, KernelData * kd, CaseData * cd) // {{{ +{ + Assert(ld != NULL); + Assert(kd != NULL); + Assert(cd != NULL); + + Assert(cd->Omega > F(0.0)); + Assert(cd->Omega < F(2.0)); + + KernelDataAa * kda = KDA(kd); + + int nX = ld->Dims[0]; + int nY = ld->Dims[1]; + int nZ = ld->Dims[2]; + + int * gDims = kd->GlobalDims; + + int oX = kd->Offsets[0]; + int oY = kd->Offsets[1]; + int oZ = kd->Offsets[2]; + + int blk[3]; + blk[0] = kda->Blk[0]; + blk[1] = kda->Blk[1]; + blk[2] = kda->Blk[2]; + + PdfT omega = cd->Omega; + PdfT omegaEven = omega; + + PdfT magicParam = F(1.0) / F(12.0); + PdfT omegaOdd = F(1.0) / (F(0.5) + magicParam / (F(1.0) / omega - F(0.5))); + + const PdfT w_0 = F(1.0) / F( 3.0); + const PdfT w_1 = F(1.0) / F(18.0); + const PdfT w_2 = F(1.0) / F(36.0); + + const PdfT w_1_x3 = w_1 * F(3.0); const PdfT w_1_nine_half = w_1 * F(9.0) / F(2.0); + const PdfT w_2_x3 = w_2 * F(3.0); const PdfT w_2_nine_half = w_2 * F(9.0) / F(2.0); + + + VPDFT VONE_HALF = VSET(F(0.5)); + VPDFT VTHREE_HALF = VSET(F(3.0) / F(2.0)); + + VPDFT vw_1_indep, vw_2_indep; + VPDFT vw_0 = VSET(w_0); + VPDFT vw_1 = VSET(w_1); + VPDFT vw_2 = VSET(w_2); + + VPDFT vw_1_x3 = VSET(w_1_x3); + VPDFT vw_2_x3 = VSET(w_2_x3); + VPDFT vw_1_nine_half = VSET(w_1_nine_half); + VPDFT vw_2_nine_half = VSET(w_2_nine_half); + + VPDFT vui, vux, vuy, vuz, vdens; + + VPDFT vevenPart, voddPart, vdir_indep_trm; + + VPDFT vomegaEven = VSET(omegaEven); + VPDFT vomegaOdd = VSET(omegaOdd); + + VPDFT vpdf_a, vpdf_b; + + // Declare pdf_N, pdf_E, pdf_S, pdf_W, ... 
+ #define X(name, idx, idxinv, x, y, z) VPDFT JOIN(vpdf_,name); PdfT * JOIN(ppdf_,name); + D3Q19_LIST + #undef X + + PdfT * src = kd->Pdfs[0]; + + int nThreads = 1; + int threadId = 0; + + #ifdef _OPENMP + nThreads = omp_get_max_threads(); + threadId = omp_get_thread_num(); + #endif + + const int nodesPlane = gDims[1] * gDims[2]; + const int nodesCol = gDims[2]; + + #define I(x, y, z, dir) P_INDEX_5(gDims, (x), (y), (z), (dir)) + +// TODO: make inline function out of macros. + + #define IMPLODE(_x, _y, _z) (nodesPlane * (_x) + nodesCol * (_y) + (_z)) + #define EXPLODE(index, _x, _y, _z) _x = index / (nodesPlane); _y = (index - nodesPlane * (_x)) / nodesCol; _z = index - nodesPlane * (_x) - nodesCol * (_y); + + int startX = oX; + int startY = oY; + int startZ = oZ; + + int indexStart = IMPLODE(startX, startY, startZ); + int indexEnd = IMPLODE(startX + nX - 1, startY + nY - 1, startZ + nZ - 1); + + // How many cells as multiples of VSIZE do we have (rounded up)? + int nVCells = (indexEnd - indexStart + 1 + VSIZE - 1) / VSIZE; + + int threadStart = nVCells / nThreads * threadId; + int threadEnd = nVCells / nThreads * (threadId + 1); + + if (nVCells % nThreads > threadId) { + threadStart += threadId; + threadEnd += threadId + 1; + } + else { + threadStart += nVCells % nThreads; + threadEnd += nVCells % nThreads; + } + + threadStart *= VSIZE; + threadEnd *= VSIZE; + + // As threadStart/End is now in the granularity of cells we add the start offset. 
+ threadStart += indexStart; + threadEnd += indexStart; + + EXPLODE(threadStart, startX, startY, startZ); + + #undef EXPLODE + #undef IMPLODE + + #define X(name, idx, idxinv, _x, _y, _z) JOIN(ppdf_,name) = &src[I(startX, startY, startZ, idx)]; + D3Q19_LIST + #undef X + + // printf("e thread %d idx start: %d end: %d thread start: %d end: %d\n", + // threadId, indexStart, indexEnd, threadStart, threadEnd); + + + for (int i = threadStart; i < threadEnd; i += VSIZE) { + + // Load PDFs of local cell: pdf_N = src[I(x, y, z, D3Q19_N)]; ... + // #define X(name, idx, idxinv, _x, _y, _z) JOIN(vpdf_,name) = VLDU(&src[I(x, y, z, idx)]); + #define X(name, idx, idxinv, _x, _y, _z) JOIN(vpdf_,name) = VLDU(JOIN(ppdf_,name)); + D3Q19_LIST + #undef X + + + vux = VSUB(VSUB(VSUB(VSUB(VSUB(VADD(VADD(vpdf_E,VADD(vpdf_NE,vpdf_SE)),VADD(vpdf_TE,vpdf_BE)),vpdf_W),vpdf_NW),vpdf_SW),vpdf_TW),vpdf_BW); + vuy = VSUB(VSUB(VSUB(VSUB(VSUB(VADD(VADD(vpdf_N,VADD(vpdf_NE,vpdf_NW)),VADD(vpdf_TN,vpdf_BN)),vpdf_S),vpdf_SE),vpdf_SW),vpdf_TS),vpdf_BS); + vuz = VSUB(VSUB(VSUB(VSUB(VSUB(VADD(VADD(vpdf_T,VADD(vpdf_TE,vpdf_TW)),VADD(vpdf_TN,vpdf_TS)),vpdf_B),vpdf_BE),vpdf_BW),vpdf_BN),vpdf_BS); + + vdens = VADD(VADD(VADD(VADD(VADD(VADD(VADD(VADD(VADD(vpdf_C,VADD(vpdf_N,vpdf_E)),VADD(vpdf_S,vpdf_W)),VADD(vpdf_NE,vpdf_SE)), + VADD(vpdf_SW,vpdf_NW)),VADD(vpdf_T,vpdf_TN)),VADD(vpdf_TE,vpdf_TS)),VADD(vpdf_TW,vpdf_B)), + VADD(vpdf_BN,vpdf_BE)),VADD(vpdf_BS,vpdf_BW)); + + vdir_indep_trm = VSUB(vdens,VMUL(VADD(VADD(VMUL(vux,vux),VMUL(vuy,vuy)),VMUL(vuz,vuz)),VTHREE_HALF)); + + VSTU(ppdf_C, VSUB(vpdf_C,VMUL(vomegaEven,VSUB(vpdf_C,VMUL(vw_0,vdir_indep_trm))))); + + vw_1_indep = VMUL(vw_1,vdir_indep_trm); + vw_2_indep = VMUL(vw_2,vdir_indep_trm); + +#if defined(LOOP_1) || defined(LOOP_2) + #error Loop macros are not allowed to be defined here. 
+#endif + + #define LOOP_1(_dir1, _dir2, _vel) \ + vui = _vel; \ + vpdf_a = JOIN(vpdf_,_dir1); \ + vpdf_b = JOIN(vpdf_,_dir2); \ + \ + vevenPart = VMUL(vomegaEven, VSUB(VSUB(VMUL(VONE_HALF, VADD(vpdf_a, vpdf_b)), VMUL(vui, VMUL(vui, vw_1_nine_half))), vw_1_indep)); \ + voddPart = VMUL(vomegaOdd, VSUB( VMUL(VONE_HALF, VSUB(vpdf_a, vpdf_b)), VMUL(vui, vw_1_x3))); \ + \ + VSTU(JOIN(ppdf_,_dir2), VSUB(VSUB(vpdf_a, vevenPart), voddPart)); \ + VSTU(JOIN(ppdf_,_dir1), VADD(VSUB(vpdf_b, vevenPart), voddPart)); + + #define LOOP_2(_dir1, _dir2, _expr) \ + vui = _expr; \ + vpdf_a = JOIN(vpdf_,_dir1); \ + vpdf_b = JOIN(vpdf_,_dir2); \ + \ + vevenPart = VMUL(vomegaEven, VSUB(VSUB(VMUL(VONE_HALF, VADD(vpdf_a, vpdf_b)), VMUL(vui, VMUL(vui, vw_2_nine_half))), vw_2_indep)); \ + voddPart = VMUL(vomegaOdd, VSUB( VMUL(VONE_HALF, VSUB(vpdf_a, vpdf_b)), VMUL(vui, vw_2_x3))); \ + \ + VSTU(JOIN(ppdf_,_dir2), VSUB(VSUB(vpdf_a, vevenPart), voddPart)); \ + VSTU(JOIN(ppdf_,_dir1), VADD(VSUB(vpdf_b, vevenPart), voddPart)); + + LOOP_1(N, S, vuy); + LOOP_1(E, W, vux); + LOOP_1(T, B, vuz); + + LOOP_2(NW, SE, VSUB(vuy, vux)); + LOOP_2(NE, SW, VADD(vuy, vux)); + LOOP_2(TW, BE, VSUB(vuz, vux)); + LOOP_2(TE, BW, VADD(vuz, vux)); + LOOP_2(TS, BN, VSUB(vuz, vuy)); + LOOP_2(TN, BS, VADD(vuz, vuy)); + + #undef LOOP_1 + #undef LOOP_2 + + #define X(name, idx, idxinv, _x, _y, _z) JOIN(ppdf_,name) += VSIZE; + D3Q19_LIST + #undef X + } + + #undef I + + return; +} // }}} + + +static void KernelOddVecSl(LatticeDesc * ld, KernelData * kd, CaseData * cd) // {{{ +{ + Assert(ld != NULL); + Assert(kd != NULL); + Assert(cd != NULL); + + Assert(cd->Omega > 0.0); + Assert(cd->Omega < F(2.0)); + + KernelDataAa * kda = KDA(kd); + + int nX = ld->Dims[0]; + int nY = ld->Dims[1]; + int nZ = ld->Dims[2]; + + int * gDims = kd->GlobalDims; + + int oX = kd->Offsets[0]; + int oY = kd->Offsets[1]; + int oZ = kd->Offsets[2]; + + int blk[3]; + blk[0] = kda->Blk[0]; + blk[1] = kda->Blk[1]; + blk[2] = kda->Blk[2]; + + PdfT omega = 
cd->Omega; + PdfT omegaEven = omega; + + PdfT magicParam = F(1.0) / F(12.0); + PdfT omegaOdd = F(1.0) / (F(0.5) + magicParam / (F(1.0) / omega - F(0.5))); + + const PdfT w_0 = F(1.0) / F( 3.0); + const PdfT w_1 = F(1.0) / F(18.0); + const PdfT w_2 = F(1.0) / F(36.0); + + const PdfT w_1_x3 = w_1 * F(3.0); const PdfT w_1_nine_half = w_1 * F(9.0) / F(2.0); + const PdfT w_2_x3 = w_2 * F(3.0); const PdfT w_2_nine_half = w_2 * F(9.0) / F(2.0); + + VPDFT VONE_HALF = VSET(F(0.5)); + VPDFT VTHREE_HALF = VSET(F(3.0) / F(2.0)); + + VPDFT vw_1_indep, vw_2_indep; + VPDFT vw_0 = VSET(w_0); + VPDFT vw_1 = VSET(w_1); + VPDFT vw_2 = VSET(w_2); + + VPDFT vw_1_x3 = VSET(w_1_x3); + VPDFT vw_2_x3 = VSET(w_2_x3); + VPDFT vw_1_nine_half = VSET(w_1_nine_half); + VPDFT vw_2_nine_half = VSET(w_2_nine_half); + + VPDFT vui, vux, vuy, vuz, vdens; + + VPDFT vevenPart, voddPart, vdir_indep_trm; + + VPDFT vomegaEven = VSET(omegaEven); + VPDFT vomegaOdd = VSET(omegaOdd); + + VPDFT vpdf_a, vpdf_b; + + // Declare pdf_N, pdf_E, pdf_S, pdf_W, ... + #define X(name, idx, idxinv, x, y, z) VPDFT JOIN(vpdf_,name); PdfT * JOIN(ppdf_,idx); + D3Q19_LIST + #undef X + + PdfT * src = kd->Pdfs[0]; + + int nThreads = 1; + int threadId = 0; + + #ifdef _OPENMP + nThreads = omp_get_max_threads(); + threadId = omp_get_thread_num(); + #endif + + const int nodesPlane = gDims[1] * gDims[2]; + const int nodesCol = gDims[2]; + + #define I(x, y, z, dir) P_INDEX_5(gDims, (x), (y), (z), (dir)) + +// TODO: make inline function out of macros. + + #define IMPLODE(_x, _y, _z) (nodesPlane * (_x) + nodesCol * (_y) + (_z)) + #define EXPLODE(index, _x, _y, _z) _x = index / (nodesPlane); _y = (index - nodesPlane * (_x)) / nodesCol; _z = index - nodesPlane * (_x) - nodesCol * (_y); + + int startX = oX; + int startY = oY; + int startZ = oZ; + + int indexStart = IMPLODE(startX, startY, startZ); + int indexEnd = IMPLODE(startX + nX - 1, startY + nY - 1, startZ + nZ - 1); + + // How many multiples of VSIZE cells (rounded up) do we have? 
+ int nVCells = (indexEnd - indexStart + 1 + VSIZE - 1) / VSIZE; + + int threadStart = nVCells / nThreads * threadId; + int threadEnd = nVCells / nThreads * (threadId + 1); + + if (nVCells % nThreads > threadId) { + threadStart += threadId; + threadEnd += threadId + 1; + } + else { + threadStart += nVCells % nThreads; + threadEnd += nVCells % nThreads; + } + + threadStart *= VSIZE; + threadEnd *= VSIZE; + + // As threadStart/End is now in the granularity of cells we add the start offset. + threadStart += indexStart; + threadEnd += indexStart; + + EXPLODE(threadStart, startX, startY, startZ); + + #undef EXPLODE + #undef IMPLODE + + // printf("o thread %d idx start: %d end: %d thread start: %d end: %d\n", + // threadId, indexStart, indexEnd, threadStart, threadEnd); + + #define X(name, idx, idxinv, _x, _y, _z) JOIN(ppdf_,idx) = &src[I(startX + _x, startY + _y, startZ + _z, idx)]; + D3Q19_LIST + #undef X + +#if DEBUG_EXTENDED + + #define X(name, idx, idxinv, x, y, z) PdfT * JOIN(ppdf_start_,idx), * JOIN(ppdf_end_,idx); + D3Q19_LIST + #undef X + + #define X(name, idx, idxinv, _x, _y, _z) JOIN(ppdf_start_,idx) = &src[I(startX + _x, startY + _y, startZ + _z, idx)]; + D3Q19_LIST + #undef X + + #define X(name, idx, idxinv, _x, _y, _z) JOIN(ppdf_end_,idx) = &src[I(startX + nX - 1 + _x, startY + nY - 1 + _y, startZ + nZ - 1 + _z, idx)]; + D3Q19_LIST + #undef X + +#if 0 + #define X(name, idx, idxinv, _x, _y, _z) printf("%2s ppdf_%d = %p (%d %d %d) (%d %d %d)\n", STRINGIFY(name), idx, JOIN(ppdf_,idx), \ +startX , startY , startZ , startX + _x, startY + _y, startZ + _z); + D3Q19_LIST + #undef X +#endif + +#endif // DEBUG_EXTENDED + + + for (int i = threadStart; i < threadEnd; i += VSIZE) { + +#if DEBUG_EXTENDED + #define X(name, idx, idxinv, _x, _y, _z) Assert((unsigned long)(JOIN(ppdf_,idx)) >= (unsigned long)(JOIN(ppdf_start_,idx))); Assert((unsigned long)(JOIN(ppdf_,idx)) <= (unsigned long)(JOIN(ppdf_end_,idx))); + D3Q19_LIST + #undef X +#endif + + #define X(name, idx, 
idxinv, _x, _y, _z) JOIN(vpdf_,name) = VLDU(JOIN(ppdf_,idxinv)); + D3Q19_LIST + #undef X + + vux = VSUB(VSUB(VSUB(VSUB(VSUB(VADD(VADD(vpdf_E,VADD(vpdf_NE,vpdf_SE)),VADD(vpdf_TE,vpdf_BE)),vpdf_W),vpdf_NW),vpdf_SW),vpdf_TW),vpdf_BW); + vuy = VSUB(VSUB(VSUB(VSUB(VSUB(VADD(VADD(vpdf_N,VADD(vpdf_NE,vpdf_NW)),VADD(vpdf_TN,vpdf_BN)),vpdf_S),vpdf_SE),vpdf_SW),vpdf_TS),vpdf_BS); + vuz = VSUB(VSUB(VSUB(VSUB(VSUB(VADD(VADD(vpdf_T,VADD(vpdf_TE,vpdf_TW)),VADD(vpdf_TN,vpdf_TS)),vpdf_B),vpdf_BE),vpdf_BW),vpdf_BN),vpdf_BS); + + vdens = VADD(VADD(VADD(VADD(VADD(VADD(VADD(VADD(VADD(vpdf_C,VADD(vpdf_N,vpdf_E)),VADD(vpdf_S,vpdf_W)),VADD(vpdf_NE,vpdf_SE)), + VADD(vpdf_SW,vpdf_NW)),VADD(vpdf_T,vpdf_TN)),VADD(vpdf_TE,vpdf_TS)),VADD(vpdf_TW,vpdf_B)),VADD(vpdf_BN,vpdf_BE)),VADD(vpdf_BS,vpdf_BW)); + + vdir_indep_trm = VSUB(vdens,VMUL(VADD(VADD(VMUL(vux,vux),VMUL(vuy,vuy)),VMUL(vuz,vuz)),VTHREE_HALF)); + + // ppdf_18 is the pointer to the center pdfs. + VSTU(ppdf_18, VSUB(vpdf_C,VMUL(vomegaEven,VSUB(vpdf_C,VMUL(vw_0,vdir_indep_trm))))); + + vw_1_indep = VMUL(vw_1,vdir_indep_trm); + vw_2_indep = VMUL(vw_2,vdir_indep_trm); + +#if defined(LOOP_1) || defined(LOOP_2) + #error Loop macros are not allowed to be defined here. 
+#endif + + #define LOOP_1(_dir1, _dir2, _idx1, _idx2, _vel) \ + vui = _vel; \ + vpdf_a = JOIN(vpdf_,_dir1); \ + vpdf_b = JOIN(vpdf_,_dir2); \ + \ + vevenPart = VMUL(vomegaEven, VSUB(VSUB(VMUL(VONE_HALF, VADD(vpdf_a, vpdf_b)), VMUL(vui, VMUL(vui, vw_1_nine_half))), vw_1_indep)); \ + voddPart = VMUL(vomegaOdd, VSUB( VMUL(VONE_HALF, VSUB(vpdf_a, vpdf_b)), VMUL(vui, vw_1_x3))); \ + \ + VSTU(JOIN(ppdf_,_idx1), VSUB(VSUB(vpdf_a, vevenPart), voddPart)); \ + VSTU(JOIN(ppdf_,_idx2), VADD(VSUB(vpdf_b, vevenPart), voddPart)); + + #define LOOP_2(_dir1, _dir2, _idx1, _idx2, _expr) \ + vui = _expr; \ + vpdf_a = JOIN(vpdf_,_dir1); \ + vpdf_b = JOIN(vpdf_,_dir2); \ + \ + vevenPart = VMUL(vomegaEven, VSUB(VSUB(VMUL(VONE_HALF, VADD(vpdf_a, vpdf_b)), VMUL(vui, VMUL(vui, vw_2_nine_half))), vw_2_indep)); \ + voddPart = VMUL(vomegaOdd, VSUB( VMUL(VONE_HALF, VSUB(vpdf_a, vpdf_b)), VMUL(vui, vw_2_x3))); \ + \ + VSTU(JOIN(ppdf_,_idx1), VSUB(VSUB(vpdf_a, vevenPart), voddPart)); \ + VSTU(JOIN(ppdf_,_idx2), VADD(VSUB(vpdf_b, vevenPart), voddPart)); + + + LOOP_1(N, S, D3Q19_N, D3Q19_S, vuy); + LOOP_1(E, W, D3Q19_E, D3Q19_W, vux); + LOOP_1(T, B, D3Q19_T, D3Q19_B, vuz); + + LOOP_2(NW, SE, D3Q19_NW, D3Q19_SE, VSUB(vuy, vux)); + LOOP_2(NE, SW, D3Q19_NE, D3Q19_SW, VADD(vuy, vux)); + LOOP_2(TW, BE, D3Q19_TW, D3Q19_BE, VSUB(vuz, vux)); + LOOP_2(TE, BW, D3Q19_TE, D3Q19_BW, VADD(vuz, vux)); + LOOP_2(TS, BN, D3Q19_TS, D3Q19_BN, VSUB(vuz, vuy)); + LOOP_2(TN, BS, D3Q19_TN, D3Q19_BS, VADD(vuz, vuy)); + + #define X(name, idx, idxinv, _x, _y, _z) JOIN(ppdf_,idx) += VSIZE; + D3Q19_LIST + #undef X + } + + #undef I + + return; + +} // }}} diff --git a/src/BenchKernelD3Q19AaVecSl.h b/src/BenchKernelD3Q19AaVecSl.h new file mode 100644 index 0000000..10ea6ee --- /dev/null +++ b/src/BenchKernelD3Q19AaVecSl.h @@ -0,0 +1,38 @@ +// -------------------------------------------------------------------------- +// +// Copyright +// Markus Wittmann, 2016-2017 +// RRZE, University of Erlangen-Nuremberg, Germany +// 
markus.wittmann -at- fau.de or hpc -at- rrze.fau.de +// +// Viktor Haag, 2016 +// LSS, University of Erlangen-Nuremberg, Germany +// +// This file is part of the Lattice Boltzmann Benchmark Kernels (LbmBenchKernels). +// +// LbmBenchKernels is free software: you can redistribute it and/or modify +// it under the terms of the GNU General Public License as published by +// the Free Software Foundation, either version 3 of the License, or +// (at your option) any later version. +// +// LbmBenchKernels is distributed in the hope that it will be useful, +// but WITHOUT ANY WARRANTY; without even the implied warranty of +// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +// GNU General Public License for more details. +// +// You should have received a copy of the GNU General Public License +// along with LbmBenchKernels. If not, see <http://www.gnu.org/licenses/>. +// +// -------------------------------------------------------------------------- +#ifndef __BENCH_KERNEL_D3Q19_AA_VEC_SL__ +#define __BENCH_KERNEL_D3Q19_AA_VEC_SL__ + +#include "Kernel.h" + + +void D3Q19AaVecSlInit_AaSoA(LatticeDesc * ld, KernelData ** kernelData, Parameters * params); +void D3Q19AaVecSlDeinit_AaSoA(LatticeDesc * ld, KernelData ** kernelData); + + + +#endif // __BENCH_KERNEL_D3Q19_AA_VEC_SL__ diff --git a/src/BenchKernelD3Q19AaVecSlCommon.c b/src/BenchKernelD3Q19AaVecSlCommon.c new file mode 100644 index 0000000..2c89ea6 --- /dev/null +++ b/src/BenchKernelD3Q19AaVecSlCommon.c @@ -0,0 +1,60 @@ +// -------------------------------------------------------------------------- +// +// Copyright +// Markus Wittmann, 2016-2017 +// RRZE, University of Erlangen-Nuremberg, Germany +// markus.wittmann -at- fau.de or hpc -at- rrze.fau.de +// +// Viktor Haag, 2016 +// LSS, University of Erlangen-Nuremberg, Germany +// +// This file is part of the Lattice Boltzmann Benchmark Kernels (LbmBenchKernels). 
+// +// LbmBenchKernels is free software: you can redistribute it and/or modify +// it under the terms of the GNU General Public License as published by +// the Free Software Foundation, either version 3 of the License, or +// (at your option) any later version. +// +// LbmBenchKernels is distributed in the hope that it will be useful, +// but WITHOUT ANY WARRANTY; without even the implied warranty of +// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +// GNU General Public License for more details. +// +// You should have received a copy of the GNU General Public License +// along with LbmBenchKernels. If not, see <http://www.gnu.org/licenses/>. +// +// -------------------------------------------------------------------------- +#include "BenchKernelD3Q19AaVecSlCommon.h" +#include "BenchKernelD3Q19AaVec.h" + + +#include "Memory.h" +#include "Vtk.h" +#include "Vector.h" + +#include +#include + +#ifdef _OPENMP + #include <omp.h> +#endif + +// Forward declaration. +void FNAME(D3Q19AaVecSlKernel)(LatticeDesc * ld, struct KernelData_ * kd, CaseData * cd); + +void FNAME(D3Q19AaVecSlInit)(LatticeDesc * ld, KernelData ** kd, Parameters * params) +{ + FNAME(D3Q19AaVecInit)(ld, kd, params); + + (*kd)->Kernel = FNAME(D3Q19AaVecSlKernel); + + return; +} + +void FNAME(D3Q19AaVecSlDeinit)(LatticeDesc * ld, KernelData ** kd) +{ + FNAME(D3Q19AaVecDeinit)(ld, kd); + + return; +} + diff --git a/src/BenchKernelD3Q19AaVecSlCommon.h b/src/BenchKernelD3Q19AaVecSlCommon.h new file mode 100644 index 0000000..bc76113 --- /dev/null +++ b/src/BenchKernelD3Q19AaVecSlCommon.h @@ -0,0 +1,37 @@ +// -------------------------------------------------------------------------- +// +// Copyright +// Markus Wittmann, 2016-2017 +// RRZE, University of Erlangen-Nuremberg, Germany +// markus.wittmann -at- fau.de or hpc -at- rrze.fau.de +// +// Viktor Haag, 2016 +// LSS, University of Erlangen-Nuremberg, Germany +// +// This file is part of the Lattice Boltzmann Benchmark Kernels (LbmBenchKernels). 
+// +// LbmBenchKernels is free software: you can redistribute it and/or modify +// it under the terms of the GNU General Public License as published by +// the Free Software Foundation, either version 3 of the License, or +// (at your option) any later version. +// +// LbmBenchKernels is distributed in the hope that it will be useful, +// but WITHOUT ANY WARRANTY; without even the implied warranty of +// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +// GNU General Public License for more details. +// +// You should have received a copy of the GNU General Public License +// along with LbmBenchKernels. If not, see . +// +// -------------------------------------------------------------------------- +#ifndef __BENCH_KERNEL_D3Q19_AA_VEC_SL_COMMON_H__ +#define __BENCH_KERNEL_D3Q19_AA_VEC_SL_COMMON_H__ + + +#include "Kernel.h" + +#include "BenchKernelD3Q19AaVecCommon.h" + + +#endif // __BENCH_KERNEL_D3Q19_AA_VEC_SL_COMMON_H__ + diff --git a/src/BenchKernelD3Q19List.c b/src/BenchKernelD3Q19List.c index e01853a..4adb858 100644 --- a/src/BenchKernelD3Q19List.c +++ b/src/BenchKernelD3Q19List.c @@ -40,28 +40,27 @@ void FNAME(D3Q19ListKernel)(LatticeDesc * ld, KernelData * kernelData, CaseData Assert(kernelData != NULL); Assert(cd != NULL); - Assert(cd->Omega > 0.0); - Assert(cd->Omega < 2.0); + Assert(cd->Omega > F(0.0)); + Assert(cd->Omega < F(2.0)); KernelData * kd = (KernelData *)kernelData; KernelDataList * kdl = (KernelDataList *)kernelData; PdfT omega = cd->Omega; PdfT omegaEven = omega; -// PdfT omegaOdd = 8.0*((2.0-omegaEven)/(8.0-omegaEven)); //"standard" trt odd relaxation parameter - PdfT magicParam = 1.0/12.0; // 1/4: best stability; 1/12: removes third-order advection error (best advection); 1/6: removes fourth-order diffusion error (best diffusion); 3/16: exact location of bounce back for poiseuille flow - PdfT omegaOdd = 1.0/( 0.5 + magicParam/(1.0/omega - 0.5) ); + PdfT magicParam = F(1.0) / F(12.0); + PdfT omegaOdd = F(1.0) /(F(0.5) + 
magicParam / (F(1.0) / omega - F(0.5))); - PdfT evenPart = 0.0; - PdfT oddPart = 0.0; - PdfT dir_indep_trm = 0.0; + PdfT evenPart = F(0.0); + PdfT oddPart = F(0.0); + PdfT dir_indep_trm = F(0.0); - PdfT w_0 = 1.0 / 3.0; - PdfT w_1 = 1.0 / 18.0; - PdfT w_2 = 1.0 / 36.0; + PdfT w_0 = F(1.0) / F( 3.0); + PdfT w_1 = F(1.0) / F(18.0); + PdfT w_2 = F(1.0) / F(36.0); - PdfT w_1_x3 = w_1 * 3.0; PdfT w_1_nine_half = w_1 * 9.0/2.0; PdfT w_1_indep = 0.0; - PdfT w_2_x3 = w_2 * 3.0; PdfT w_2_nine_half = w_2 * 9.0/2.0; PdfT w_2_indep = 0.0; + PdfT w_1_x3 = w_1 * F(3.0); PdfT w_1_nine_half = w_1 * F(9.0) / F(2.0); PdfT w_1_indep = F(0.0); + PdfT w_2_x3 = w_2 * F(3.0); PdfT w_2_nine_half = w_2 * F(9.0) / F(2.0); PdfT w_2_indep = F(0.0); PdfT ux, uy, uz, ui; PdfT dens; @@ -112,6 +111,9 @@ void FNAME(D3Q19ListKernel)(LatticeDesc * ld, KernelData * kernelData, CaseData pdf_B, pdf_BN, pdf_BE, pdf_BS, pdf_BW, \ evenPart, oddPart, w_1_indep, w_2_indep) #endif + #ifdef INTEL_OPT_DIRECTIVES + #pragma ivdep + #endif for (int index = 0; index < nFluid; ++index) { #define I(index, dir) P_INDEX_3((nCells), (index), (dir)) @@ -171,7 +173,7 @@ void FNAME(D3Q19ListKernel)(LatticeDesc * ld, KernelData * kernelData, CaseData pdf_T + pdf_TN + pdf_TE + pdf_TS + pdf_TW + pdf_B + pdf_BN + pdf_BE + pdf_BS + pdf_BW; - dir_indep_trm = dens - (ux * ux + uy * uy + uz * uz)*3.0/2.0; + dir_indep_trm = dens - (ux * ux + uy * uy + uz * uz) * F(3.0) / F(2.0); #ifdef PROP_MODEL_PUSH @@ -184,20 +186,20 @@ void FNAME(D3Q19ListKernel)(LatticeDesc * ld, KernelData * kernelData, CaseData w_1_indep = w_1*dir_indep_trm; ui = uy; - evenPart = omegaEven*( 0.5*(pdf_N + pdf_S) - ui*ui*w_1_nine_half - w_1_indep ); - oddPart = omegaOdd*(0.5*(pdf_N - pdf_S) - ui*w_1_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_N + pdf_S) - ui*ui*w_1_nine_half - w_1_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_N - pdf_S) - ui*w_1_x3 ); dst[adjList[adjListIndex + D3Q19_N]] = pdf_N - evenPart - oddPart; dst[adjList[adjListIndex + D3Q19_S]] = pdf_S - 
evenPart + oddPart; ui = ux; - evenPart = omegaEven*( 0.5*(pdf_E + pdf_W) - ui*ui*w_1_nine_half - w_1_indep ); - oddPart = omegaOdd*(0.5*(pdf_E - pdf_W) - ui*w_1_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_E + pdf_W) - ui*ui*w_1_nine_half - w_1_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_E - pdf_W) - ui*w_1_x3 ); dst[adjList[adjListIndex + D3Q19_E]] = pdf_E - evenPart - oddPart; dst[adjList[adjListIndex + D3Q19_W]] = pdf_W - evenPart + oddPart; ui = uz; - evenPart = omegaEven*( 0.5*(pdf_T + pdf_B) - ui*ui*w_1_nine_half - w_1_indep ); - oddPart = omegaOdd*(0.5*(pdf_T - pdf_B) - ui*w_1_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_T + pdf_B) - ui*ui*w_1_nine_half - w_1_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_T - pdf_B) - ui*w_1_x3 ); dst[adjList[adjListIndex + D3Q19_T]] = pdf_T - evenPart - oddPart; dst[adjList[adjListIndex + D3Q19_B]] = pdf_B - evenPart + oddPart; @@ -205,38 +207,38 @@ void FNAME(D3Q19ListKernel)(LatticeDesc * ld, KernelData * kernelData, CaseData w_2_indep = w_2*dir_indep_trm; ui = -ux + uy; - evenPart = omegaEven*( 0.5*(pdf_NW + pdf_SE) - ui*ui*w_2_nine_half - w_2_indep ); - oddPart = omegaOdd*(0.5*(pdf_NW - pdf_SE) - ui*w_2_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_NW + pdf_SE) - ui*ui*w_2_nine_half - w_2_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_NW - pdf_SE) - ui*w_2_x3 ); dst[adjList[adjListIndex + D3Q19_NW]] = pdf_NW - evenPart - oddPart; dst[adjList[adjListIndex + D3Q19_SE]] = pdf_SE - evenPart + oddPart; ui = ux + uy; - evenPart = omegaEven*( 0.5*(pdf_NE + pdf_SW) - ui*ui*w_2_nine_half - w_2_indep ); - oddPart = omegaOdd*(0.5*(pdf_NE - pdf_SW) - ui*w_2_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_NE + pdf_SW) - ui*ui*w_2_nine_half - w_2_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_NE - pdf_SW) - ui*w_2_x3 ); dst[adjList[adjListIndex + D3Q19_NE]] = pdf_NE - evenPart - oddPart; dst[adjList[adjListIndex + D3Q19_SW]] = pdf_SW - evenPart + oddPart; ui = -ux + uz; - evenPart = omegaEven*( 0.5*(pdf_TW + pdf_BE) - ui*ui*w_2_nine_half - w_2_indep ); - 
oddPart = omegaOdd*(0.5*(pdf_TW - pdf_BE) - ui*w_2_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_TW + pdf_BE) - ui*ui*w_2_nine_half - w_2_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_TW - pdf_BE) - ui*w_2_x3 ); dst[adjList[adjListIndex + D3Q19_TW]] = pdf_TW - evenPart - oddPart; dst[adjList[adjListIndex + D3Q19_BE]] = pdf_BE - evenPart + oddPart; ui = ux + uz; - evenPart = omegaEven*( 0.5*(pdf_TE + pdf_BW) - ui*ui*w_2_nine_half - w_2_indep ); - oddPart = omegaOdd*(0.5*(pdf_TE - pdf_BW) - ui*w_2_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_TE + pdf_BW) - ui*ui*w_2_nine_half - w_2_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_TE - pdf_BW) - ui*w_2_x3 ); dst[adjList[adjListIndex + D3Q19_TE]] = pdf_TE - evenPart - oddPart; dst[adjList[adjListIndex + D3Q19_BW]] = pdf_BW - evenPart + oddPart; ui = -uy + uz; - evenPart = omegaEven*( 0.5*(pdf_TS + pdf_BN) - ui*ui*w_2_nine_half - w_2_indep ); - oddPart = omegaOdd*(0.5*(pdf_TS - pdf_BN) - ui*w_2_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_TS + pdf_BN) - ui*ui*w_2_nine_half - w_2_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_TS - pdf_BN) - ui*w_2_x3 ); dst[adjList[adjListIndex + D3Q19_TS]] = pdf_TS - evenPart - oddPart; dst[adjList[adjListIndex + D3Q19_BN]] = pdf_BN - evenPart + oddPart; ui = uy + uz; - evenPart = omegaEven*( 0.5*(pdf_TN + pdf_BS) - ui*ui*w_2_nine_half - w_2_indep ); - oddPart = omegaOdd*(0.5*(pdf_TN - pdf_BS) - ui*w_2_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_TN + pdf_BS) - ui*ui*w_2_nine_half - w_2_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_TN - pdf_BS) - ui*w_2_x3 ); dst[adjList[adjListIndex + D3Q19_TN]] = pdf_TN - evenPart - oddPart; dst[adjList[adjListIndex + D3Q19_BS]] = pdf_BS - evenPart + oddPart; @@ -249,20 +251,20 @@ void FNAME(D3Q19ListKernel)(LatticeDesc * ld, KernelData * kernelData, CaseData w_1_indep = w_1*dir_indep_trm; ui = uy; - evenPart = omegaEven*( 0.5*(pdf_N + pdf_S) - ui*ui*w_1_nine_half - w_1_indep ); - oddPart = omegaOdd*(0.5*(pdf_N - pdf_S) - ui*w_1_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_N + 
pdf_S) - ui*ui*w_1_nine_half - w_1_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_N - pdf_S) - ui*w_1_x3 ); dst[I(index, D3Q19_N )] = pdf_N - evenPart - oddPart; dst[I(index, D3Q19_S )] = pdf_S - evenPart + oddPart; ui = ux; - evenPart = omegaEven*( 0.5*(pdf_E + pdf_W) - ui*ui*w_1_nine_half - w_1_indep ); - oddPart = omegaOdd*(0.5*(pdf_E - pdf_W) - ui*w_1_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_E + pdf_W) - ui*ui*w_1_nine_half - w_1_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_E - pdf_W) - ui*w_1_x3 ); dst[I(index, D3Q19_E )] = pdf_E - evenPart - oddPart; dst[I(index, D3Q19_W )] = pdf_W - evenPart + oddPart; ui = uz; - evenPart = omegaEven*( 0.5*(pdf_T + pdf_B) - ui*ui*w_1_nine_half - w_1_indep ); - oddPart = omegaOdd*(0.5*(pdf_T - pdf_B) - ui*w_1_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_T + pdf_B) - ui*ui*w_1_nine_half - w_1_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_T - pdf_B) - ui*w_1_x3 ); dst[I(index, D3Q19_T )] = pdf_T - evenPart - oddPart; dst[I(index, D3Q19_B )] = pdf_B - evenPart + oddPart; @@ -270,38 +272,38 @@ void FNAME(D3Q19ListKernel)(LatticeDesc * ld, KernelData * kernelData, CaseData w_2_indep = w_2*dir_indep_trm; ui = -ux + uy; - evenPart = omegaEven*( 0.5*(pdf_NW + pdf_SE) - ui*ui*w_2_nine_half - w_2_indep ); - oddPart = omegaOdd*(0.5*(pdf_NW - pdf_SE) - ui*w_2_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_NW + pdf_SE) - ui*ui*w_2_nine_half - w_2_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_NW - pdf_SE) - ui*w_2_x3 ); dst[I(index, D3Q19_NW)] = pdf_NW - evenPart - oddPart; dst[I(index, D3Q19_SE)] = pdf_SE - evenPart + oddPart; ui = ux + uy; - evenPart = omegaEven*( 0.5*(pdf_NE + pdf_SW) - ui*ui*w_2_nine_half - w_2_indep ); - oddPart = omegaOdd*(0.5*(pdf_NE - pdf_SW) - ui*w_2_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_NE + pdf_SW) - ui*ui*w_2_nine_half - w_2_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_NE - pdf_SW) - ui*w_2_x3 ); dst[I(index, D3Q19_NE)] = pdf_NE - evenPart - oddPart; dst[I(index, D3Q19_SW)] = pdf_SW - evenPart + oddPart; ui = -ux + uz; - 
evenPart = omegaEven*( 0.5*(pdf_TW + pdf_BE) - ui*ui*w_2_nine_half - w_2_indep ); - oddPart = omegaOdd*(0.5*(pdf_TW - pdf_BE) - ui*w_2_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_TW + pdf_BE) - ui*ui*w_2_nine_half - w_2_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_TW - pdf_BE) - ui*w_2_x3 ); dst[I(index, D3Q19_TW)] = pdf_TW - evenPart - oddPart; dst[I(index, D3Q19_BE)] = pdf_BE - evenPart + oddPart; ui = ux + uz; - evenPart = omegaEven*( 0.5*(pdf_TE + pdf_BW) - ui*ui*w_2_nine_half - w_2_indep ); - oddPart = omegaOdd*(0.5*(pdf_TE - pdf_BW) - ui*w_2_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_TE + pdf_BW) - ui*ui*w_2_nine_half - w_2_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_TE - pdf_BW) - ui*w_2_x3 ); dst[I(index, D3Q19_TE)] = pdf_TE - evenPart - oddPart; dst[I(index, D3Q19_BW)] = pdf_BW - evenPart + oddPart; ui = -uy + uz; - evenPart = omegaEven*( 0.5*(pdf_TS + pdf_BN) - ui*ui*w_2_nine_half - w_2_indep ); - oddPart = omegaOdd*(0.5*(pdf_TS - pdf_BN) - ui*w_2_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_TS + pdf_BN) - ui*ui*w_2_nine_half - w_2_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_TS - pdf_BN) - ui*w_2_x3 ); dst[I(index, D3Q19_TS)] = pdf_TS - evenPart - oddPart; dst[I(index, D3Q19_BN)] = pdf_BN - evenPart + oddPart; ui = uy + uz; - evenPart = omegaEven*( 0.5*(pdf_TN + pdf_BS) - ui*ui*w_2_nine_half - w_2_indep ); - oddPart = omegaOdd*(0.5*(pdf_TN - pdf_BS) - ui*w_2_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_TN + pdf_BS) - ui*ui*w_2_nine_half - w_2_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_TN - pdf_BS) - ui*w_2_x3 ); dst[I(index, D3Q19_TN)] = pdf_TN - evenPart - oddPart; dst[I(index, D3Q19_BS)] = pdf_BS - evenPart + oddPart; diff --git a/src/BenchKernelD3Q19ListAa.c b/src/BenchKernelD3Q19ListAa.c index 2c3572c..045a396 100644 --- a/src/BenchKernelD3Q19ListAa.c +++ b/src/BenchKernelD3Q19ListAa.c @@ -39,8 +39,8 @@ void FNAME(D3Q19ListAaKernel)(LatticeDesc * ld, KernelData * kernelData, CaseDat Assert(kernelData != NULL); Assert(cd != NULL); - Assert(cd->Omega > 0.0); - 
Assert(cd->Omega < 2.0); + Assert(cd->Omega > F(0.0)); + Assert(cd->Omega < F(2.0)); KernelData * kd = (KernelData *)kernelData; KernelDataList * kdl = (KernelDataList *)kernelData; @@ -51,19 +51,19 @@ void FNAME(D3Q19ListAaKernel)(LatticeDesc * ld, KernelData * kernelData, CaseDat // 1/12: removes third-order advection error (best advection); // 1/6: removes fourth-order diffusion error (best diffusion); // 3/16: exact location of bounce back for poiseuille flow - PdfT magicParam = 1.0/12.0; - PdfT omegaOdd = 1.0/( 0.5 + magicParam/(1.0/omega - 0.5) ); + PdfT magicParam = F(1.0)/F(12.0); + PdfT omegaOdd = F(1.0)/( F(0.5) + magicParam/(F(1.0)/omega - F(0.5)) ); - PdfT evenPart = 0.0; - PdfT oddPart = 0.0; - PdfT dir_indep_trm = 0.0; + PdfT evenPart = F(0.0); + PdfT oddPart = F(0.0); + PdfT dir_indep_trm = F(0.0); - const PdfT w_0 = 1.0 / 3.0; - const PdfT w_1 = 1.0 / 18.0; - const PdfT w_2 = 1.0 / 36.0; + const PdfT w_0 = F(1.0) / F(3.0); + const PdfT w_1 = F(1.0) / F(18.0); + const PdfT w_2 = F(1.0) / F(36.0); - const PdfT w_1_x3 = w_1 * 3.0; const PdfT w_1_nine_half = w_1 * 9.0/2.0; PdfT w_1_indep = 0.0; - const PdfT w_2_x3 = w_2 * 3.0; const PdfT w_2_nine_half = w_2 * 9.0/2.0; PdfT w_2_indep = 0.0; + const PdfT w_1_x3 = w_1 * F(3.0); const PdfT w_1_nine_half = w_1 * F(9.0)/F(2.0); PdfT w_1_indep = F(0.0); + const PdfT w_2_x3 = w_2 * F(3.0); const PdfT w_2_nine_half = w_2 * F(9.0)/F(2.0); PdfT w_2_indep = F(0.0); PdfT ui; @@ -107,7 +107,7 @@ void FNAME(D3Q19ListAaKernel)(LatticeDesc * ld, KernelData * kernelData, CaseDat X_LIKWID_START("list-aa-even"); - #ifdef _OPENMP +#ifdef _OPENMP #pragma omp parallel for default(none) \ shared(nFluid, nCells, kd, kdl, adjList, omegaOdd, omegaEven, src) \ private(ux, uy, uz, dens, adjListIndex, evenPart, oddPart, dir_indep_trm, w_1_indep, w_2_indep, ui,\ @@ -116,7 +116,12 @@ void FNAME(D3Q19ListAaKernel)(LatticeDesc * ld, KernelData * kernelData, CaseDat pdf_NE, pdf_SE, pdf_SW, pdf_NW, \ pdf_T, pdf_TN, pdf_TE, pdf_TS, pdf_TW, 
\ pdf_B, pdf_BN, pdf_BE, pdf_BS, pdf_BW) - #endif +#endif +#ifdef INTEL_OPT_DIRECTIVES + #pragma ivdep + #pragma vector always + #pragma simd +#endif for (int index = 0; index < nFluid; ++index) { @@ -160,7 +165,7 @@ void FNAME(D3Q19ListAaKernel)(LatticeDesc * ld, KernelData * kernelData, CaseDat pdf_T + pdf_TN + pdf_TE + pdf_TS + pdf_TW + pdf_B + pdf_BN + pdf_BE + pdf_BS + pdf_BW; - dir_indep_trm = dens - (ux * ux + uy * uy + uz * uz)*3.0/2.0; + dir_indep_trm = dens - (ux * ux + uy * uy + uz * uz)*F(3.0)/F(2.0); // direction: w_0 src[I(index, D3Q19_C) ] = pdf_C - omegaEven*(pdf_C - w_0*dir_indep_trm); @@ -169,20 +174,20 @@ void FNAME(D3Q19ListAaKernel)(LatticeDesc * ld, KernelData * kernelData, CaseDat w_1_indep = w_1*dir_indep_trm; ui = uy; - evenPart = omegaEven*( 0.5*(pdf_N + pdf_S) - ui*ui*w_1_nine_half - w_1_indep ); - oddPart = omegaOdd*(0.5*(pdf_N - pdf_S) - ui*w_1_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_N + pdf_S) - ui*ui*w_1_nine_half - w_1_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_N - pdf_S) - ui*w_1_x3 ); src[I(index, D3Q19_S)] = pdf_N - evenPart - oddPart; src[I(index, D3Q19_N)] = pdf_S - evenPart + oddPart; ui = ux; - evenPart = omegaEven*( 0.5*(pdf_E + pdf_W) - ui*ui*w_1_nine_half - w_1_indep ); - oddPart = omegaOdd*(0.5*(pdf_E - pdf_W) - ui*w_1_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_E + pdf_W) - ui*ui*w_1_nine_half - w_1_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_E - pdf_W) - ui*w_1_x3 ); src[I(index, D3Q19_W)] = pdf_E - evenPart - oddPart; src[I(index, D3Q19_E)] = pdf_W - evenPart + oddPart; ui = uz; - evenPart = omegaEven*( 0.5*(pdf_T + pdf_B) - ui*ui*w_1_nine_half - w_1_indep ); - oddPart = omegaOdd*(0.5*(pdf_T - pdf_B) - ui*w_1_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_T + pdf_B) - ui*ui*w_1_nine_half - w_1_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_T - pdf_B) - ui*w_1_x3 ); src[I(index, D3Q19_B)] = pdf_T - evenPart - oddPart; src[I(index, D3Q19_T)] = pdf_B - evenPart + oddPart; @@ -190,38 +195,38 @@ void 
FNAME(D3Q19ListAaKernel)(LatticeDesc * ld, KernelData * kernelData, CaseDat w_2_indep = w_2*dir_indep_trm; ui = -ux + uy; - evenPart = omegaEven*( 0.5*(pdf_NW + pdf_SE) - ui*ui*w_2_nine_half - w_2_indep ); - oddPart = omegaOdd*(0.5*(pdf_NW - pdf_SE) - ui*w_2_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_NW + pdf_SE) - ui*ui*w_2_nine_half - w_2_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_NW - pdf_SE) - ui*w_2_x3 ); src[I(index, D3Q19_SE)] = pdf_NW - evenPart - oddPart; src[I(index, D3Q19_NW)] = pdf_SE - evenPart + oddPart; ui = ux + uy; - evenPart = omegaEven*( 0.5*(pdf_NE + pdf_SW) - ui*ui*w_2_nine_half - w_2_indep ); - oddPart = omegaOdd*(0.5*(pdf_NE - pdf_SW) - ui*w_2_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_NE + pdf_SW) - ui*ui*w_2_nine_half - w_2_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_NE - pdf_SW) - ui*w_2_x3 ); src[I(index, D3Q19_SW)] = pdf_NE - evenPart - oddPart; src[I(index, D3Q19_NE)] = pdf_SW - evenPart + oddPart; ui = -ux + uz; - evenPart = omegaEven*( 0.5*(pdf_TW + pdf_BE) - ui*ui*w_2_nine_half - w_2_indep ); - oddPart = omegaOdd*(0.5*(pdf_TW - pdf_BE) - ui*w_2_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_TW + pdf_BE) - ui*ui*w_2_nine_half - w_2_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_TW - pdf_BE) - ui*w_2_x3 ); src[I(index, D3Q19_BE)] = pdf_TW - evenPart - oddPart; src[I(index, D3Q19_TW)] = pdf_BE - evenPart + oddPart; ui = ux + uz; - evenPart = omegaEven*( 0.5*(pdf_TE + pdf_BW) - ui*ui*w_2_nine_half - w_2_indep ); - oddPart = omegaOdd*(0.5*(pdf_TE - pdf_BW) - ui*w_2_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_TE + pdf_BW) - ui*ui*w_2_nine_half - w_2_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_TE - pdf_BW) - ui*w_2_x3 ); src[I(index, D3Q19_BW)] = pdf_TE - evenPart - oddPart; src[I(index, D3Q19_TE)] = pdf_BW - evenPart + oddPart; ui = -uy + uz; - evenPart = omegaEven*( 0.5*(pdf_TS + pdf_BN) - ui*ui*w_2_nine_half - w_2_indep ); - oddPart = omegaOdd*(0.5*(pdf_TS - pdf_BN) - ui*w_2_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_TS + pdf_BN) - 
ui*ui*w_2_nine_half - w_2_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_TS - pdf_BN) - ui*w_2_x3 ); src[I(index, D3Q19_BN)] = pdf_TS - evenPart - oddPart; src[I(index, D3Q19_TS)] = pdf_BN - evenPart + oddPart; ui = uy + uz; - evenPart = omegaEven*( 0.5*(pdf_TN + pdf_BS) - ui*ui*w_2_nine_half - w_2_indep ); - oddPart = omegaOdd*(0.5*(pdf_TN - pdf_BS) - ui*w_2_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_TN + pdf_BS) - ui*ui*w_2_nine_half - w_2_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_TN - pdf_BS) - ui*w_2_x3 ); src[I(index, D3Q19_BS)] = pdf_TN - evenPart - oddPart; src[I(index, D3Q19_TN)] = pdf_BS - evenPart + oddPart; @@ -250,6 +255,9 @@ void FNAME(D3Q19ListAaKernel)(LatticeDesc * ld, KernelData * kernelData, CaseDat pdf_NE, pdf_SE, pdf_SW, pdf_NW, \ pdf_T, pdf_TN, pdf_TE, pdf_TS, pdf_TW, \ pdf_B, pdf_BN, pdf_BE, pdf_BS, pdf_BW) +#endif +#ifdef INTEL_OPT_DIRECTIVES + #pragma ivdep #endif for (int index = 0; index < nFluid; ++index) { @@ -273,9 +281,9 @@ void FNAME(D3Q19ListAaKernel)(LatticeDesc * ld, KernelData * kernelData, CaseDat int z = kdl->Coords[C_INDEX_Z(index)]; if (z == nZ - 4 && x > 3 && x < (nX - 4) && y > 3 && y < (nY - 4)) { - ux = 0.1 * 0.577; - uy = 0.0; - uz = 0.0; + ux = F(0.1) * F(0.577); + uy = F(0.0); + uz = F(0.0); } else { #endif ux = pdf_E + pdf_NE + pdf_SE + pdf_TE + pdf_BE - @@ -294,7 +302,7 @@ void FNAME(D3Q19ListAaKernel)(LatticeDesc * ld, KernelData * kernelData, CaseDat pdf_T + pdf_TN + pdf_TE + pdf_TS + pdf_TW + pdf_B + pdf_BN + pdf_BE + pdf_BS + pdf_BW; - dir_indep_trm = dens - (ux * ux + uy * uy + uz * uz)*3.0/2.0; + dir_indep_trm = dens - (ux * ux + uy * uy + uz * uz)*F(3.0)/F(2.0); adjListIndex = index * N_D3Q19_IDX; @@ -305,20 +313,20 @@ void FNAME(D3Q19ListAaKernel)(LatticeDesc * ld, KernelData * kernelData, CaseDat w_1_indep = w_1*dir_indep_trm; ui = uy; - evenPart = omegaEven*( 0.5*(pdf_N + pdf_S) - ui*ui*w_1_nine_half - w_1_indep ); - oddPart = omegaOdd*(0.5*(pdf_N - pdf_S) - ui*w_1_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_N + 
pdf_S) - ui*ui*w_1_nine_half - w_1_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_N - pdf_S) - ui*w_1_x3 ); src[adjList[adjListIndex + D3Q19_N]] = pdf_N - evenPart - oddPart; src[adjList[adjListIndex + D3Q19_S]] = pdf_S - evenPart + oddPart; ui = ux; - evenPart = omegaEven*( 0.5*(pdf_E + pdf_W) - ui*ui*w_1_nine_half - w_1_indep ); - oddPart = omegaOdd*(0.5*(pdf_E - pdf_W) - ui*w_1_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_E + pdf_W) - ui*ui*w_1_nine_half - w_1_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_E - pdf_W) - ui*w_1_x3 ); src[adjList[adjListIndex + D3Q19_E]] = pdf_E - evenPart - oddPart; src[adjList[adjListIndex + D3Q19_W]] = pdf_W - evenPart + oddPart; ui = uz; - evenPart = omegaEven*( 0.5*(pdf_T + pdf_B) - ui*ui*w_1_nine_half - w_1_indep ); - oddPart = omegaOdd*(0.5*(pdf_T - pdf_B) - ui*w_1_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_T + pdf_B) - ui*ui*w_1_nine_half - w_1_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_T - pdf_B) - ui*w_1_x3 ); src[adjList[adjListIndex + D3Q19_T]] = pdf_T - evenPart - oddPart; src[adjList[adjListIndex + D3Q19_B]] = pdf_B - evenPart + oddPart; @@ -326,38 +334,38 @@ void FNAME(D3Q19ListAaKernel)(LatticeDesc * ld, KernelData * kernelData, CaseDat w_2_indep = w_2*dir_indep_trm; ui = -ux + uy; - evenPart = omegaEven*( 0.5*(pdf_NW + pdf_SE) - ui*ui*w_2_nine_half - w_2_indep ); - oddPart = omegaOdd*(0.5*(pdf_NW - pdf_SE) - ui*w_2_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_NW + pdf_SE) - ui*ui*w_2_nine_half - w_2_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_NW - pdf_SE) - ui*w_2_x3 ); src[adjList[adjListIndex + D3Q19_NW]] = pdf_NW - evenPart - oddPart; src[adjList[adjListIndex + D3Q19_SE]] = pdf_SE - evenPart + oddPart; ui = ux + uy; - evenPart = omegaEven*( 0.5*(pdf_NE + pdf_SW) - ui*ui*w_2_nine_half - w_2_indep ); - oddPart = omegaOdd*(0.5*(pdf_NE - pdf_SW) - ui*w_2_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_NE + pdf_SW) - ui*ui*w_2_nine_half - w_2_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_NE - pdf_SW) - ui*w_2_x3 ); 
src[adjList[adjListIndex + D3Q19_NE]] = pdf_NE - evenPart - oddPart; src[adjList[adjListIndex + D3Q19_SW]] = pdf_SW - evenPart + oddPart; ui = -ux + uz; - evenPart = omegaEven*( 0.5*(pdf_TW + pdf_BE) - ui*ui*w_2_nine_half - w_2_indep ); - oddPart = omegaOdd*(0.5*(pdf_TW - pdf_BE) - ui*w_2_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_TW + pdf_BE) - ui*ui*w_2_nine_half - w_2_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_TW - pdf_BE) - ui*w_2_x3 ); src[adjList[adjListIndex + D3Q19_TW]] = pdf_TW - evenPart - oddPart; src[adjList[adjListIndex + D3Q19_BE]] = pdf_BE - evenPart + oddPart; ui = ux + uz; - evenPart = omegaEven*( 0.5*(pdf_TE + pdf_BW) - ui*ui*w_2_nine_half - w_2_indep ); - oddPart = omegaOdd*(0.5*(pdf_TE - pdf_BW) - ui*w_2_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_TE + pdf_BW) - ui*ui*w_2_nine_half - w_2_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_TE - pdf_BW) - ui*w_2_x3 ); src[adjList[adjListIndex + D3Q19_TE]] = pdf_TE - evenPart - oddPart; src[adjList[adjListIndex + D3Q19_BW]] = pdf_BW - evenPart + oddPart; ui = -uy + uz; - evenPart = omegaEven*( 0.5*(pdf_TS + pdf_BN) - ui*ui*w_2_nine_half - w_2_indep ); - oddPart = omegaOdd*(0.5*(pdf_TS - pdf_BN) - ui*w_2_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_TS + pdf_BN) - ui*ui*w_2_nine_half - w_2_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_TS - pdf_BN) - ui*w_2_x3 ); src[adjList[adjListIndex + D3Q19_TS]] = pdf_TS - evenPart - oddPart; src[adjList[adjListIndex + D3Q19_BN]] = pdf_BN - evenPart + oddPart; ui = uy + uz; - evenPart = omegaEven*( 0.5*(pdf_TN + pdf_BS) - ui*ui*w_2_nine_half - w_2_indep ); - oddPart = omegaOdd*(0.5*(pdf_TN - pdf_BS) - ui*w_2_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_TN + pdf_BS) - ui*ui*w_2_nine_half - w_2_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_TN - pdf_BS) - ui*w_2_x3 ); src[adjList[adjListIndex + D3Q19_TN]] = pdf_TN - evenPart - oddPart; src[adjList[adjListIndex + D3Q19_BS]] = pdf_BS - evenPart + oddPart; diff --git a/src/BenchKernelD3Q19ListAaPv.c b/src/BenchKernelD3Q19ListAaPv.c index 
b1dee16..8ae0c2c 100644 --- a/src/BenchKernelD3Q19ListAaPv.c +++ b/src/BenchKernelD3Q19ListAaPv.c @@ -49,8 +49,8 @@ void FNAME(D3Q19ListAaPvKernel)(LatticeDesc * ld, KernelData * kernelData, CaseD Assert(kernelData != NULL); Assert(cd != NULL); - Assert(cd->Omega > 0.0); - Assert(cd->Omega < 2.0); + Assert(cd->Omega > F(0.0)); + Assert(cd->Omega < F(2.0)); #if defined(VTK_OUTPUT) || defined(STATISTICS) || defined(VERIFICATION) KernelData * kd = (KernelData *)kernelData; @@ -160,8 +160,8 @@ static void KernelEven(LatticeDesc * ld, KernelData * kernelData, CaseData * cd) Assert(kernelData != NULL); Assert(cd != NULL); - Assert(cd->Omega > 0.0); - Assert(cd->Omega < 2.0); + Assert(cd->Omega > F(0.0)); + Assert(cd->Omega < F(2.0)); KernelData * kd = (KernelData *)kernelData; KernelDataList * kdl = KDL(kernelData); @@ -170,19 +170,19 @@ static void KernelEven(LatticeDesc * ld, KernelData * kernelData, CaseData * cd) PdfT omega = cd->Omega; PdfT omegaEven = omega; - PdfT magicParam = 1.0 / 12.0; - PdfT omegaOdd = 1.0 / (0.5 + magicParam / (1.0 / omega - 0.5)); + PdfT magicParam = F(1.0) / F(12.0); + PdfT omegaOdd = F(1.0) / (F(0.5) + magicParam / (F(1.0) / omega - F(0.5))); - PdfT evenPart = 0.0; - PdfT oddPart = 0.0; - PdfT dir_indep_trm = 0.0; + PdfT evenPart = F(0.0); + PdfT oddPart = F(0.0); + PdfT dir_indep_trm = F(0.0); - const PdfT w_0 = 1.0 / 3.0; - const PdfT w_1 = 1.0 / 18.0; - const PdfT w_2 = 1.0 / 36.0; + const PdfT w_0 = F(1.0) / F( 3.0); + const PdfT w_1 = F(1.0) / F(18.0); + const PdfT w_2 = F(1.0) / F(36.0); - const PdfT w_1_x3 = w_1 * 3.0; const PdfT w_1_nine_half = w_1 * 9.0 / 2.0; PdfT w_1_indep = 0.0; - const PdfT w_2_x3 = w_2 * 3.0; const PdfT w_2_nine_half = w_2 * 9.0 / 2.0; PdfT w_2_indep = 0.0; + const PdfT w_1_x3 = w_1 * F(3.0); const PdfT w_1_nine_half = w_1 * F(9.0) / F(2.0); PdfT w_1_indep = F(0.0); + const PdfT w_2_x3 = w_2 * F(3.0); const PdfT w_2_nine_half = w_2 * F(9.0) / F(2.0); PdfT w_2_indep = F(0.0); PdfT ui; @@ -190,8 +190,8 @@ 
static void KernelEven(LatticeDesc * ld, KernelData * kernelData, CaseData * cd) PdfT dens; - VPDFT VONE_HALF = VSET(0.5); - VPDFT VTHREE_HALF = VSET(3.0 / 2.0); + VPDFT VONE_HALF = VSET(F(0.5)); + VPDFT VTHREE_HALF = VSET(F(3.0) / F(2.0)); VPDFT vw_1_indep, vw_2_indep; VPDFT vw_0 = VSET(w_0); @@ -393,7 +393,7 @@ static void KernelEven(LatticeDesc * ld, KernelData * kernelData, CaseData * cd) pdf_T + pdf_TN + pdf_TE + pdf_TS + pdf_TW + pdf_B + pdf_BN + pdf_BE + pdf_BS + pdf_BW; - dir_indep_trm = dens - (ux * ux + uy * uy + uz * uz)*3.0/2.0; + dir_indep_trm = dens - (ux * ux + uy * uy + uz * uz) * F(3.0) / F(2.0); // direction: w_0 src[I(index, D3Q19_C) ] = pdf_C - omegaEven*(pdf_C - w_0*dir_indep_trm); @@ -402,20 +402,20 @@ static void KernelEven(LatticeDesc * ld, KernelData * kernelData, CaseData * cd) w_1_indep = w_1*dir_indep_trm; ui = uy; - evenPart = omegaEven*( 0.5*(pdf_N + pdf_S) - ui*ui*w_1_nine_half - w_1_indep ); - oddPart = omegaOdd*(0.5*(pdf_N - pdf_S) - ui*w_1_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_N + pdf_S) - ui*ui*w_1_nine_half - w_1_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_N - pdf_S) - ui*w_1_x3 ); src[I(index, D3Q19_S)] = pdf_N - evenPart - oddPart; src[I(index, D3Q19_N)] = pdf_S - evenPart + oddPart; ui = ux; - evenPart = omegaEven*( 0.5*(pdf_E + pdf_W) - ui*ui*w_1_nine_half - w_1_indep ); - oddPart = omegaOdd*(0.5*(pdf_E - pdf_W) - ui*w_1_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_E + pdf_W) - ui*ui*w_1_nine_half - w_1_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_E - pdf_W) - ui*w_1_x3 ); src[I(index, D3Q19_W)] = pdf_E - evenPart - oddPart; src[I(index, D3Q19_E)] = pdf_W - evenPart + oddPart; ui = uz; - evenPart = omegaEven*( 0.5*(pdf_T + pdf_B) - ui*ui*w_1_nine_half - w_1_indep ); - oddPart = omegaOdd*(0.5*(pdf_T - pdf_B) - ui*w_1_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_T + pdf_B) - ui*ui*w_1_nine_half - w_1_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_T - pdf_B) - ui*w_1_x3 ); src[I(index, D3Q19_B)] = pdf_T - evenPart - oddPart; 
src[I(index, D3Q19_T)] = pdf_B - evenPart + oddPart; @@ -423,38 +423,38 @@ static void KernelEven(LatticeDesc * ld, KernelData * kernelData, CaseData * cd) w_2_indep = w_2*dir_indep_trm; ui = -ux + uy; - evenPart = omegaEven*( 0.5*(pdf_NW + pdf_SE) - ui*ui*w_2_nine_half - w_2_indep ); - oddPart = omegaOdd*(0.5*(pdf_NW - pdf_SE) - ui*w_2_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_NW + pdf_SE) - ui*ui*w_2_nine_half - w_2_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_NW - pdf_SE) - ui*w_2_x3 ); src[I(index, D3Q19_SE)] = pdf_NW - evenPart - oddPart; src[I(index, D3Q19_NW)] = pdf_SE - evenPart + oddPart; ui = ux + uy; - evenPart = omegaEven*( 0.5*(pdf_NE + pdf_SW) - ui*ui*w_2_nine_half - w_2_indep ); - oddPart = omegaOdd*(0.5*(pdf_NE - pdf_SW) - ui*w_2_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_NE + pdf_SW) - ui*ui*w_2_nine_half - w_2_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_NE - pdf_SW) - ui*w_2_x3 ); src[I(index, D3Q19_SW)] = pdf_NE - evenPart - oddPart; src[I(index, D3Q19_NE)] = pdf_SW - evenPart + oddPart; ui = -ux + uz; - evenPart = omegaEven*( 0.5*(pdf_TW + pdf_BE) - ui*ui*w_2_nine_half - w_2_indep ); - oddPart = omegaOdd*(0.5*(pdf_TW - pdf_BE) - ui*w_2_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_TW + pdf_BE) - ui*ui*w_2_nine_half - w_2_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_TW - pdf_BE) - ui*w_2_x3 ); src[I(index, D3Q19_BE)] = pdf_TW - evenPart - oddPart; src[I(index, D3Q19_TW)] = pdf_BE - evenPart + oddPart; ui = ux + uz; - evenPart = omegaEven*( 0.5*(pdf_TE + pdf_BW) - ui*ui*w_2_nine_half - w_2_indep ); - oddPart = omegaOdd*(0.5*(pdf_TE - pdf_BW) - ui*w_2_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_TE + pdf_BW) - ui*ui*w_2_nine_half - w_2_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_TE - pdf_BW) - ui*w_2_x3 ); src[I(index, D3Q19_BW)] = pdf_TE - evenPart - oddPart; src[I(index, D3Q19_TE)] = pdf_BW - evenPart + oddPart; ui = -uy + uz; - evenPart = omegaEven*( 0.5*(pdf_TS + pdf_BN) - ui*ui*w_2_nine_half - w_2_indep ); - oddPart = omegaOdd*(0.5*(pdf_TS - pdf_BN) - 
ui*w_2_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_TS + pdf_BN) - ui*ui*w_2_nine_half - w_2_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_TS - pdf_BN) - ui*w_2_x3 ); src[I(index, D3Q19_BN)] = pdf_TS - evenPart - oddPart; src[I(index, D3Q19_TS)] = pdf_BN - evenPart + oddPart; ui = uy + uz; - evenPart = omegaEven*( 0.5*(pdf_TN + pdf_BS) - ui*ui*w_2_nine_half - w_2_indep ); - oddPart = omegaOdd*(0.5*(pdf_TN - pdf_BS) - ui*w_2_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_TN + pdf_BS) - ui*ui*w_2_nine_half - w_2_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_TN - pdf_BS) - ui*w_2_x3 ); src[I(index, D3Q19_BS)] = pdf_TN - evenPart - oddPart; src[I(index, D3Q19_TN)] = pdf_BS - evenPart + oddPart; @@ -472,8 +472,8 @@ static void KernelOdd(LatticeDesc * ld, KernelData * kernelData, CaseData * cd) Assert(kernelData != NULL); Assert(cd != NULL); - Assert(cd->Omega > 0.0); - Assert(cd->Omega < 2.0); + Assert(cd->Omega > F(0.0)); + Assert(cd->Omega < F(2.0)); KernelData * kd = (KernelData *)kernelData; KernelDataList * kdl = KDL(kernelData); @@ -482,19 +482,19 @@ static void KernelOdd(LatticeDesc * ld, KernelData * kernelData, CaseData * cd) PdfT omega = cd->Omega; PdfT omegaEven = omega; - PdfT magicParam = 1.0 / 12.0; - PdfT omegaOdd = 1.0 / (0.5 + magicParam / (1.0 / omega - 0.5)); + PdfT magicParam = F(1.0) / F(12.0); + PdfT omegaOdd = F(1.0) / (F(0.5) + magicParam / (F(1.0) / omega - F(0.5))); - PdfT evenPart = 0.0; - PdfT oddPart = 0.0; - PdfT dir_indep_trm = 0.0; + PdfT evenPart = F(0.0); + PdfT oddPart = F(0.0); + PdfT dir_indep_trm = F(0.0); - const PdfT w_0 = 1.0 / 3.0; - const PdfT w_1 = 1.0 / 18.0; - const PdfT w_2 = 1.0 / 36.0; + const PdfT w_0 = F(1.0) / F( 3.0); + const PdfT w_1 = F(1.0) / F(18.0); + const PdfT w_2 = F(1.0) / F(36.0); - const PdfT w_1_x3 = w_1 * 3.0; const PdfT w_1_nine_half = w_1 * 9.0 / 2.0; PdfT w_1_indep = 0.0; - const PdfT w_2_x3 = w_2 * 3.0; const PdfT w_2_nine_half = w_2 * 9.0 / 2.0; PdfT w_2_indep = 0.0; + const PdfT w_1_x3 = w_1 * F(3.0); const 
PdfT w_1_nine_half = w_1 * F(9.0) / F(2.0); PdfT w_1_indep = F(0.0); + const PdfT w_2_x3 = w_2 * F(3.0); const PdfT w_2_nine_half = w_2 * F(9.0) / F(2.0); PdfT w_2_indep = F(0.0); PdfT ui; @@ -502,8 +502,8 @@ static void KernelOdd(LatticeDesc * ld, KernelData * kernelData, CaseData * cd) PdfT dens; - VPDFT VONE_HALF = VSET(0.5); - VPDFT VTHREE_HALF = VSET(3.0 / 2.0); + VPDFT VONE_HALF = VSET(F(0.5)); + VPDFT VTHREE_HALF = VSET(F(3.0) / F(2.0)); VPDFT vw_1_indep, vw_2_indep; VPDFT vw_0 = VSET(w_0); @@ -770,7 +770,7 @@ static void KernelOdd(LatticeDesc * ld, KernelData * kernelData, CaseData * cd) pdf_T + pdf_TN + pdf_TE + pdf_TS + pdf_TW + pdf_B + pdf_BN + pdf_BE + pdf_BS + pdf_BW; - dir_indep_trm = dens - (ux * ux + uy * uy + uz * uz)*3.0/2.0; + dir_indep_trm = dens - (ux * ux + uy * uy + uz * uz) * F(3.0) / F(2.0); adjListIndex = index * N_D3Q19_IDX; @@ -781,20 +781,20 @@ static void KernelOdd(LatticeDesc * ld, KernelData * kernelData, CaseData * cd) w_1_indep = w_1 * dir_indep_trm; ui = uy; - evenPart = omegaEven * (0.5 * (pdf_N + pdf_S) - ui * ui * w_1_nine_half - w_1_indep); - oddPart = omegaOdd * (0.5 * (pdf_N - pdf_S) - ui * w_1_x3); + evenPart = omegaEven * (F(0.5) * (pdf_N + pdf_S) - ui * ui * w_1_nine_half - w_1_indep); + oddPart = omegaOdd * (F(0.5) * (pdf_N - pdf_S) - ui * w_1_x3); *ppdf_S = pdf_N - evenPart - oddPart; *ppdf_N = pdf_S - evenPart + oddPart; ui = ux; - evenPart = omegaEven * (0.5 * (pdf_E + pdf_W) - ui * ui * w_1_nine_half - w_1_indep); - oddPart = omegaOdd * (0.5 * (pdf_E - pdf_W) - ui * w_1_x3); + evenPart = omegaEven * (F(0.5) * (pdf_E + pdf_W) - ui * ui * w_1_nine_half - w_1_indep); + oddPart = omegaOdd * (F(0.5) * (pdf_E - pdf_W) - ui * w_1_x3); *ppdf_W = pdf_E - evenPart - oddPart; *ppdf_E = pdf_W - evenPart + oddPart; ui = uz; - evenPart = omegaEven * (0.5 * (pdf_T + pdf_B) - ui * ui * w_1_nine_half - w_1_indep); - oddPart = omegaOdd * (0.5 * (pdf_T - pdf_B) - ui * w_1_x3); + evenPart = omegaEven * (F(0.5) * (pdf_T + pdf_B) - ui * 
ui * w_1_nine_half - w_1_indep); + oddPart = omegaOdd * (F(0.5) * (pdf_T - pdf_B) - ui * w_1_x3); *ppdf_B = pdf_T - evenPart - oddPart; *ppdf_T = pdf_B - evenPart + oddPart; @@ -802,38 +802,38 @@ static void KernelOdd(LatticeDesc * ld, KernelData * kernelData, CaseData * cd) w_2_indep = w_2 * dir_indep_trm; ui = -ux + uy; - evenPart = omegaEven * (0.5 * (pdf_NW + pdf_SE) - ui * ui * w_2_nine_half - w_2_indep); - oddPart = omegaOdd * (0.5 * (pdf_NW - pdf_SE) - ui * w_2_x3); + evenPart = omegaEven * (F(0.5) * (pdf_NW + pdf_SE) - ui * ui * w_2_nine_half - w_2_indep); + oddPart = omegaOdd * (F(0.5) * (pdf_NW - pdf_SE) - ui * w_2_x3); *ppdf_SE = pdf_NW - evenPart - oddPart; *ppdf_NW = pdf_SE - evenPart + oddPart; ui = ux + uy; - evenPart = omegaEven * (0.5 * (pdf_NE + pdf_SW) - ui * ui * w_2_nine_half - w_2_indep); - oddPart = omegaOdd * (0.5 * (pdf_NE - pdf_SW) - ui * w_2_x3); + evenPart = omegaEven * (F(0.5) * (pdf_NE + pdf_SW) - ui * ui * w_2_nine_half - w_2_indep); + oddPart = omegaOdd * (F(0.5) * (pdf_NE - pdf_SW) - ui * w_2_x3); *ppdf_SW = pdf_NE - evenPart - oddPart; *ppdf_NE = pdf_SW - evenPart + oddPart; ui = -ux + uz; - evenPart = omegaEven * (0.5 * (pdf_TW + pdf_BE) - ui * ui * w_2_nine_half - w_2_indep); - oddPart = omegaOdd * (0.5 * (pdf_TW - pdf_BE) - ui * w_2_x3); + evenPart = omegaEven * (F(0.5) * (pdf_TW + pdf_BE) - ui * ui * w_2_nine_half - w_2_indep); + oddPart = omegaOdd * (F(0.5) * (pdf_TW - pdf_BE) - ui * w_2_x3); *ppdf_BE = pdf_TW - evenPart - oddPart; *ppdf_TW = pdf_BE - evenPart + oddPart; ui = ux + uz; - evenPart = omegaEven * (0.5 * (pdf_TE + pdf_BW) - ui * ui * w_2_nine_half - w_2_indep); - oddPart = omegaOdd * (0.5 * (pdf_TE - pdf_BW) - ui * w_2_x3); + evenPart = omegaEven * (F(0.5) * (pdf_TE + pdf_BW) - ui * ui * w_2_nine_half - w_2_indep); + oddPart = omegaOdd * (F(0.5) * (pdf_TE - pdf_BW) - ui * w_2_x3); *ppdf_BW = pdf_TE - evenPart - oddPart; *ppdf_TE = pdf_BW - evenPart + oddPart; ui = -uy + uz; - evenPart = omegaEven * (0.5 * (pdf_TS + 
pdf_BN) - ui * ui * w_2_nine_half - w_2_indep); - oddPart = omegaOdd * (0.5 * (pdf_TS - pdf_BN) - ui * w_2_x3); + evenPart = omegaEven * (F(0.5) * (pdf_TS + pdf_BN) - ui * ui * w_2_nine_half - w_2_indep); + oddPart = omegaOdd * (F(0.5) * (pdf_TS - pdf_BN) - ui * w_2_x3); *ppdf_BN = pdf_TS - evenPart - oddPart; *ppdf_TS = pdf_BN - evenPart + oddPart; ui = uy + uz; - evenPart = omegaEven * (0.5 * (pdf_TN + pdf_BS) - ui * ui * w_2_nine_half - w_2_indep); - oddPart = omegaOdd * (0.5 * (pdf_TN - pdf_BS) - ui * w_2_x3); + evenPart = omegaEven * (F(0.5) * (pdf_TN + pdf_BS) - ui * ui * w_2_nine_half - w_2_indep); + oddPart = omegaOdd * (F(0.5) * (pdf_TN - pdf_BS) - ui * w_2_x3); *ppdf_BS = pdf_TN - evenPart - oddPart; *ppdf_TN = pdf_BS - evenPart + oddPart; diff --git a/src/BenchKernelD3Q19ListAaRia.c b/src/BenchKernelD3Q19ListAaRia.c index 87addcc..245c2a5 100644 --- a/src/BenchKernelD3Q19ListAaRia.c +++ b/src/BenchKernelD3Q19ListAaRia.c @@ -55,19 +55,19 @@ void FNAME(D3Q19ListAaRiaKernel)(LatticeDesc * ld, KernelData * kernelData, Case // 1/12: removes third-order advection error (best advection); // 1/ 6: removes fourth-order diffusion error (best diffusion); // 3/16: exact location of bounce back for poiseuille flow - PdfT magicParam = 1.0 / 12.0; - PdfT omegaOdd = 1.0 / (0.5 + magicParam / (1.0 / omega - 0.5)); + PdfT magicParam = F(1.0) / F(12.0); + PdfT omegaOdd = F(1.0) / (F(0.5) + magicParam / (F(1.0) / omega - F(0.5))); - PdfT evenPart = 0.0; - PdfT oddPart = 0.0; - PdfT dir_indep_trm = 0.0; + PdfT evenPart = F(0.0); + PdfT oddPart = F(0.0); + PdfT dir_indep_trm = F(0.0); - const PdfT w_0 = 1.0 / 3.0; - const PdfT w_1 = 1.0 / 18.0; - const PdfT w_2 = 1.0 / 36.0; + const PdfT w_0 = F(1.0) / F(3.0); + const PdfT w_1 = F(1.0) / F(18.0); + const PdfT w_2 = F(1.0) / F(36.0); - const PdfT w_1_x3 = w_1 * 3.0; const PdfT w_1_nine_half = w_1 * 9.0 / 2.0; PdfT w_1_indep = 0.0; - const PdfT w_2_x3 = w_2 * 3.0; const PdfT w_2_nine_half = w_2 * 9.0 / 2.0; PdfT w_2_indep = 
0.0; + const PdfT w_1_x3 = w_1 * F(3.0); const PdfT w_1_nine_half = w_1 * F(9.0) / F(2.0); PdfT w_1_indep = F(0.0); + const PdfT w_2_x3 = w_2 * F(3.0); const PdfT w_2_nine_half = w_2 * F(9.0) / F(2.0); PdfT w_2_indep = F(0.0); PdfT ui; @@ -134,6 +134,11 @@ void FNAME(D3Q19ListAaRiaKernel)(LatticeDesc * ld, KernelData * kernelData, Case pdf_T, pdf_TN, pdf_TE, pdf_TS, pdf_TW, \ pdf_B, pdf_BN, pdf_BE, pdf_BS, pdf_BW) #endif + #ifdef INTEL_OPT_DIRECTIVES + #pragma ivdep + #pragma vector always + #pragma simd + #endif for (int index = 0; index < nFluid; ++index) { #define I(index, dir) P_INDEX_3((nCells), (index), (dir)) @@ -154,9 +159,9 @@ void FNAME(D3Q19ListAaRiaKernel)(LatticeDesc * ld, KernelData * kernelData, Case int z = kdl->Coords[C_INDEX_Z(index)]; if (z == nZ - 4 && x > 3 && x < (nX - 4) && y > 3 && y < (nY - 4)) { - ux = 0.1 * 0.577; - uy = 0.0; - uz = 0.0; + ux = F(0.1) * F(0.577); + uy = F(0.0); + uz = F(0.0); } else { #endif ux = pdf_E + pdf_NE + pdf_SE + pdf_TE + pdf_BE - @@ -175,7 +180,7 @@ void FNAME(D3Q19ListAaRiaKernel)(LatticeDesc * ld, KernelData * kernelData, Case pdf_T + pdf_TN + pdf_TE + pdf_TS + pdf_TW + pdf_B + pdf_BN + pdf_BE + pdf_BS + pdf_BW; - dir_indep_trm = dens - (ux * ux + uy * uy + uz * uz)*3.0/2.0; + dir_indep_trm = dens - (ux * ux + uy * uy + uz * uz)*F(3.0)/F(2.0); // direction: w_0 src[I(index, D3Q19_C) ] = pdf_C - omegaEven*(pdf_C - w_0*dir_indep_trm); @@ -184,20 +189,20 @@ void FNAME(D3Q19ListAaRiaKernel)(LatticeDesc * ld, KernelData * kernelData, Case w_1_indep = w_1*dir_indep_trm; ui = uy; - evenPart = omegaEven*( 0.5*(pdf_N + pdf_S) - ui*ui*w_1_nine_half - w_1_indep ); - oddPart = omegaOdd*(0.5*(pdf_N - pdf_S) - ui*w_1_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_N + pdf_S) - ui*ui*w_1_nine_half - w_1_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_N - pdf_S) - ui*w_1_x3 ); src[I(index, D3Q19_S)] = pdf_N - evenPart - oddPart; src[I(index, D3Q19_N)] = pdf_S - evenPart + oddPart; ui = ux; - evenPart = omegaEven*( 0.5*(pdf_E + pdf_W) - 
ui*ui*w_1_nine_half - w_1_indep ); - oddPart = omegaOdd*(0.5*(pdf_E - pdf_W) - ui*w_1_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_E + pdf_W) - ui*ui*w_1_nine_half - w_1_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_E - pdf_W) - ui*w_1_x3 ); src[I(index, D3Q19_W)] = pdf_E - evenPart - oddPart; src[I(index, D3Q19_E)] = pdf_W - evenPart + oddPart; ui = uz; - evenPart = omegaEven*( 0.5*(pdf_T + pdf_B) - ui*ui*w_1_nine_half - w_1_indep ); - oddPart = omegaOdd*(0.5*(pdf_T - pdf_B) - ui*w_1_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_T + pdf_B) - ui*ui*w_1_nine_half - w_1_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_T - pdf_B) - ui*w_1_x3 ); src[I(index, D3Q19_B)] = pdf_T - evenPart - oddPart; src[I(index, D3Q19_T)] = pdf_B - evenPart + oddPart; @@ -205,38 +210,38 @@ void FNAME(D3Q19ListAaRiaKernel)(LatticeDesc * ld, KernelData * kernelData, Case w_2_indep = w_2*dir_indep_trm; ui = -ux + uy; - evenPart = omegaEven*( 0.5*(pdf_NW + pdf_SE) - ui*ui*w_2_nine_half - w_2_indep ); - oddPart = omegaOdd*(0.5*(pdf_NW - pdf_SE) - ui*w_2_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_NW + pdf_SE) - ui*ui*w_2_nine_half - w_2_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_NW - pdf_SE) - ui*w_2_x3 ); src[I(index, D3Q19_SE)] = pdf_NW - evenPart - oddPart; src[I(index, D3Q19_NW)] = pdf_SE - evenPart + oddPart; ui = ux + uy; - evenPart = omegaEven*( 0.5*(pdf_NE + pdf_SW) - ui*ui*w_2_nine_half - w_2_indep ); - oddPart = omegaOdd*(0.5*(pdf_NE - pdf_SW) - ui*w_2_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_NE + pdf_SW) - ui*ui*w_2_nine_half - w_2_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_NE - pdf_SW) - ui*w_2_x3 ); src[I(index, D3Q19_SW)] = pdf_NE - evenPart - oddPart; src[I(index, D3Q19_NE)] = pdf_SW - evenPart + oddPart; ui = -ux + uz; - evenPart = omegaEven*( 0.5*(pdf_TW + pdf_BE) - ui*ui*w_2_nine_half - w_2_indep ); - oddPart = omegaOdd*(0.5*(pdf_TW - pdf_BE) - ui*w_2_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_TW + pdf_BE) - ui*ui*w_2_nine_half - w_2_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_TW - 
pdf_BE) - ui*w_2_x3 ); src[I(index, D3Q19_BE)] = pdf_TW - evenPart - oddPart; src[I(index, D3Q19_TW)] = pdf_BE - evenPart + oddPart; ui = ux + uz; - evenPart = omegaEven*( 0.5*(pdf_TE + pdf_BW) - ui*ui*w_2_nine_half - w_2_indep ); - oddPart = omegaOdd*(0.5*(pdf_TE - pdf_BW) - ui*w_2_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_TE + pdf_BW) - ui*ui*w_2_nine_half - w_2_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_TE - pdf_BW) - ui*w_2_x3 ); src[I(index, D3Q19_BW)] = pdf_TE - evenPart - oddPart; src[I(index, D3Q19_TE)] = pdf_BW - evenPart + oddPart; ui = -uy + uz; - evenPart = omegaEven*( 0.5*(pdf_TS + pdf_BN) - ui*ui*w_2_nine_half - w_2_indep ); - oddPart = omegaOdd*(0.5*(pdf_TS - pdf_BN) - ui*w_2_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_TS + pdf_BN) - ui*ui*w_2_nine_half - w_2_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_TS - pdf_BN) - ui*w_2_x3 ); src[I(index, D3Q19_BN)] = pdf_TS - evenPart - oddPart; src[I(index, D3Q19_TS)] = pdf_BN - evenPart + oddPart; ui = uy + uz; - evenPart = omegaEven*( 0.5*(pdf_TN + pdf_BS) - ui*ui*w_2_nine_half - w_2_indep ); - oddPart = omegaOdd*(0.5*(pdf_TN - pdf_BS) - ui*w_2_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_TN + pdf_BS) - ui*ui*w_2_nine_half - w_2_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_TN - pdf_BS) - ui*w_2_x3 ); src[I(index, D3Q19_BS)] = pdf_TN - evenPart - oddPart; src[I(index, D3Q19_TN)] = pdf_BS - evenPart + oddPart; @@ -290,6 +295,7 @@ void FNAME(D3Q19ListAaRiaKernel)(LatticeDesc * ld, KernelData * kernelData, Case int indexStart = threadIndices[threadId]; int indexStop = threadIndices[threadId] + nFluidThread; + // Because of runlength coding iterations are not independent. 
for (int index = indexStart; index < indexStop; ++index) { #define I(index, dir) P_INDEX_3((nCells), (index), (dir)) @@ -346,9 +352,9 @@ void FNAME(D3Q19ListAaRiaKernel)(LatticeDesc * ld, KernelData * kernelData, Case int z = kdl->Coords[C_INDEX_Z(index)]; if (z == nZ - 4 && x > 3 && x < (nX - 4) && y > 3 && y < (nY - 4)) { - ux = 0.1 * 0.577; - uy = 0.0; - uz = 0.0; + ux = F(0.1) * F(0.577); + uy = F(0.0); + uz = F(0.0); } else { #endif ux = pdf_E + pdf_NE + pdf_SE + pdf_TE + pdf_BE - @@ -367,7 +373,7 @@ void FNAME(D3Q19ListAaRiaKernel)(LatticeDesc * ld, KernelData * kernelData, Case pdf_T + pdf_TN + pdf_TE + pdf_TS + pdf_TW + pdf_B + pdf_BN + pdf_BE + pdf_BS + pdf_BW; - dir_indep_trm = dens - (ux * ux + uy * uy + uz * uz)*3.0/2.0; + dir_indep_trm = dens - (ux * ux + uy * uy + uz * uz)*F(3.0)/F(2.0); adjListIndex = index * N_D3Q19_IDX; @@ -378,20 +384,20 @@ void FNAME(D3Q19ListAaRiaKernel)(LatticeDesc * ld, KernelData * kernelData, Case w_1_indep = w_1*dir_indep_trm; ui = uy; - evenPart = omegaEven*( 0.5*(pdf_N + pdf_S) - ui*ui*w_1_nine_half - w_1_indep ); - oddPart = omegaOdd*(0.5*(pdf_N - pdf_S) - ui*w_1_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_N + pdf_S) - ui*ui*w_1_nine_half - w_1_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_N - pdf_S) - ui*w_1_x3 ); *ppdf_S = pdf_N - evenPart - oddPart; *ppdf_N = pdf_S - evenPart + oddPart; ui = ux; - evenPart = omegaEven*( 0.5*(pdf_E + pdf_W) - ui*ui*w_1_nine_half - w_1_indep ); - oddPart = omegaOdd*(0.5*(pdf_E - pdf_W) - ui*w_1_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_E + pdf_W) - ui*ui*w_1_nine_half - w_1_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_E - pdf_W) - ui*w_1_x3 ); *ppdf_W = pdf_E - evenPart - oddPart; *ppdf_E = pdf_W - evenPart + oddPart; ui = uz; - evenPart = omegaEven*( 0.5*(pdf_T + pdf_B) - ui*ui*w_1_nine_half - w_1_indep ); - oddPart = omegaOdd*(0.5*(pdf_T - pdf_B) - ui*w_1_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_T + pdf_B) - ui*ui*w_1_nine_half - w_1_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_T - pdf_B) 
- ui*w_1_x3 ); *ppdf_B = pdf_T - evenPart - oddPart; *ppdf_T = pdf_B - evenPart + oddPart; @@ -399,38 +405,38 @@ void FNAME(D3Q19ListAaRiaKernel)(LatticeDesc * ld, KernelData * kernelData, Case w_2_indep = w_2*dir_indep_trm; ui = -ux + uy; - evenPart = omegaEven*( 0.5*(pdf_NW + pdf_SE) - ui*ui*w_2_nine_half - w_2_indep ); - oddPart = omegaOdd*(0.5*(pdf_NW - pdf_SE) - ui*w_2_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_NW + pdf_SE) - ui*ui*w_2_nine_half - w_2_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_NW - pdf_SE) - ui*w_2_x3 ); *ppdf_SE = pdf_NW - evenPart - oddPart; *ppdf_NW = pdf_SE - evenPart + oddPart; ui = ux + uy; - evenPart = omegaEven*( 0.5*(pdf_NE + pdf_SW) - ui*ui*w_2_nine_half - w_2_indep ); - oddPart = omegaOdd*(0.5*(pdf_NE - pdf_SW) - ui*w_2_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_NE + pdf_SW) - ui*ui*w_2_nine_half - w_2_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_NE - pdf_SW) - ui*w_2_x3 ); *ppdf_SW = pdf_NE - evenPart - oddPart; *ppdf_NE = pdf_SW - evenPart + oddPart; ui = -ux + uz; - evenPart = omegaEven*( 0.5*(pdf_TW + pdf_BE) - ui*ui*w_2_nine_half - w_2_indep ); - oddPart = omegaOdd*(0.5*(pdf_TW - pdf_BE) - ui*w_2_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_TW + pdf_BE) - ui*ui*w_2_nine_half - w_2_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_TW - pdf_BE) - ui*w_2_x3 ); *ppdf_BE = pdf_TW - evenPart - oddPart; *ppdf_TW = pdf_BE - evenPart + oddPart; ui = ux + uz; - evenPart = omegaEven*( 0.5*(pdf_TE + pdf_BW) - ui*ui*w_2_nine_half - w_2_indep ); - oddPart = omegaOdd*(0.5*(pdf_TE - pdf_BW) - ui*w_2_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_TE + pdf_BW) - ui*ui*w_2_nine_half - w_2_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_TE - pdf_BW) - ui*w_2_x3 ); *ppdf_BW = pdf_TE - evenPart - oddPart; *ppdf_TE = pdf_BW - evenPart + oddPart; ui = -uy + uz; - evenPart = omegaEven*( 0.5*(pdf_TS + pdf_BN) - ui*ui*w_2_nine_half - w_2_indep ); - oddPart = omegaOdd*(0.5*(pdf_TS - pdf_BN) - ui*w_2_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_TS + pdf_BN) - 
ui*ui*w_2_nine_half - w_2_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_TS - pdf_BN) - ui*w_2_x3 ); *ppdf_BN = pdf_TS - evenPart - oddPart; *ppdf_TS = pdf_BN - evenPart + oddPart; ui = uy + uz; - evenPart = omegaEven*( 0.5*(pdf_TN + pdf_BS) - ui*ui*w_2_nine_half - w_2_indep ); - oddPart = omegaOdd*(0.5*(pdf_TN - pdf_BS) - ui*w_2_x3 ); + evenPart = omegaEven*( F(0.5)*(pdf_TN + pdf_BS) - ui*ui*w_2_nine_half - w_2_indep ); + oddPart = omegaOdd*(F(0.5)*(pdf_TN - pdf_BS) - ui*w_2_x3 ); *ppdf_BS = pdf_TN - evenPart - oddPart; *ppdf_TN = pdf_BS - evenPart + oddPart; diff --git a/src/BenchKernelD3Q19ListPullSplitNt.c b/src/BenchKernelD3Q19ListPullSplitNt.c index 0132dc9..f617406 100644 --- a/src/BenchKernelD3Q19ListPullSplitNt.c +++ b/src/BenchKernelD3Q19ListPullSplitNt.c @@ -55,8 +55,8 @@ void FNAME(KernelPullSplitNt1S)(LatticeDesc * ld, KernelData * kernelData, CaseD Assert(kernelData != NULL); Assert(cd != NULL); - Assert(cd->Omega > 0.0); - Assert(cd->Omega < 2.0); + Assert(cd->Omega > F(0.0)); + Assert(cd->Omega < F(2.0)); KernelData * kd = (KernelData *)kernelData; KernelDataList * kdl = KDL(kernelData); @@ -65,16 +65,16 @@ void FNAME(KernelPullSplitNt1S)(LatticeDesc * ld, KernelData * kernelData, CaseD PdfT omega = cd->Omega; const PdfT omegaEven = omega; - PdfT magicParam = 1.0 / 12.0; - const PdfT omegaOdd = 1.0 / (0.5 + magicParam / (1.0 / omega - 0.5)); + PdfT magicParam = F(1.0) / F(12.0); + const PdfT omegaOdd = F(1.0) / (F(0.5) + magicParam / (F(1.0) / omega - F(0.5))); - const PdfT w_0 = 1.0 / 3.0; - const PdfT w_1 = 1.0 / 18.0; - const PdfT w_2 = 1.0 / 36.0; + const PdfT w_0 = F(1.0) / F( 3.0); + const PdfT w_1 = F(1.0) / F(18.0); + const PdfT w_2 = F(1.0) / F(36.0); - const PdfT w_1_x3 = w_1 * 3.0; const PdfT w_1_nine_half = w_1 * 9.0 / 2.0; - const PdfT w_2_x3 = w_2 * 3.0; const PdfT w_2_nine_half = w_2 * 9.0 / 2.0; + const PdfT w_1_x3 = w_1 * F(3.0); const PdfT w_1_nine_half = w_1 * F(9.0) / F(2.0); + const PdfT w_2_x3 = w_2 * F(3.0); const PdfT 
w_2_nine_half = w_2 * F(9.0) / F(2.0); const VPDFT vw_1_x3 = VSET(w_1_x3); const VPDFT vw_2_x3 = VSET(w_2_x3); @@ -85,7 +85,7 @@ void FNAME(KernelPullSplitNt1S)(LatticeDesc * ld, KernelData * kernelData, CaseD const VPDFT vomegaEven = VSET(omegaEven); const VPDFT vomegaOdd = VSET(omegaOdd); - const VPDFT voneHalf = VSET(0.5); + const VPDFT voneHalf = VSET(F(0.5)); // uint32_t nConsecNodes = kdlr->nConsecNodes; // uint32_t * consecNodes = kdlr->ConsecNodes; @@ -266,8 +266,8 @@ void FNAME(KernelPullSplitNt2S)(LatticeDesc * ld, KernelData * kernelData, CaseD Assert(kernelData != NULL); Assert(cd != NULL); - Assert(cd->Omega > 0.0); - Assert(cd->Omega < 2.0); + Assert(cd->Omega > F(0.0)); + Assert(cd->Omega < F(2.0)); KernelData * kd = (KernelData *)kernelData; KernelDataList * kdl = KDL(kernelData); @@ -276,16 +276,15 @@ void FNAME(KernelPullSplitNt2S)(LatticeDesc * ld, KernelData * kernelData, CaseD PdfT omega = cd->Omega; const PdfT omegaEven = omega; - PdfT magicParam = 1.0 / 12.0; - const PdfT omegaOdd = 1.0 / (0.5 + magicParam / (1.0 / omega - 0.5)); + PdfT magicParam = F(1.0) / F(12.0); + const PdfT omegaOdd = F(1.0) / (F(0.5) + magicParam / (F(1.0) / omega - F(0.5))); + const PdfT w_0 = F(1.0) / F( 3.0); + const PdfT w_1 = F(1.0) / F(18.0); + const PdfT w_2 = F(1.0) / F(36.0); - const PdfT w_0 = 1.0 / 3.0; - const PdfT w_1 = 1.0 / 18.0; - const PdfT w_2 = 1.0 / 36.0; - - const PdfT w_1_x3 = w_1 * 3.0; const PdfT w_1_nine_half = w_1 * 9.0 / 2.0; - const PdfT w_2_x3 = w_2 * 3.0; const PdfT w_2_nine_half = w_2 * 9.0 / 2.0; + const PdfT w_1_x3 = w_1 * F(3.0); const PdfT w_1_nine_half = w_1 * F(9.0) / F(2.0); + const PdfT w_2_x3 = w_2 * F(3.0); const PdfT w_2_nine_half = w_2 * F(9.0) / F(2.0); const VPDFT vw_1_x3 = VSET(w_1_x3); const VPDFT vw_2_x3 = VSET(w_2_x3); @@ -296,7 +295,7 @@ void FNAME(KernelPullSplitNt2S)(LatticeDesc * ld, KernelData * kernelData, CaseD const VPDFT vomegaEven = VSET(omegaEven); const VPDFT vomegaOdd = VSET(omegaOdd); - const VPDFT voneHalf 
= VSET(0.5); + const VPDFT voneHalf = VSET(F(0.5)); // uint32_t nConsecNodes = kdlr->nConsecNodes; // uint32_t * consecNodes = kdlr->ConsecNodes; diff --git a/src/BenchKernelD3Q19ListPullSplitNt1SIntrinsics.h b/src/BenchKernelD3Q19ListPullSplitNt1SIntrinsics.h index a3e586b..bb075fa 100644 --- a/src/BenchKernelD3Q19ListPullSplitNt1SIntrinsics.h +++ b/src/BenchKernelD3Q19ListPullSplitNt1SIntrinsics.h @@ -41,6 +41,9 @@ #ifdef DEBUG memset(tmpArray, -1, sizeof(PdfT) * nTmpArray * N_TMP); #endif + #ifdef INTEL_OPT_DIRECTIVES + #pragma ivdep + #endif for (int index = 0; index < indexMax; ++index) { @@ -69,7 +72,7 @@ pdf_T + pdf_TN + pdf_TE + pdf_TS + pdf_TW + pdf_B + pdf_BN + pdf_BE + pdf_BS + pdf_BW; - dir_indep_trm = dens - (ux * ux + uy * uy + uz * uz) * 3.0 / 2.0; + dir_indep_trm = dens - (ux * ux + uy * uy + uz * uz) * F(3.0) / F(2.0); w_1_indep = w_1 * dir_indep_trm; w_2_indep = w_2 * dir_indep_trm; diff --git a/src/BenchKernelD3Q19ListPullSplitNt1SScalar.h b/src/BenchKernelD3Q19ListPullSplitNt1SScalar.h index 9b833ae..8ce6269 100644 --- a/src/BenchKernelD3Q19ListPullSplitNt1SScalar.h +++ b/src/BenchKernelD3Q19ListPullSplitNt1SScalar.h @@ -69,7 +69,7 @@ pdf_T + pdf_TN + pdf_TE + pdf_TS + pdf_TW + pdf_B + pdf_BN + pdf_BE + pdf_BS + pdf_BW; - dir_indep_trm = dens - (ux * ux + uy * uy + uz * uz) * 3.0 / 2.0; + dir_indep_trm = dens - (ux * ux + uy * uy + uz * uz) * F(3.0) / F(2.0); w_1_indep = w_1 * dir_indep_trm; w_2_indep = w_2 * dir_indep_trm; @@ -89,8 +89,8 @@ w_1_indep = tmpArray[TMP_INDEX(index, TMP_W1)]; \ \ ui = _vel; \ - evenPart = omegaEven * (0.5 * (JOIN(pdf_,_dir1) + JOIN(pdf_,_dir2)) - ui * ui * w_1_nine_half - w_1_indep); \ - oddPart = omegaOdd * (0.5 * (JOIN(pdf_,_dir1) - JOIN(pdf_,_dir2)) - ui * w_1_x3); \ + evenPart = omegaEven * (F(0.5) * (JOIN(pdf_,_dir1) + JOIN(pdf_,_dir2)) - ui * ui * w_1_nine_half - w_1_indep); \ + oddPart = omegaOdd * (F(0.5) * (JOIN(pdf_,_dir1) - JOIN(pdf_,_dir2)) - ui * w_1_x3); \ dst[I(index + blockedIndex, JOIN(D3Q19_,_dir1) 
)] = JOIN(pdf_,_dir1) - evenPart - oddPart; \ tmpArray[TMP_INDEX(index, JOIN(D3Q19_,_dir2))] = JOIN(pdf_,_dir2) - evenPart + oddPart; \ } \ @@ -107,8 +107,8 @@ w_2_indep = tmpArray[TMP_INDEX(index, TMP_W2)]; \ \ ui = _expr; \ - evenPart = omegaEven * (0.5 * (JOIN(pdf_,_dir1) + JOIN(pdf_,_dir2)) - ui * ui * w_2_nine_half - w_2_indep); \ - oddPart = omegaOdd * (0.5 * (JOIN(pdf_,_dir1) - JOIN(pdf_,_dir2)) - ui * w_2_x3); \ + evenPart = omegaEven * (F(0.5) * (JOIN(pdf_,_dir1) + JOIN(pdf_,_dir2)) - ui * ui * w_2_nine_half - w_2_indep); \ + oddPart = omegaOdd * (F(0.5) * (JOIN(pdf_,_dir1) - JOIN(pdf_,_dir2)) - ui * w_2_x3); \ dst[I(index + blockedIndex, JOIN(D3Q19_,_dir1))] = JOIN(pdf_,_dir1) - evenPart - oddPart; \ tmpArray[TMP_INDEX(index, JOIN(D3Q19_,_dir2))] = JOIN(pdf_,_dir2) - evenPart + oddPart; \ } \ diff --git a/src/BenchKernelD3Q19ListPullSplitNt2SIntrinsics.h b/src/BenchKernelD3Q19ListPullSplitNt2SIntrinsics.h index 399fa5f..de858ec 100644 --- a/src/BenchKernelD3Q19ListPullSplitNt2SIntrinsics.h +++ b/src/BenchKernelD3Q19ListPullSplitNt2SIntrinsics.h @@ -41,6 +41,9 @@ #ifdef DEBUG memset(tmpArray, -1, sizeof(PdfT) * nTmpArray * N_TMP); #endif + #ifdef INTEL_OPT_DIRECTIVES + #pragma ivdep + #endif for (int index = 0; index < indexMax; ++index) { @@ -69,7 +72,7 @@ pdf_T + pdf_TN + pdf_TE + pdf_TS + pdf_TW + pdf_B + pdf_BN + pdf_BE + pdf_BS + pdf_BW; - dir_indep_trm = dens - (ux * ux + uy * uy + uz * uz) * 3.0 / 2.0; + dir_indep_trm = dens - (ux * ux + uy * uy + uz * uz) * F(3.0) / F(2.0); w_1_indep = w_1 * dir_indep_trm; w_2_indep = w_2 * dir_indep_trm; diff --git a/src/BenchKernelD3Q19ListPullSplitNt2SScalar.h b/src/BenchKernelD3Q19ListPullSplitNt2SScalar.h index ca1f3dd..c8abb1f 100644 --- a/src/BenchKernelD3Q19ListPullSplitNt2SScalar.h +++ b/src/BenchKernelD3Q19ListPullSplitNt2SScalar.h @@ -69,7 +69,7 @@ pdf_T + pdf_TN + pdf_TE + pdf_TS + pdf_TW + pdf_B + pdf_BN + pdf_BE + pdf_BS + pdf_BW; - dir_indep_trm = dens - (ux * ux + uy * uy + uz * uz) * 3.0 / 2.0; + 
dir_indep_trm = dens - (ux * ux + uy * uy + uz * uz) * F(3.0) / F(2.0); w_1_indep = w_1 * dir_indep_trm; w_2_indep = w_2 * dir_indep_trm; @@ -89,8 +89,8 @@ w_1_indep = tmpArray[TMP_INDEX(index, TMP_W1)]; \ \ ui = _vel; \ - evenPart = omegaEven * (0.5 * (JOIN(pdf_,_dir1) + JOIN(pdf_,_dir2)) - ui * ui * w_1_nine_half - w_1_indep); \ - oddPart = omegaOdd * (0.5 * (JOIN(pdf_,_dir1) - JOIN(pdf_,_dir2)) - ui * w_1_x3); \ + evenPart = omegaEven * (F(0.5) * (JOIN(pdf_,_dir1) + JOIN(pdf_,_dir2)) - ui * ui * w_1_nine_half - w_1_indep); \ + oddPart = omegaOdd * (F(0.5) * (JOIN(pdf_,_dir1) - JOIN(pdf_,_dir2)) - ui * w_1_x3); \ dst[I(index + blockedIndex, JOIN(D3Q19_,_dir1) )] = JOIN(pdf_,_dir1) - evenPart - oddPart; \ dst[I(index + blockedIndex, JOIN(D3Q19_,_dir2) )] = JOIN(pdf_,_dir2) - evenPart + oddPart; \ } @@ -104,8 +104,8 @@ w_2_indep = tmpArray[TMP_INDEX(index, TMP_W2)]; \ \ ui = _expr; \ - evenPart = omegaEven * (0.5 * (JOIN(pdf_,_dir1) + JOIN(pdf_,_dir2)) - ui * ui * w_2_nine_half - w_2_indep); \ - oddPart = omegaOdd * (0.5 * (JOIN(pdf_,_dir1) - JOIN(pdf_,_dir2)) - ui * w_2_x3); \ + evenPart = omegaEven * (F(0.5) * (JOIN(pdf_,_dir1) + JOIN(pdf_,_dir2)) - ui * ui * w_2_nine_half - w_2_indep); \ + oddPart = omegaOdd * (F(0.5) * (JOIN(pdf_,_dir1) - JOIN(pdf_,_dir2)) - ui * w_2_x3); \ dst[I(index + blockedIndex, JOIN(D3Q19_,_dir1))] = JOIN(pdf_,_dir1) - evenPart - oddPart; \ dst[I(index + blockedIndex, JOIN(D3Q19_,_dir2))] = JOIN(pdf_,_dir2) - evenPart + oddPart; \ } diff --git a/src/Config.h b/src/Config.h new file mode 100644 index 0000000..f637416 --- /dev/null +++ b/src/Config.h @@ -0,0 +1,8 @@ +#ifndef __CONFIG_H__ +#define __CONFIG_H__ + +#ifdef __INTEL_COMPILER + #define INTEL_OPT_DIRECTIVES +#endif + +#endif // __CONFIG_H__ diff --git a/src/Geometry.c b/src/Geometry.c index 0450c56..c7e3f20 100644 --- a/src/Geometry.c +++ b/src/Geometry.c @@ -32,6 +32,8 @@ #include +static const char * g_geoTypeStr[] = { "box", "channel", "pipe", "blocks", "fluid" }; + void 
GeoCreateByStr(const char * geometryType, int dims[3], int periodic[3], LatticeDesc * ld) { int type = -1; @@ -102,9 +104,9 @@ void GeoCreateByType(GEO_TYPES type, void * typeDetails, int dims[3], int period Assert(type >= GEO_TYPE_MIN); Assert(type <= GEO_TYPE_MAX); - const char * geoTypeStr[] = { "box", "channel", "pipe", "blocks", "fluid" }; + // const char * geoTypeStr[] = { "box", "channel", "pipe", "blocks", "fluid" }; - printf("# geometry: %d x %d x %d nodes, type %d %s\n", dims[0], dims[1], dims[2], type, geoTypeStr[type]); + // printf("# geometry: %d x %d x %d nodes, type %d %s\n", dims[0], dims[1], dims[2], type, geoTypeStr[type]); ld->Dims[0] = dims[0]; ld->Dims[1] = dims[1]; @@ -113,6 +115,7 @@ void GeoCreateByType(GEO_TYPES type, void * typeDetails, int dims[3], int period ld->PeriodicX = periodic[0]; ld->PeriodicY = periodic[1]; ld->PeriodicZ = periodic[2]; + ld->Name = g_geoTypeStr[type]; LatticeT * lattice; MemAlloc((void **)&lattice, sizeof(LatticeT) * dims[0] * dims[1] * dims[2]); diff --git a/src/Kernel.c b/src/Kernel.c index 88018b4..cba5166 100644 --- a/src/Kernel.c +++ b/src/Kernel.c @@ -68,21 +68,21 @@ void KernelComputeBoundaryConditions(KernelData * kd, LatticeDesc * ld, CaseData Assert(ld != NULL); Assert(cd != NULL); - Assert(cd->RhoIn > 0.0); - Assert(cd->RhoOut > 0.0); + Assert(cd->RhoIn > F(0.0)); + Assert(cd->RhoOut > F(0.0)); PdfT rho_in = cd->RhoIn; PdfT rho_out = cd->RhoOut; - PdfT rho_in_inv = 1.0 / rho_in; - PdfT rho_out_inv = 1.0 / rho_out; - PdfT indep_ux = 0.0; + PdfT rho_in_inv = F(1.0) / rho_in; + PdfT rho_out_inv = F(1.0) / rho_out; + PdfT indep_ux = F(0.0); PdfT dens; PdfT ux; - const PdfT one_third = 1.0 / 3.0; - const PdfT one_fourth = 1.0 / 4.0; - const PdfT one_sixth = 1.0 / 6.0; + const PdfT one_third = F(1.0) / F(3.0); + const PdfT one_fourth = F(1.0) / F(4.0); + const PdfT one_sixth = F(1.0) / F(6.0); PdfT pdfs[N_D3Q19]; @@ -126,10 +126,10 @@ void KernelComputeBoundaryConditions(KernelData * kd, LatticeDesc * ld, 
CaseData dens = rho_in; - ux = 1 - (pdfs[D3Q19_C] + + ux = F(1.0) - (pdfs[D3Q19_C] + (pdfs[D3Q19_T] + pdfs[D3Q19_B] + pdfs[D3Q19_S] + pdfs[D3Q19_N]) + (pdfs[D3Q19_TS] + pdfs[D3Q19_BS] + pdfs[D3Q19_TN] + pdfs[D3Q19_BN]) + - 2 * (pdfs[D3Q19_SW] + pdfs[D3Q19_TW] + pdfs[D3Q19_W] + pdfs[D3Q19_BW] + pdfs[D3Q19_NW])) * rho_in_inv; + F(2.0) * (pdfs[D3Q19_SW] + pdfs[D3Q19_TW] + pdfs[D3Q19_W] + pdfs[D3Q19_BW] + pdfs[D3Q19_NW])) * rho_in_inv; indep_ux = one_sixth * dens * ux; @@ -176,10 +176,10 @@ void KernelComputeBoundaryConditions(KernelData * kd, LatticeDesc * ld, CaseData dens = rho_out; - ux = -1 + (pdfs[D3Q19_C] + + ux = F(-1.0) + (pdfs[D3Q19_C] + (pdfs[D3Q19_T] + pdfs[D3Q19_B] + pdfs[D3Q19_S] + pdfs[D3Q19_N]) + (pdfs[D3Q19_TS] + pdfs[D3Q19_BS] + pdfs[D3Q19_TN] + pdfs[D3Q19_BN]) + - 2 * (pdfs[D3Q19_NE] + pdfs[D3Q19_BE] + pdfs[D3Q19_E] + pdfs[D3Q19_TE] + pdfs[D3Q19_SE])) * rho_out_inv; + F(2.0) * (pdfs[D3Q19_NE] + pdfs[D3Q19_BE] + pdfs[D3Q19_E] + pdfs[D3Q19_TE] + pdfs[D3Q19_SE])) * rho_out_inv; indep_ux = one_sixth * dens * ux; pdfs[D3Q19_W ] = pdfs[D3Q19_E] - one_third * dens * ux; @@ -234,13 +234,16 @@ PdfT KernelDensity(KernelData * kd, LatticeDesc * ld) kd->GetNode(kd, x, y, z, pdfs); + PdfT localDensity = F(0.0); + for(int d = 0; d < N_D3Q19; ++d) { // if (pdfs[d] < 0.0) { // printf("# %d %d %d %d < 0 %e %s\n", x, y, z, d, pdfs[d], D3Q19_NAMES[d]); // exit(1); // } - density += pdfs[d]; + localDensity += pdfs[d]; } + density += localDensity; } } @@ -259,22 +262,22 @@ void KernelSetInitialDensity(LatticeDesc * ld, KernelData * kd, CaseData * cd) PdfT rho_in = cd->RhoIn; PdfT rho_out = cd->RhoOut; - PdfT ux = 0.0; - PdfT uy = 0.0; - PdfT uz = 0.0; - PdfT dens = 1.0; + PdfT ux = F(0.0); + PdfT uy = F(0.0); + PdfT uz = F(0.0); + PdfT dens = F(1.0); PdfT omega = cd->Omega; - PdfT w_0 = 1.0 / 3.0; - PdfT w_1 = 1.0 / 18.0; - PdfT w_2 = 1.0 / 36.0; + PdfT w_0 = F(1.0) / F( 3.0); + PdfT w_1 = F(1.0) / F(18.0); + PdfT w_2 = F(1.0) / F(36.0); PdfT dir_indep_trm; - PdfT 
omega_w0 = 3.0 * w_0 * omega; - PdfT omega_w1 = 3.0 * w_1 * omega; - PdfT omega_w2 = 3.0 * w_2 * omega; - PdfT one_third = 1.0 / 3.0; + PdfT omega_w0 = F(3.0) * w_0 * omega; + PdfT omega_w1 = F(3.0) * w_1 * omega; + PdfT omega_w2 = F(3.0) * w_2 * omega; + PdfT one_third = F(1.0) / F(3.0); int nX = lDims[0]; int nY = lDims[1]; @@ -290,43 +293,43 @@ void KernelSetInitialDensity(LatticeDesc * ld, KernelData * kd, CaseData * cd) if (ld->Lattice[L_INDEX_4(ld->Dims, x, y, z)] != LAT_CELL_OBSTACLE) { // TODO: fix later. // if((caseData->geoType == GEO_TYPE_CHANNEL) || (caseData->geoType == GEO_TYPE_RCHANNEL)) - dens = rho_in + (rho_out - rho_in)*(x)/(nX-1.0); + dens = rho_in + (rho_out - rho_in) * (x) / (nX - F(1.0)); #define SQR(a) ((a)*(a)) - dir_indep_trm = one_third * dens - 0.5 * (ux * ux + uy * uy + uz * uz); + dir_indep_trm = one_third * dens - F(0.5) * (ux * ux + uy * uy + uz * uz); pdfs[D3Q19_C] = omega_w0 * (dir_indep_trm); - pdfs[D3Q19_NW] = omega_w2 * (dir_indep_trm - (ux - uy) + 1.5 * SQR(ux - uy)); - pdfs[D3Q19_SE] = omega_w2 * (dir_indep_trm + (ux - uy) + 1.5 * SQR(ux - uy)); + pdfs[D3Q19_NW] = omega_w2 * (dir_indep_trm - (ux - uy) + F(1.5) * SQR(ux - uy)); + pdfs[D3Q19_SE] = omega_w2 * (dir_indep_trm + (ux - uy) + F(1.5) * SQR(ux - uy)); - pdfs[D3Q19_NE] = omega_w2 * (dir_indep_trm + (ux + uy) + 1.5 * SQR(ux + uy)); - pdfs[D3Q19_SW] = omega_w2 * (dir_indep_trm - (ux + uy) + 1.5 * SQR(ux + uy)); + pdfs[D3Q19_NE] = omega_w2 * (dir_indep_trm + (ux + uy) + F(1.5) * SQR(ux + uy)); + pdfs[D3Q19_SW] = omega_w2 * (dir_indep_trm - (ux + uy) + F(1.5) * SQR(ux + uy)); - pdfs[D3Q19_TW] = omega_w2 * (dir_indep_trm - (ux - uz) + 1.5 * SQR(ux - uz)); - pdfs[D3Q19_BE] = omega_w2 * (dir_indep_trm + (ux - uz) + 1.5 * SQR(ux - uz)); + pdfs[D3Q19_TW] = omega_w2 * (dir_indep_trm - (ux - uz) + F(1.5) * SQR(ux - uz)); + pdfs[D3Q19_BE] = omega_w2 * (dir_indep_trm + (ux - uz) + F(1.5) * SQR(ux - uz)); - pdfs[D3Q19_TE] = omega_w2 * (dir_indep_trm + (ux + uz) + 1.5 * SQR(ux + uz)); 
- pdfs[D3Q19_BW] = omega_w2 * (dir_indep_trm - (ux + uz) + 1.5 * SQR(ux + uz)); + pdfs[D3Q19_TE] = omega_w2 * (dir_indep_trm + (ux + uz) + F(1.5) * SQR(ux + uz)); + pdfs[D3Q19_BW] = omega_w2 * (dir_indep_trm - (ux + uz) + F(1.5) * SQR(ux + uz)); - pdfs[D3Q19_TS] = omega_w2 * (dir_indep_trm - (uy - uz) + 1.5 * SQR(uy - uz)); - pdfs[D3Q19_BN] = omega_w2 * (dir_indep_trm + (uy - uz) + 1.5 * SQR(uy - uz)); + pdfs[D3Q19_TS] = omega_w2 * (dir_indep_trm - (uy - uz) + F(1.5) * SQR(uy - uz)); + pdfs[D3Q19_BN] = omega_w2 * (dir_indep_trm + (uy - uz) + F(1.5) * SQR(uy - uz)); - pdfs[D3Q19_TN] = omega_w2 * (dir_indep_trm + (uy + uz) + 1.5 * SQR(uy + uz)); - pdfs[D3Q19_BS] = omega_w2 * (dir_indep_trm - (uy + uz) + 1.5 * SQR(uy + uz)); + pdfs[D3Q19_TN] = omega_w2 * (dir_indep_trm + (uy + uz) + F(1.5) * SQR(uy + uz)); + pdfs[D3Q19_BS] = omega_w2 * (dir_indep_trm - (uy + uz) + F(1.5) * SQR(uy + uz)); - pdfs[D3Q19_N] = omega_w1 * (dir_indep_trm + uy + 1.5 * SQR(uy)); - pdfs[D3Q19_S] = omega_w1 * (dir_indep_trm - uy + 1.5 * SQR(uy)); + pdfs[D3Q19_N] = omega_w1 * (dir_indep_trm + uy + F(1.5) * SQR(uy)); + pdfs[D3Q19_S] = omega_w1 * (dir_indep_trm - uy + F(1.5) * SQR(uy)); - pdfs[D3Q19_E] = omega_w1 * (dir_indep_trm + ux + 1.5 * SQR(ux)); - pdfs[D3Q19_W] = omega_w1 * (dir_indep_trm - ux + 1.5 * SQR(ux)); + pdfs[D3Q19_E] = omega_w1 * (dir_indep_trm + ux + F(1.5) * SQR(ux)); + pdfs[D3Q19_W] = omega_w1 * (dir_indep_trm - ux + F(1.5) * SQR(ux)); - pdfs[D3Q19_T] = omega_w1 * (dir_indep_trm + uz + 1.5 * SQR(uz)); - pdfs[D3Q19_B] = omega_w1 * (dir_indep_trm - uz + 1.5 * SQR(uz)); + pdfs[D3Q19_T] = omega_w1 * (dir_indep_trm + uz + F(1.5) * SQR(uz)); + pdfs[D3Q19_B] = omega_w1 * (dir_indep_trm - uz + F(1.5) * SQR(uz)); kd->SetNode(kd, x, y, z, pdfs); @@ -343,23 +346,23 @@ void KernelSetInitialVelocity(LatticeDesc * ld, KernelData * kd, CaseData * cd) int * lDims = ld->Dims; - // TODO: ux is overriden below... 
- PdfT ux = 0.09; // caseData->initUx; - PdfT uy = 0.0; // caseData->initUy; - PdfT uz = 0.0; // caseData->initUz; - PdfT dens = 1.0; + // TODO: fix ux is overriden below + PdfT ux = F(0.0); + PdfT uy = F(0.0); + PdfT uz = F(0.0); + PdfT dens = F(1.0); PdfT omega = cd->Omega; - PdfT w_0 = 1.0 / 3.0; - PdfT w_1 = 1.0 / 18.0; - PdfT w_2 = 1.0 / 36.0; + PdfT w_0 = F(1.0) / F( 3.0); + PdfT w_1 = F(1.0) / F(18.0); + PdfT w_2 = F(1.0) / F(36.0); PdfT dir_indep_trm; - PdfT omega_w0 = 3.0 * w_0 * omega; - PdfT omega_w1 = 3.0 * w_1 * omega; - PdfT omega_w2 = 3.0 * w_2 * omega; - PdfT one_third = 1.0 / 3.0; + PdfT omega_w0 = F(3.0) * w_0 * omega; + PdfT omega_w1 = F(3.0) * w_1 * omega; + PdfT omega_w2 = F(3.0) * w_2 * omega; + PdfT one_third = F(1.0) / F(3.0); int nX = lDims[0]; int nY = lDims[1]; @@ -376,14 +379,14 @@ void KernelSetInitialVelocity(LatticeDesc * ld, KernelData * kd, CaseData * cd) if (ld->Lattice[L_INDEX_4(ld->Dims, x, y, z)] == LAT_CELL_FLUID) { - ux = 0.0; - uy = 0.0; - uz = 0.0; + ux = F(0.0); + uy = F(0.0); + uz = F(0.0); kd->GetNode(kd, x, y, z, pdfs); - density = 0.0; + density = F(0.0); #define X(name, idx, idxinv, _x, _y, _z) density += pdfs[idx]; D3Q19_LIST @@ -391,39 +394,39 @@ void KernelSetInitialVelocity(LatticeDesc * ld, KernelData * kd, CaseData * cd) #define SQR(a) ((a)*(a)) - dir_indep_trm = one_third * dens - 0.5 * (ux * ux + uy * uy + uz * uz); + dir_indep_trm = one_third * dens - F(0.5) * (ux * ux + uy * uy + uz * uz); pdfs[D3Q19_C] = omega_w0 * (dir_indep_trm); - pdfs[D3Q19_NW] = omega_w2 * (dir_indep_trm - (ux - uy) + 1.5 * SQR(ux - uy)); - pdfs[D3Q19_SE] = omega_w2 * (dir_indep_trm + (ux - uy) + 1.5 * SQR(ux - uy)); + pdfs[D3Q19_NW] = omega_w2 * (dir_indep_trm - (ux - uy) + F(1.5) * SQR(ux - uy)); + pdfs[D3Q19_SE] = omega_w2 * (dir_indep_trm + (ux - uy) + F(1.5) * SQR(ux - uy)); - pdfs[D3Q19_NE] = omega_w2 * (dir_indep_trm + (ux + uy) + 1.5 * SQR(ux + uy)); - pdfs[D3Q19_SW] = omega_w2 * (dir_indep_trm - (ux + uy) + 1.5 * SQR(ux + uy)); 
+ pdfs[D3Q19_NE] = omega_w2 * (dir_indep_trm + (ux + uy) + F(1.5) * SQR(ux + uy)); + pdfs[D3Q19_SW] = omega_w2 * (dir_indep_trm - (ux + uy) + F(1.5) * SQR(ux + uy)); - pdfs[D3Q19_TW] = omega_w2 * (dir_indep_trm - (ux - uz) + 1.5 * SQR(ux - uz)); - pdfs[D3Q19_BE] = omega_w2 * (dir_indep_trm + (ux - uz) + 1.5 * SQR(ux - uz)); + pdfs[D3Q19_TW] = omega_w2 * (dir_indep_trm - (ux - uz) + F(1.5) * SQR(ux - uz)); + pdfs[D3Q19_BE] = omega_w2 * (dir_indep_trm + (ux - uz) + F(1.5) * SQR(ux - uz)); - pdfs[D3Q19_TE] = omega_w2 * (dir_indep_trm + (ux + uz) + 1.5 * SQR(ux + uz)); - pdfs[D3Q19_BW] = omega_w2 * (dir_indep_trm - (ux + uz) + 1.5 * SQR(ux + uz)); + pdfs[D3Q19_TE] = omega_w2 * (dir_indep_trm + (ux + uz) + F(1.5) * SQR(ux + uz)); + pdfs[D3Q19_BW] = omega_w2 * (dir_indep_trm - (ux + uz) + F(1.5) * SQR(ux + uz)); - pdfs[D3Q19_TS] = omega_w2 * (dir_indep_trm - (uy - uz) + 1.5 * SQR(uy - uz)); - pdfs[D3Q19_BN] = omega_w2 * (dir_indep_trm + (uy - uz) + 1.5 * SQR(uy - uz)); + pdfs[D3Q19_TS] = omega_w2 * (dir_indep_trm - (uy - uz) + F(1.5) * SQR(uy - uz)); + pdfs[D3Q19_BN] = omega_w2 * (dir_indep_trm + (uy - uz) + F(1.5) * SQR(uy - uz)); - pdfs[D3Q19_TN] = omega_w2 * (dir_indep_trm + (uy + uz) + 1.5 * SQR(uy + uz)); - pdfs[D3Q19_BS] = omega_w2 * (dir_indep_trm - (uy + uz) + 1.5 * SQR(uy + uz)); + pdfs[D3Q19_TN] = omega_w2 * (dir_indep_trm + (uy + uz) + F(1.5) * SQR(uy + uz)); + pdfs[D3Q19_BS] = omega_w2 * (dir_indep_trm - (uy + uz) + F(1.5) * SQR(uy + uz)); - pdfs[D3Q19_N] = omega_w1 * (dir_indep_trm + uy + 1.5 * SQR(uy)); - pdfs[D3Q19_S] = omega_w1 * (dir_indep_trm - uy + 1.5 * SQR(uy)); + pdfs[D3Q19_N] = omega_w1 * (dir_indep_trm + uy + F(1.5) * SQR(uy)); + pdfs[D3Q19_S] = omega_w1 * (dir_indep_trm - uy + F(1.5) * SQR(uy)); - pdfs[D3Q19_E] = omega_w1 * (dir_indep_trm + ux + 1.5 * SQR(ux)); - pdfs[D3Q19_W] = omega_w1 * (dir_indep_trm - ux + 1.5 * SQR(ux)); + pdfs[D3Q19_E] = omega_w1 * (dir_indep_trm + ux + F(1.5) * SQR(ux)); + pdfs[D3Q19_W] = omega_w1 * (dir_indep_trm - ux + 
F(1.5) * SQR(ux)); - pdfs[D3Q19_T] = omega_w1 * (dir_indep_trm + uz + 1.5 * SQR(uz)); - pdfs[D3Q19_B] = omega_w1 * (dir_indep_trm - uz + 1.5 * SQR(uz)); + pdfs[D3Q19_T] = omega_w1 * (dir_indep_trm + uz + F(1.5) * SQR(uz)); + pdfs[D3Q19_B] = omega_w1 * (dir_indep_trm - uz + F(1.5) * SQR(uz)); #undef SQR @@ -447,7 +450,7 @@ void KernelSetInitialVelocity(LatticeDesc * ld, KernelData * kd, CaseData * cd) // static PdfT CalcXVelForPipeProfile(PdfT maxRadiusSquared, PdfT curRadiusSquared, PdfT xForce, PdfT viscosity) { - return xForce*(maxRadiusSquared - curRadiusSquared) / (2.0*viscosity); + return xForce * (maxRadiusSquared - curRadiusSquared) / (F(2.0) * viscosity); } static void KernelGetXSlice(LatticeDesc * ld, KernelData * kd, CaseData * cd, PdfT * outputArray, int xPos) @@ -462,7 +465,7 @@ static void KernelGetXSlice(LatticeDesc * ld, KernelData * kd, CaseData * cd, Pd Assert(xPos < ld->Dims[0]); - PdfT ux = 0.0; + PdfT ux = F(0.0); // Declare pdf_N, pdf_E, pdf_S, pdf_W, ... #define X(name, idx, idxinv, x, y, z) PdfT JOIN(pdf_,name); @@ -486,13 +489,13 @@ static void KernelGetXSlice(LatticeDesc * ld, KernelData * kd, CaseData * cd, Pd pdf_W - pdf_NW - pdf_SW - pdf_TW - pdf_BW; #ifdef VERIFICATION - ux += 0.5 * cd->XForce; + ux += F(0.5) * cd->XForce; #endif outputArray[y * nZ + z] = ux; } else { - outputArray[y * nZ + z] = 0.0; + outputArray[y * nZ + z] = F(0.0); } } } @@ -517,7 +520,7 @@ void KernelVerifiy(LatticeDesc * ld, KernelData * kd, CaseData * cd, PdfT * erro int nZ = ld->Dims[2]; PdfT omega = cd->Omega; - PdfT viscosity = (1.0 / omega - 0.5) / 3.0; + PdfT viscosity = (F(1.0) / omega - F(0.5)) / F(3.0); // ux averaged across cross sections in x direction PdfT * outputArray = (PdfT *)malloc(nZ * nY * sizeof(PdfT)); @@ -532,15 +535,15 @@ void KernelVerifiy(LatticeDesc * ld, KernelData * kd, CaseData * cd, PdfT * erro FILE * fh; char fileName[1024]; - PdfT tmpAvgUx = 0.0; - PdfT tmpAnalyUx = 0.0; + PdfT tmpAvgUx = F(0.0); + PdfT tmpAnalyUx = F(0.0); int 
flagEvenNy = 0; int y = 0; if (nY % 2 == 0) flagEvenNy = 1; - y = (nY-flagEvenNy-1)/2; + y = (nY - flagEvenNy - 1) / 2; snprintf(fileName, sizeof(fileName), "flow-profile.dat"); @@ -559,37 +562,37 @@ void KernelVerifiy(LatticeDesc * ld, KernelData * kd, CaseData * cd, PdfT * erro fprintf(fh, "# Plot graphically: gnuplot -e \"plot \\\"%s\\\" u 1:3 w linesp t \\\"analytical\\\", \\\"\\\" u 1:4 w linesp t \\\"simulation\\\"; pause -1;\"\n", fileName); fprintf(fh, "# z coord., radius, analytic, simulation, diff abs, diff rel, undim_analytic, undim_sim\n"); - double deviation = 0.0; - double curRadiusSquared; - double center = nY / 2.0; - double minDiameter = nY; + PdfT deviation = F(0.0); + PdfT curRadiusSquared; + PdfT center = nY / F(2.0); + PdfT minDiameter = (PdfT)nY; #define SQR(a) ((a)*(a)) - double minRadiusSquared = SQR(minDiameter / 2.0 - 1.0); + PdfT minRadiusSquared = SQR(minDiameter / F(2.0) - F(1.0)); #undef SQR - double u_max = cd->XForce*minRadiusSquared/(2.0*viscosity); + PdfT u_max = cd->XForce*minRadiusSquared / (F(2.0) * viscosity); for(int z = 0; z < nZ; ++z) { fprintf(fh, "%d\t", z); #define SQR(a) ((a)*(a)) - curRadiusSquared = SQR(z-center+0.5); + curRadiusSquared = SQR(z - center + F(0.5)); // dimensionless radius - fprintf(fh, "%e\t", (z-center+0.5)/center); + fprintf(fh, "%e\t", (z - center + F(0.5)) / center); // analytic profile if(curRadiusSquared >= minRadiusSquared) - tmpAnalyUx = 0.0; + tmpAnalyUx = F(0.0); else tmpAnalyUx = CalcXVelForPipeProfile(minRadiusSquared, curRadiusSquared, cd->XForce, viscosity); //averaged profile if(flagEvenNy == 1) - tmpAvgUx = (outputArray[y*nZ + z] + outputArray[(y+1)*nZ + z])/2.0; + tmpAvgUx = (outputArray[y * nZ + z] + outputArray[(y + 1) * nZ + z]) / F(2.0); else - tmpAvgUx = outputArray[y*nZ + z]; + tmpAvgUx = outputArray[y * nZ + z]; fprintf(fh, "%e\t", tmpAnalyUx); fprintf(fh, "%e\t", tmpAvgUx); @@ -597,7 +600,7 @@ void KernelVerifiy(LatticeDesc * ld, KernelData * kd, CaseData * cd, PdfT * erro 
fprintf(fh, "%e\t", fabs(tmpAnalyUx-tmpAvgUx)); if (tmpAnalyUx != 0.0) { fprintf(fh, "%e\t", fabs(tmpAnalyUx - tmpAvgUx) / tmpAnalyUx); - deviation += SQR(fabs(tmpAnalyUx - tmpAvgUx) / tmpAnalyUx); + deviation += SQR((PdfT)fabs(tmpAnalyUx - tmpAvgUx) / tmpAnalyUx); } else { fprintf(fh, "0.0\t"); @@ -610,7 +613,7 @@ void KernelVerifiy(LatticeDesc * ld, KernelData * kd, CaseData * cd, PdfT * erro #undef SQR } - *errorNorm = sqrt(deviation); + *errorNorm = (PdfT)sqrt(deviation); printf("# Kernel validation: L2 error norm of relative error: %e\n", *errorNorm); @@ -709,13 +712,12 @@ void KernelStatisticsAdv(KernelData * kd, LatticeDesc * ld, CaseData * cd, int i fprintf(fh, "# Average density and average x velocity over each cross section in x direction. Snapshot taken at iteration %d.\n", iteration); fprintf(fh, "# Plot on terminal: gnuplot -e \"set terminal dumb; plot \\\"%s\\\" u 1:2; plot \\\"%s\\\" u 1:3;\"\n", fileName, fileName); -// fprintf(fh, "# Plot graphically: gnuplot -e \"plot \\\"%s\\\" u 1:3 w linesp t \\\"l\\\", \\\"\\\" u 1:4 w linesp t \\\"simulation\\\"; pause -1;" fprintf(fh, "# x, avg density, avg ux\n"); for (x = 0; x < nX; ++x) { - uxSum = 0.0; - densitySum = 0.0; + uxSum = F(0.0); + densitySum = F(0.0); nFluidNodes = 0; for (int y = 0; y < nY; ++y) { @@ -764,9 +766,9 @@ void KernelAddBodyForce(KernelData * kd, LatticeDesc * ld, CaseData * cd) int nY = kd->Dims[1]; int nZ = kd->Dims[2]; - PdfT w_0 = 1.0 / 3.0; // C - PdfT w_1 = 1.0 / 18.0; // N,S,E,W,T,B - PdfT w_2 = 1.0 / 36.0; // NE,NW,SE,SW,TE,TW,BE,BW,TN,TS,BN,BS + PdfT w_0 = F(1.0) / F( 3.0); // C + PdfT w_1 = F(1.0) / F(18.0); // N,S,E,W,T,B + PdfT w_2 = F(1.0) / F(36.0); // NE,NW,SE,SW,TE,TW,BE,BW,TN,TS,BN,BS PdfT w[] = {w_1,w_1,w_1,w_1,w_2,w_2,w_2,w_2,w_1,w_2,w_2,w_2,w_2,w_1,w_2,w_2,w_2,w_2,w_0}; PdfT xForce = cd->XForce; @@ -788,9 +790,9 @@ void KernelAddBodyForce(KernelData * kd, LatticeDesc * ld, CaseData * cd) // load pdfs into temp array kd->GetNode(kd, x, y, z, pdfs); - // add body 
force in x direction ( method by Luo) + // add body force in x direction (method by Luo) for (int d = 0; d < N_D3Q19; ++d) { - pdfs[d] = pdfs[d] + 3.0*w[d]*D3Q19_X[d]*xForce; + pdfs[d] = pdfs[d] + F(3.0) * w[d] * D3Q19_X[d] * xForce; } kd->SetNode(kd, x, y, z, pdfs); diff --git a/src/Kernel.h b/src/Kernel.h index 99e126b..05ada4e 100644 --- a/src/Kernel.h +++ b/src/Kernel.h @@ -60,9 +60,15 @@ #endif +#ifdef PRECISION_DP + typedef double PdfT; +#elif defined(PRECISION_SP) + typedef float PdfT; +#else + #error PRECISION must be defined as dp or sp. +#endif -typedef double PdfT; - + #define F(number) (PdfT)(number) #define D3Q19 diff --git a/src/KernelFunctions.h b/src/KernelFunctions.h index 557c653..6efadd9 100644 --- a/src/KernelFunctions.h +++ b/src/KernelFunctions.h @@ -30,6 +30,7 @@ #include "BenchKernelD3Q19.h" #include "BenchKernelD3Q19Aa.h" #include "BenchKernelD3Q19AaVec.h" +#include "BenchKernelD3Q19AaVecSl.h" #include "BenchKernelD3Q19List.h" #include "BenchKernelD3Q19ListAa.h" #include "BenchKernelD3Q19ListAaRia.h" @@ -149,8 +150,14 @@ KernelFunctions g_kernels[] = .Name = "aa-vec-soa", .Init = D3Q19AaVecInit_AaSoA, .Deinit = D3Q19AaVecDeinit_AaSoA + }, + { + .Name = "aa-vec-sl-soa", + .Init = D3Q19AaVecSlInit_AaSoA, + .Deinit = D3Q19AaVecSlDeinit_AaSoA } + }; #endif // __KERNEL_FUNCTIONS_H__ diff --git a/src/Lattice.h b/src/Lattice.h index c23b70f..97ef1f4 100644 --- a/src/Lattice.h +++ b/src/Lattice.h @@ -51,6 +51,7 @@ typedef struct LatticeDesc_ { int PeriodicX; // Periodic in X direction. int PeriodicY; // Periodic in Y direction. int PeriodicZ; // Periodic in Z direction. + const char * Name; // Geometry Name, points to statically allocated names, do not free! 
} LatticeDesc; diff --git a/src/Main.c b/src/Main.c index 65a4035..10dfbbe 100644 --- a/src/Main.c +++ b/src/Main.c @@ -133,14 +133,14 @@ int main(int argc, char * argv[]) CaseData cd; - cd.MaxIterations = 1000; - cd.RhoIn = 1.0; - cd.RhoOut = 1.0; - cd.Omega = 1.0; + cd.MaxIterations = 10; + cd.RhoIn = F(1.0); + cd.RhoOut = F(1.0); + cd.Omega = F(1.0); cd.VtkOutput = 0; cd.VtkModulus = 100; cd.StatisticsModulus = 100; - cd.XForce = 0.00001; + cd.XForce = F(0.00001); kernelToUse = "push-soa"; Parameters p; @@ -156,7 +156,7 @@ int main(int argc, char * argv[]) printf("This program comes with ABSOLUTELY NO WARRANTY; for details see LICENSE.\n"); printf("This is free software, and you are welcome to redistribute it under certain conditions.\n"); printf("\n"); - printf("LBM Benchmark Kernels %d.%d, compiled %s %s, type: %s\n", + printf("# LBM Benchmark Kernels %d.%d, compiled %s %s, type: %s\n", LBM_BENCH_KERNELS_VERSION_MAJOR, LBM_BENCH_KERNELS_VERSION_MINOR, __DATE__, __TIME__, #ifdef VERIFICATION "verification" @@ -208,40 +208,40 @@ int main(int argc, char * argv[]) else if (ARG_IS("-rho-in") ||ARG_IS("--rho-in")) { NEXT_ARG_PRESENT(); - cd.RhoIn = strtod(argv[++i], NULL); + cd.RhoIn = F(strtod(argv[++i], NULL)); } else if (ARG_IS("-rho-out") ||ARG_IS("--rho-out")) { NEXT_ARG_PRESENT(); - cd.RhoOut = strtod(argv[++i], NULL); + cd.RhoOut = F(strtod(argv[++i], NULL)); } else if (ARG_IS("-omega") ||ARG_IS("--omega")) { NEXT_ARG_PRESENT(); - cd.Omega = strtod(argv[++i], NULL); + cd.Omega = F(strtod(argv[++i], NULL)); } else if (ARG_IS("-x-force") ||ARG_IS("--x-force")) { NEXT_ARG_PRESENT(); - cd.XForce = strtod(argv[++i], NULL); + cd.XForce = F(strtod(argv[++i], NULL)); } else if (ARG_IS("-verify") || ARG_IS("--verify")) { #ifdef VERIFICATION // Choose this preset for verification. As geometry type "box" is - // used but x and y direction are made pridoc. + // used but x and y direction are made periodic. 
// Everything else can be altered, but enough iterations should be // performed in order to receive a fully developed flow field. verify = 1; - cd.Omega = 1.0; - cd.RhoIn = 1.0; - cd.RhoOut = 1.0; + cd.Omega = F(1.0); + cd.RhoIn = F(1.0); + cd.RhoOut = F(1.0); geometryType = "box"; dims[0] = 16; dims[1] = 16; dims[2] = 16; - cd.XForce = 0.00001; + cd.XForce = F(0.00001); cd.MaxIterations = 1000; periodic[0] = 1; periodic[1] = 1; @@ -407,11 +407,10 @@ int main(int argc, char * argv[]) omp_set_num_threads(nThreads); #endif - LatticeDesc ld; - - GeoCreateByStr(geometryType, dims, periodic, &ld); - const char * defines[] = { +#ifdef DEBUG + "DEBUG", +#endif #ifdef VTK_OUTPUT "VTK_OUTPUT", #endif @@ -426,41 +425,72 @@ int main(int argc, char * argv[]) #endif #ifdef HAVE_LIKWID "HAVE_LIKWID", +#endif +#ifdef INTEL_OPT_DIRECTIVES + "INTEL_OPT_DIRECTIVES", #endif }; - printf("# defines: "); + printf("#\n"); + +#ifdef PRECISION_DP + printf("# - floating point: double precision (%lu b, PRECISION_DP defined)\n", sizeof(PdfT)); +#elif defined(PRECISION_SP) + printf("# - floating point: single precision (%lu b, PRECISION_SP defined)\n", sizeof(PdfT)); +#else + printf("# - floating point: UNKNOWN (%lu b)\n", sizeof(PdfT)); +#endif + +#ifdef VECTOR_AVX + printf("# - intrinsics: AVX (VECTOR_AVX defined)\n"); +#elif defined(VECTOR_SSE) + printf("# - intrinsics: SSE (VECTOR_SSE defined)\n"); +#else + printf("# - intrinsics: UNKNOWN\n"); +#endif + + printf("# - defines: "); for (int j = 0; j < N_ELEMS(defines); ++j) { printf("%s ", defines[j]); } printf("\n"); - printf("# nodes total: % 10d\n", ld.nObst + ld.nFluid); - printf("# nodes fluid: % 10d (including inlet & outlet)\n", ld.nFluid); - printf("# nodes obstacles: % 10d\n", ld.nObst); - printf("# nodes inlet: % 10d\n", ld.nInlet); - printf("# nodes outlet: % 10d\n", ld.nOutlet); - printf("# periodicity: x: %d y: %d z: %d\n", ld.PeriodicX, ld.PeriodicY, ld.PeriodicZ); +#ifdef __x86_64__ + printf("# - fp status: DAZ: %d FTZ: %d\n", 
FpGetDaz(), FpGetFtz()); +#endif + + printf("# - iterations: %d\n", cd.MaxIterations); + + LatticeDesc ld; + + GeoCreateByStr(geometryType, dims, periodic, &ld); + + printf("# - geometry:\n"); + printf("# type: %s\n", ld.Name); + printf("# dimensions: %d x %d x %d (x, y, z)\n", ld.Dims[0], ld.Dims[1], ld.Dims[2]); + + printf("# nodes total: %d\n", ld.nObst + ld.nFluid); + printf("# nodes fluid: %d (including inlet & outlet)\n", ld.nFluid); + printf("# nodes obstacles: %d\n", ld.nObst); + printf("# nodes inlet: %d\n", ld.nInlet); + printf("# nodes outlet: %d\n", ld.nOutlet); + printf("# periodicity: x: %d y: %d z: %d\n", ld.PeriodicX, ld.PeriodicY, ld.PeriodicZ); #ifdef VTK_OUTPUT - printf("# VTK output: %d (every %d iteration)\n", cd.VtkOutput, cd.VtkModulus); + printf("# - VTK output: %d (every %d iteration)\n", cd.VtkOutput, cd.VtkModulus); #endif #ifdef STATISTICS - printf("# statistics: every %d iteration\n", cd.StatisticsModulus); + printf("# - statistics: every %d iteration\n", cd.StatisticsModulus); #endif - printf("# omega: %f\n", cd.Omega); - printf("# initial density at inlet/outlet:\n"); - printf("# rho in: %e\n", cd.RhoIn); - printf("# rho out: %e\n", cd.RhoOut); - printf("# iterations: %d\n", cd.MaxIterations); - -#ifdef __x86_64__ - printf("# fp status: DAZ: %d FTZ: %d\n", FpGetDaz(), FpGetFtz()); -#endif + printf("# - flow:\n"); + printf("# omega: %f\n", cd.Omega); + printf("# initial density at inlet/outlet:\n"); + printf("# rho in: %e\n", cd.RhoIn); + printf("# rho out: %e\n", cd.RhoOut); #ifdef _OPENMP - printf("# OpenMP threads: %d\n", omp_get_max_threads()); + printf("# - OpenMP threads: %d\n", omp_get_max_threads()); if (pinString != NULL) { #pragma omp parallel @@ -482,7 +512,7 @@ int main(int argc, char * argv[]) #pragma omp for ordered for (int i = 0; i < omp_get_num_threads(); ++i) { #pragma omp ordered - printf("# thread %2d pinned to core(s): %s\n", threadId, cpuList); + printf("# thread %2d pinned to core(s): %s\n", threadId, cpuList); } 
free((void *)cpuList); @@ -513,7 +543,7 @@ int main(int argc, char * argv[]) } printf("#\n"); - printf("# kernel: %s\n", kf->Name); + printf("# - kernel: %s\n", kf->Name); printf("#\n"); // Initialize kernel by calling its own initialization function @@ -553,13 +583,20 @@ int main(int argc, char * argv[]) double perf = (double)ld.nFluid * (double)cd.MaxIterations / duration / 1.e6; - printf("P: %f MFLUP/s t: %d d: %f s iter: %d fnodes: %f x1e6 geo: %s kernel: %s %s\n", + printf("P: %f MFLUP/s t: %d d: %f s iter: %d fnodes: %f x1e6 geo: %s kernel: %s %s %s\n", perf, nThreads, duration, cd.MaxIterations, ld.nFluid / 1e6, geometryType, kernelToUse, #ifdef VERIFICATION - "VERIFICATION" + "VERIFICATION", +#else + "B", +#endif +#ifdef PRECISION_DP + "dp" +#elif defined(PRECISION_SP) + "sp" #else - "B" + "unknown-precision" #endif ); diff --git a/src/Makefile b/src/Makefile index 99ca902..52f0bf8 100644 --- a/src/Makefile +++ b/src/Makefile @@ -58,6 +58,9 @@ ISA ?= avx LIKWID ?= off +# Which floating point precision to use: dp (double precision) or sp (single preicision) +PRECISION ?= dp + # Global settings for the Makefile SHELL = sh @@ -91,8 +94,8 @@ SED = sed # Where to store objects and dependency files. -OBJECT_DIR = obj/$(CONFIG)-$(BUILD)$(TAG) -DEP_DIR = obj/$(CONFIG)-$(BUILD)$(TAG)-dep +OBJECT_DIR = obj/$(CONFIG)-$(BUILD)$(PREC)$(TAG) +DEP_DIR = obj/$(CONFIG)-$(BUILD)$(PREC)$(TAG)-dep # Sources to consider. 
SOURCES_C = Main.c Memory.c Geometry.c Kernel.c \ @@ -134,7 +137,9 @@ OBJ_C = $(foreach SOURCE,$(SOURCES_C),$(OBJECT_DIR)/$(SOURCE:%.c=%.o)) \ $(OBJECT_DIR)/BenchKernelD3Q19Aa_AaSoA.o \ $(OBJECT_DIR)/BenchKernelD3Q19AaCommon_AaSoA.o \ $(OBJECT_DIR)/BenchKernelD3Q19AaVec_AaSoA.o \ - $(OBJECT_DIR)/BenchKernelD3Q19AaVecCommon_AaSoA.o + $(OBJECT_DIR)/BenchKernelD3Q19AaVecCommon_AaSoA.o \ + $(OBJECT_DIR)/BenchKernelD3Q19AaVecSl_AaSoA.o \ + $(OBJECT_DIR)/BenchKernelD3Q19AaVecSlCommon_AaSoA.o OBJ = $(OBJ_C) @@ -210,6 +215,19 @@ ifeq (on,$(LIKWID)) LD_LIBS += $(LIKWID_LIB) -llikwid endif + +ifeq (dp,$(PRECISION)) + PP_FLAGS += $(D)PRECISION_DP + PREC=-dp +else +ifeq (sp,$(PRECISION)) + PP_FLAGS += $(D)PRECISION_SP + PREC=-sp +else + $(error PRECISION is only be allowed to be sp (single precision) or dp (doble precision)) +endif +endif + # ARCH can only be assigned a string without a space. The space is escaped as # a comma which we have to replace here. @@ -225,8 +243,12 @@ endif .phony: all clean clean-all -$(info $(shell $(ECHO_E) "# Configuration: CONFIG=$(COLOR_CYAN)$(CONFIG)$(COLOR_NO) BUILD=$(COLOR_CYAN)$(BUILD)$(COLOR_NO) VERIFICATION=$(COLOR_CYAN)$(VERIFICATION)$(COLOR_NO) STATISTICS=$(COLOR_CYAN)$(STATISTICS)$(COLOR_NO) VTK_OUTPUT=$(COLOR_CYAN)$(VTK_OUTPUT)$(COLOR_NO) OPENMP=$(COLOR_CYAN)$(OPENMP)$(COLOR_NO) ISA=$(COLOR_CYAN)$(ISA)$(COLOR_NO) LIKWID=$(COLOR_CYAN)$(LIKWID)$(COLOR_NO) TARCH=$(COLOR_CYAN)$(TARCH)$(COLOR_NO) building $(.DEFAULT_GOAL)...")) +#$(info $(shell $(ECHO_E) "# Configuration: CONFIG=$(COLOR_CYAN)$(CONFIG)$(COLOR_NO) BUILD=$(COLOR_CYAN)$(BUILD)$(COLOR_NO) VERIFICATION=$(COLOR_CYAN)$(VERIFICATION)$(COLOR_NO) STATISTICS=$(COLOR_CYAN)$(STATISTICS)$(COLOR_NO) VTK_OUTPUT=$(COLOR_CYAN)$(VTK_OUTPUT)$(COLOR_NO) OPENMP=$(COLOR_CYAN)$(OPENMP)$(COLOR_NO) ISA=$(COLOR_CYAN)$(ISA)$(COLOR_NO) LIKWID=$(COLOR_CYAN)$(LIKWID)$(COLOR_NO) TARCH=$(COLOR_CYAN)$(TARCH)$(COLOR_NO) building $(.DEFAULT_GOAL)...")) +$(info $(shell $(ECHO_E) "# Configuration: 
CONFIG=$(COLOR_CYAN)$(CONFIG)$(COLOR_NO) BUILD=$(COLOR_CYAN)$(BUILD)$(COLOR_NO) PRECISION=$(COLOR_CYAN)$(PRECISION)$(COLOR_NO)")) +$(info $(shell $(ECHO_E) "# OPENMP=$(COLOR_CYAN)$(OPENMP)$(COLOR_NO) ISA=$(COLOR_CYAN)$(ISA)$(COLOR_NO) LIKWID=$(COLOR_CYAN)$(LIKWID)$(COLOR_NO)")) +$(info $(shell $(ECHO_E) "# VERIFICATION=$(COLOR_CYAN)$(VERIFICATION)$(COLOR_NO) STATISTICS=$(COLOR_CYAN)$(STATISTICS)$(COLOR_NO) VTK_OUTPUT=$(COLOR_CYAN)$(VTK_OUTPUT)$(COLOR_NO)")) +$(info $(shell $(ECHO_E) "# target=$(.DEFAULT_GOAL)")) $(info # Object dir: $(OBJECT_DIR)) $(info # Dependency dir: $(DEP_DIR)) @@ -234,7 +256,7 @@ $(info # Dependency dir: $(DEP_DIR)) BIN_DIR=../bin -all: $(BIN_DIR)/lbmbenchk-$(CONFIG)-$(BUILD)$(BUILD_CONFIG)$(TAG) +all: $(BIN_DIR)/lbmbenchk-$(CONFIG)-$(BUILD)$(BUILD_CONFIG)$(PREC)$(TAG) # ------------------------------------------------------------------------ @@ -255,11 +277,14 @@ all: $(BIN_DIR)/lbmbenchk-$(CONFIG)-$(BUILD)$(BUILD_CONFIG)$(TAG) $(BIN_DIR): [ -d "$@" ] || mkdir -p "$@" -$(BIN_DIR)/lbmbenchk-$(CONFIG)-$(BUILD)$(BUILD_CONFIG)$(TAG): $(OBJ) $(REBUILD_DEPS) $(DEP_DIR)/.target | $(BIN_DIR) +$(BIN_DIR)/lbmbenchk-$(CONFIG)-$(BUILD)$(BUILD_CONFIG)$(PREC)$(TAG): $(OBJ) $(REBUILD_DEPS) $(DEP_DIR)/.target | $(BIN_DIR) @$(ECHO_E) "linking: $(COLOR_CYAN)$@$(COLOR_NO)" $(LD) $(LD_FLAGS) -o $@ $(filter-out $(REBUILD_DEPS),$^) $(LD_LIBS) @$(ECHO_E) "# Builded binary: $(COLOR_CYAN)$@$(COLOR_NO)" - @$(ECHO_E) "# Configuration was: CONFIG=$(COLOR_CYAN)$(CONFIG)$(COLOR_NO) BUILD=$(COLOR_CYAN)$(BUILD)$(COLOR_NO) VERIFICATION=$(COLOR_CYAN)$(VERIFICATION)$(COLOR_NO) STATISTICS=$(COLOR_CYAN)$(STATISTICS)$(COLOR_NO) VTK_OUTPUT=$(COLOR_CYAN)$(VTK_OUTPUT)$(COLOR_NO) OPENMP=$(COLOR_CYAN)$(OPENMP)$(COLOR_NO) ISA=$(COLOR_CYAN)$(ISA)$(COLOR_NO) LIKWID=$(COLOR_CYAN)$(LIKWID)$(COLOR_NO) target=$(.DEFAULT_GOAL)" + @$(ECHO_E) "# Configuration was: CONFIG=$(COLOR_CYAN)$(CONFIG)$(COLOR_NO) BUILD=$(COLOR_CYAN)$(BUILD)$(COLOR_NO) PRECISION=$(COLOR_CYAN)$(PRECISION)$(COLOR_NO)" + 
@$(ECHO_E) "# OPENMP=$(COLOR_CYAN)$(OPENMP)$(COLOR_NO) ISA=$(COLOR_CYAN)$(ISA)$(COLOR_NO) LIKWID=$(COLOR_CYAN)$(LIKWID)$(COLOR_NO)" + @$(ECHO_E) "# VERIFICATION=$(COLOR_CYAN)$(VERIFICATION)$(COLOR_NO) STATISTICS=$(COLOR_CYAN)$(STATISTICS)$(COLOR_NO) VTK_OUTPUT=$(COLOR_CYAN)$(VTK_OUTPUT)$(COLOR_NO)" + @$(ECHO_E) "# target=$(.DEFAULT_GOAL)" $(OBJECT_DIR)/%_SoA.o: %.c $(REBUILD_DEPS) @$(ECHO_E) "compiling: $(COLOR_CYAN)$@$(COLOR_NO) $(COLOR_MAGENTA)DATA_LAYOUT_SOA$(COLOR_NO)" @@ -308,7 +333,7 @@ $(DEP_DIR)/.target: # ------------------------------------------------------------------------ # Current configuration. -MAKE_CFG = SYSTEM=$(SYSTEM) // BUILD=$(BUILD) // MAKEOVERRIDES=\"$(strip $(MAKEOVERRIDES))\" // VERIFICATION=$(VERIFICATION) // STATISTICS=$(STATISTICS) // VTK_OUTPUT=$(VTK_OUTPUT) // VTK_OUTPUT_ASCII=$(VTK_OUTPUT_ASCII) // LID_DRIVEN_CAVITY=$(LID_DRIVEN_CAVITY) // ISA=$(ISA) // LIKWID=$(LIKWID) +MAKE_CFG = SYSTEM=$(SYSTEM) // BUILD=$(BUILD) // MAKEOVERRIDES=\"$(strip $(MAKEOVERRIDES))\" // VERIFICATION=$(VERIFICATION) // STATISTICS=$(STATISTICS) // VTK_OUTPUT=$(VTK_OUTPUT) // VTK_OUTPUT_ASCII=$(VTK_OUTPUT_ASCII) // LID_DRIVEN_CAVITY=$(LID_DRIVEN_CAVITY) // ISA=$(ISA) // LIKWID=$(LIKWID) // PRECISION=$(PRECISION) # Compare current configuration to the last one so we know when to # rebuild this system/target despite when sources have not changed. diff --git a/src/Vector.h b/src/Vector.h index 41b9a79..af12f77 100644 --- a/src/Vector.h +++ b/src/Vector.h @@ -36,48 +36,104 @@ #error Only VECTOR_AVX or VECTOR_SSE can be defined at the same time. #endif -#ifdef VECTOR_AVX +#if !defined(PRECISION_DP) && !defined(PRECISION_SP) + #error PRECISION_DP or PRECISION_SP must be defined. +#endif - #include - // Vector size in double-precision floatin-point numbers. - #define VSIZE 4 +#if defined(PRECISION_DP) && defined(PRECISION_SP) + #error Only PRECISION_DP or PRECISION_SP can be defined at the same time. 
+#endif - #define VPDFT __m256d +#ifdef PRECISION_DP - #define VSET(scalar) _mm256_set1_pd(scalar) + #ifdef VECTOR_AVX - #define VLD(expr) _mm256_load_pd(expr) - #define VLDU(expr) _mm256_loadu_pd(expr) + #include + // Vector size in double-precision floating-point numbers. + #define VSIZE 4 - #define VST(dst, src) _mm256_store_pd(dst, src) - #define VSTU(dst, src) _mm256_storeu_pd(dst, src) - #define VSTNT(dst, src) _mm256_stream_pd(dst, src) + #define VPDFT __m256d - #define VMUL(a, b) _mm256_mul_pd(a, b) - #define VADD(a, b) _mm256_add_pd(a, b) - #define VSUB(a, b) _mm256_sub_pd(a, b) -#endif + #define VSET(scalar) _mm256_set1_pd(scalar) -#ifdef VECTOR_SSE - #include - // Vector size in double-precision floatin-point numbers. - #define VSIZE 2 + #define VLD(expr) _mm256_load_pd(expr) + #define VLDU(expr) _mm256_loadu_pd(expr) - #define VPDFT __m128d + #define VST(dst, src) _mm256_store_pd(dst, src) + #define VSTU(dst, src) _mm256_storeu_pd(dst, src) + #define VSTNT(dst, src) _mm256_stream_pd(dst, src) - #define VSET(scalar) _mm_set1_pd(scalar) + #define VMUL(a, b) _mm256_mul_pd(a, b) + #define VADD(a, b) _mm256_add_pd(a, b) + #define VSUB(a, b) _mm256_sub_pd(a, b) + #endif - #define VLD(expr) _mm_load_pd(expr) - #define VLDU(expr) _mm_loadu_pd(expr) + #ifdef VECTOR_SSE + #include + // Vector size in double-precision floating-point numbers. 
+ #define VSIZE 2 - #define VST(dst, src) _mm_store_pd(dst, src) - #define VSTU(dst, src) _mm_storeu_pd(dst, src) - #define VSTNT(dst, src) _mm_stream_pd(dst, src) + #define VPDFT __m128d - #define VMUL(a, b) _mm_mul_pd(a, b) - #define VADD(a, b) _mm_add_pd(a, b) - #define VSUB(a, b) _mm_sub_pd(a, b) -#endif + #define VSET(scalar) _mm_set1_pd(scalar) + + #define VLD(expr) _mm_load_pd(expr) + #define VLDU(expr) _mm_loadu_pd(expr) + + #define VST(dst, src) _mm_store_pd(dst, src) + #define VSTU(dst, src) _mm_storeu_pd(dst, src) + #define VSTNT(dst, src) _mm_stream_pd(dst, src) + + #define VMUL(a, b) _mm_mul_pd(a, b) + #define VADD(a, b) _mm_add_pd(a, b) + #define VSUB(a, b) _mm_sub_pd(a, b) + #endif + +#elif defined(PRECISION_SP) + + #ifdef VECTOR_AVX + + #include + // Vector size in double-precision floating-point numbers. + #define VSIZE 8 + + #define VPDFT __m256 + + #define VSET(scalar) _mm256_set1_ps(scalar) + + #define VLD(expr) _mm256_load_ps(expr) + #define VLDU(expr) _mm256_loadu_ps(expr) + + #define VST(dst, src) _mm256_store_ps(dst, src) + #define VSTU(dst, src) _mm256_storeu_ps(dst, src) + #define VSTNT(dst, src) _mm256_stream_ps(dst, src) + + #define VMUL(a, b) _mm256_mul_ps(a, b) + #define VADD(a, b) _mm256_add_ps(a, b) + #define VSUB(a, b) _mm256_sub_ps(a, b) + #endif + + #ifdef VECTOR_SSE + #include + // Vector size in double-precision floating-point numbers. 
+ #define VSIZE 4 + + #define VPDFT __m128 + + #define VSET(scalar) _mm_set1_ps(scalar) + + #define VLD(expr) _mm_load_ps(expr) + #define VLDU(expr) _mm_loadu_ps(expr) + + #define VST(dst, src) _mm_store_ps(dst, src) + #define VSTU(dst, src) _mm_storeu_ps(dst, src) + #define VSTNT(dst, src) _mm_stream_ps(dst, src) + + #define VMUL(a, b) _mm_mul_ps(a, b) + #define VADD(a, b) _mm_add_ps(a, b) + #define VSUB(a, b) _mm_sub_ps(a, b) + #endif +#endif // PRECISION #endif // __VECTOR_H__ diff --git a/src/test.sh b/src/test.sh index a0e8ad6..121de49 100755 --- a/src/test.sh +++ b/src/test.sh @@ -30,6 +30,9 @@ set -e XTag="-test" +# How many parallel processes during make. +NProc="10" + Build=release if [ "$#" -lt 1 ]; then @@ -53,11 +56,54 @@ fi Config="$1" make clean-all -make -j CONFIG=$Config TAG=$XTag-debug -make -j CONFIG=$Config BUILD=$Build TAG=$XTag-v VERIFICATION=on -make -j CONFIG=$Config BUILD=$Build TAG=$XTag-b BENCHMARK=on -BinaryV="../bin/lbmbenchk-$Config-$Build$XTag-v" -BinaryB="../bin/lbmbenchk-$Config-$Build$XTag-b" +make -j $NProc PRECISION=dp CONFIG=$Config TAG=$XTag-debug +make -j $NProc PRECISION=dp CONFIG=$Config BUILD=$Build TAG=$XTag-v VERIFICATION=on +make -j $NProc PRECISION=dp CONFIG=$Config BUILD=$Build TAG=$XTag-b BENCHMARK=on + +BinaryVDp="../bin/lbmbenchk-$Config-$Build-dp$XTag-v" +BinaryBDp="../bin/lbmbenchk-$Config-$Build-dp$XTag-b" + + +make -j $NProc PRECISION=sp CONFIG=$Config TAG=$XTag-debug +make -j $NProc PRECISION=sp CONFIG=$Config BUILD=$Build TAG=$XTag-v VERIFICATION=on +make -j $NProc PRECISION=sp CONFIG=$Config BUILD=$Build TAG=$XTag-b BENCHMARK=on + +BinaryVSp="../bin/lbmbenchk-$Config-$Build-sp$XTag-v" +BinaryBSp="../bin/lbmbenchk-$Config-$Build-sp$XTag-b" + + +echo "#" +echo "# [test.sh] ./test-verification.sh \"$BinaryVDp\"" +echo "#" + +./test-verification.sh "$BinaryVDp" + +ExitCodeDp="$?" 
+ +echo "#" +echo "# [test.sh] ./test-verification.sh \"$BinaryVSp\"" +echo "#" + +./test-verification.sh "$BinaryVSp" + +ExitCodeSp="$?" + +ResultDp="errors occurred" +ResultSp="errors occurred" + +if [ "$ExitCodeDp" == "0" ]; then ResultDp="OK"; fi +if [ "$ExitCodeSp" == "0" ]; then ResultSp="OK"; fi + +echo "#" +echo "# [test.sh] test double precision: $ResultDp single precision: $ResultSp" +echo "#" + +ExitCode="0" + +if [ "$ExitCodeDp" != 0 -o "$ExitCodeSp" != 0 ]; then + ExitCode="1" +fi + +exit "$ExitCode" -./test-verification.sh "$BinaryV"
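The precision switch that this patch adds to src/Kernel.h (the `PdfT` typedef selected by `PRECISION_DP` or `PRECISION_SP`, plus the `F()` literal cast) can be sketched in isolation as below. This is a minimal sketch under stated assumptions, not the project's code: the fallback to double precision when no flag is given is an assumption made only so the snippet compiles stand-alone (the patched header stops with `#error` in that case), and the helper names `PrecisionName`, `ViscosityFromOmega`, `PipeProfileUx` and `RelativeError` are hypothetical. The formulas follow the expressions visible in the Kernel.c hunks.

```c
#include <assert.h>

/* Assumption for this sketch only: default to double precision when no
 * flag is passed. The patched Kernel.h raises #error instead. */
#if !defined(PRECISION_DP) && !defined(PRECISION_SP)
  #define PRECISION_DP
#endif

#ifdef PRECISION_DP
  typedef double PdfT;
#else
  typedef float PdfT;
#endif

/* Same cast macro as in the patch: keeps literals in the working precision
 * so single precision builds do not silently promote to double. */
#define F(number) (PdfT)(number)

/* Hypothetical helper: suffix matching the -dp / -sp binary naming. */
static const char * PrecisionName(void)
{
#ifdef PRECISION_DP
    return "dp";
#else
    return "sp";
#endif
}

/* Hypothetical helper: viscosity from the relaxation parameter omega,
 * using the expression from the verification hunk. */
static PdfT ViscosityFromOmega(PdfT omega)
{
    assert(omega > F(0.0));
    return (F(1.0) / omega - F(0.5)) / F(3.0);
}

/* Analytic x velocity of the pipe profile; same formula as
 * CalcXVelForPipeProfile in the patch, renamed to mark it as a sketch. */
static PdfT PipeProfileUx(PdfT maxRadiusSquared, PdfT curRadiusSquared,
                          PdfT xForce, PdfT viscosity)
{
    return xForce * (maxRadiusSquared - curRadiusSquared) / (F(2.0) * viscosity);
}

/* Hypothetical helper: relative error between analytic and simulated
 * velocity; the caller must ensure analytic != 0, as the patch does. */
static PdfT RelativeError(PdfT analytic, PdfT simulated)
{
    PdfT diff = analytic - simulated;
    if (diff < F(0.0)) diff = -diff;
    return diff / analytic;
}
```

Compiling the same file with `-DPRECISION_SP` switches every helper to `float` without touching the arithmetic, which is the point of routing all literals through `F()`. For the verification preset in Main.c (omega of 1.0, x force of 0.00001, 16 nodes across), one would call `ViscosityFromOmega(F(1.0))` and pass the result to `PipeProfileUx` together with the squared radii.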