+Kernels
+-------
+
+The following list briefly describes the available kernels:
+
+- push-soa/push-aos/pull-soa/pull-aos:
+ Unoptimized kernels (although stream and collide are already fused) using two
+ grids as source and destination. They implement push/pull semantics as well
+ as the structure of arrays (soa) or array of structures (aos) layout.
+
+- blk-push-soa/blk-push-aos/blk-pull-soa/blk-pull-aos:
+ The same as the unoptimized kernels without the blk prefix, except that they support
+ spatial blocking, i.e. loop blocking of the three loops used to iterate over
+ the lattice. Here manual work sharing for OpenMP is used.
+
+- list-push-soa/list-push-aos/list-pull-soa/list-pull-aos:
+ The same as the unoptimized kernels without the list prefix, but with indirect
+ addressing. Here only a 1D vector is used to store the fluid nodes, omitting
+ the obstacles. An adjacency list is used to recover the neighborhood
+ associations.
+
+- list-pull-split-nt-1s-soa/list-pull-split-nt-2s-soa:
+ Optimized variant of list-pull-soa. Chunks of the lattice are processed at
+ once. Post-collision values are written back via non-temporal stores in 18 (1s)
+ or 9 (2s) loops.
+
+- list-aa-aos/list-aa-soa:
+ Unoptimized implementation of the AA pattern for the 1D vector with adjacency
+ list. Both the array of structures (aos) and the structure of arrays (soa)
+ data layout are supported.
+
+- list-aa-ria-soa:
+ Implementation of the AA pattern with intrinsics for the 1D vector with
+ adjacency list. Furthermore, it contains a vectorized even time step and
+ run-length coding to reduce the loop balance of the odd time step.
+
+- list-aa-pv-soa:
+ Includes all optimizations of list-aa-ria-soa. Additionally, the odd time
+ step is partially vectorized.
+
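+The difference between the soa and aos layouts mentioned above comes down to
+how the index of a single PDF is computed. The following is a minimal sketch
+for illustration, not the benchmark's actual code; the function names and the
+node count are hypothetical.
+

```python
N_DIRS = 19  # D3Q19: 19 PDFs per lattice node


def index_soa(node, direction, n_nodes):
    """SoA: all PDFs of one direction are stored contiguously."""
    return direction * n_nodes + node


def index_aos(node, direction, n_dirs=N_DIRS):
    """AoS: all PDFs of one node are stored contiguously."""
    return node * n_dirs + direction


n_nodes = 4  # hypothetical, tiny lattice

# Distance in memory between the same direction of two neighboring nodes.
soa_stride = index_soa(1, 0, n_nodes) - index_soa(0, 0, n_nodes)
aos_stride = index_aos(1, 0) - index_aos(0, 0)
```

+
+In this sketch, the same direction of consecutive nodes is adjacent in memory
+with soa, whereas with aos the 19 PDFs of one node are adjacent.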
+
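+The indirect addressing used by the list kernels can be sketched as follows.
+This is an illustrative example under simplifying assumptions, not the
+benchmark's implementation: the geometry is a hypothetical 2D grid, only four
+neighbor directions are used instead of D3Q19, and missing neighbors are
+simply marked with -1.
+

```python
# Hypothetical 2D geometry: 0 = fluid, 1 = obstacle.
geometry = [
    [0, 0, 1],
    [0, 1, 0],
]

# Simplified stencil with four neighbors (dx, dy); the kernels use D3Q19.
DIRECTIONS = [(1, 0), (-1, 0), (0, 1), (0, -1)]


def build_fluid_list(geometry):
    """Number the fluid nodes consecutively; obstacles get no entry."""
    node_id = {}
    for y, row in enumerate(geometry):
        for x, flag in enumerate(row):
            if flag == 0:
                node_id[(x, y)] = len(node_id)
    return node_id


def build_adjacency(node_id):
    """For each fluid node and direction, store the neighbor's 1D index.

    Neighbors that are obstacles or lie outside the domain are marked
    with -1 in this sketch; a real kernel has to treat them via its
    boundary condition.
    """
    adjacency = []
    for (x, y) in node_id:  # dicts keep insertion order, i.e. 1D index order
        for dx, dy in DIRECTIONS:
            adjacency.append(node_id.get((x + dx, y + dy), -1))
    return adjacency


node_id = build_fluid_list(geometry)
adjacency = build_adjacency(node_id)
```

+
+Here ``node_id`` maps the four fluid nodes to the 1D indices 0 to 3, and
+``adjacency`` holds four entries per fluid node.
+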
+Note that all array of structures (aos) kernels might require blocking
+(depending on the domain size) to reach the performance of their structure of
+arrays (soa) counterparts.
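+
+Spatial blocking, i.e. blocking of the three loops over the lattice, can be
+sketched as below. This is a minimal illustration with hypothetical names and
+sizes; it does not include the OpenMP work sharing of the actual kernels.
+

```python
def blocked_nodes(dims, block):
    """Yield (x, y, z) for a lattice of size dims, traversed block by block."""
    nx, ny, nz = dims
    bx, by, bz = block
    # Outer loops iterate over the origins of the blocks.
    for x0 in range(0, nx, bx):
        for y0 in range(0, ny, by):
            for z0 in range(0, nz, bz):
                # Inner loops stay within one block; min() clips blocks
                # that would extend past the domain boundary.
                for x in range(x0, min(x0 + bx, nx)):
                    for y in range(y0, min(y0 + by, ny)):
                        for z in range(z0, min(z0 + bz, nz)):
                            yield (x, y, z)


# Hypothetical lattice and block sizes.
visited = list(blocked_nodes(dims=(5, 4, 3), block=(2, 2, 2)))
```

+
+Every node is still visited exactly once; only the traversal order changes,
+so that nodes of one block are processed together.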
+
+The following table summarizes the properties of the kernels. Here **D** means
+direct addressing, i.e. full array, **I** means indirect addressing, i.e. 1D
+vector with adjacency list, **x** means supported, whereas **--** means unsupported.
+The loop balance B_l is computed for the D3Q19 model with double precision
+floating point for PDFs (8 byte) and 4 byte integers for the index (adjacency
+list). As list-aa-ria-soa and list-aa-pv-soa support run-length coding, their
+effective loop balance depends on the geometry. The effective loop balance is
+printed during each run.
+
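+The fixed loop balance values in the table can be reproduced with simple
+arithmetic. The sketch below is one possible accounting and rests on
+assumptions not stated in this document: regular stores cause an additional
+write-allocate transfer, non-temporal stores and the in-place AA pattern do
+not, and the AA pattern reads the adjacency list only in every second time
+step. The function and parameter names are hypothetical. The
+geometry-dependent range of the run-length coded kernels is not modeled.
+

```python
Q = 19           # D3Q19: PDFs per lattice node
PDF_BYTES = 8    # double precision
INDEX_BYTES = 4  # integer entries of the adjacency list


def loop_balance(write_allocate=True, index_reads_per_update=0.0):
    """Bytes moved per fluid lattice node update (B/FLUP)."""
    loads = Q * PDF_BYTES
    stores = Q * PDF_BYTES
    # Assumption: a regular store first loads the destination cache line.
    allocate = Q * PDF_BYTES if write_allocate else 0
    return loads + stores + allocate + index_reads_per_update * INDEX_BYTES


# Two grids, direct addressing.
direct = loop_balance()
# Two grids, indirect addressing: 18 adjacency entries per update.
indirect = loop_balance(index_reads_per_update=Q - 1)
# Indirect addressing with non-temporal stores (no write-allocate).
nontemporal = loop_balance(write_allocate=False, index_reads_per_update=Q - 1)
# AA pattern: in place, adjacency list needed only every second time step.
aa = loop_balance(write_allocate=False, index_reads_per_update=(Q - 1) / 2)
```

+
+Under these assumptions the four values evaluate to 456, 528, 376, and 340
+B/FLUP, matching the corresponding rows of the table.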
+
+====================== =========== =========== ===== ======== ======== ============
+kernel name            prop. step  data layout addr. parallel blocking B_l [B/FLUP]
+====================== =========== =========== ===== ======== ======== ============
+push-soa               OS          SoA         D     x        --       456
+push-aos               OS          AoS         D     x        --       456
+pull-soa               OS          SoA         D     x        --       456
+pull-aos               OS          AoS         D     x        --       456
+blk-push-soa           OS          SoA         D     x        x        456
+blk-push-aos           OS          AoS         D     x        x        456
+blk-pull-soa           OS          SoA         D     x        x        456
+blk-pull-aos           OS          AoS         D     x        x        456
+list-push-soa          OS          SoA         I     x        x        528
+list-push-aos          OS          AoS         I     x        x        528
+list-pull-soa          OS          SoA         I     x        x        528
+list-pull-aos          OS          AoS         I     x        x        528
+list-pull-split-nt-1s  OS          SoA         I     x        x        376
+list-pull-split-nt-2s  OS          SoA         I     x        x        376
+list-aa-soa            AA          SoA         I     x        x        340
+list-aa-aos            AA          AoS         I     x        x        340
+list-aa-ria-soa        AA          SoA         I     x        x        304-342
+list-aa-pv-soa         AA          SoA         I     x        x        304-342
+====================== =========== =========== ===== ======== ======== ============