Vector scatter/gather support
Some modern processors support scatter/gather operations in hardware, which makes it possible to vectorize even when the stride between consecutive elements is not 1. Supporting this in pystencils turned out to be surprisingly easy. On a Core i7-7820X, the D3Q19 TRT benchmark shows an appreciable performance benefit for cases with a non-ideal memory layout:
- 15% for fzyx without assume_inner_stride_one and with split
- 20% for fzyx without assume_inner_stride_one
- 30% for zyxf
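Why these layouts count as non-ideal can be seen from the element strides. The following is a minimal sketch, not pystencils code: `element_strides` and the 64x64x64 domain size are made up for illustration, and it assumes that a layout string lists the axes from slowest- to fastest-varying.

```python
# Illustrative only: element strides of a dense 4D field for a layout string.
# Assumption: the layout string orders axes from slowest to fastest, so
# "fzyx" keeps x contiguous and "zyxf" keeps the f index contiguous.

def element_strides(layout, shape):
    """Return {axis: stride in elements} for a dense array.

    layout -- axis names ordered from slowest to fastest, e.g. "zyxf"
    shape  -- dict mapping each axis name to its extent
    """
    strides = {}
    stride = 1
    for axis in reversed(layout):  # walk from the fastest axis outwards
        strides[axis] = stride
        stride *= shape[axis]
    return strides


# Hypothetical domain size; f = 19 matches the D3Q19 stencil.
shape = {"x": 64, "y": 64, "z": 64, "f": 19}

soa = element_strides("fzyx", shape)
aos = element_strides("zyxf", shape)

print(soa["x"])  # stride between x-neighbours in fzyx
print(aos["x"])  # stride between x-neighbours in zyxf
```

With zyxf, neighbouring x cells are 19 elements apart, so a loop over x cannot use contiguous vector loads. With fzyx the stride is 1, but without assume_inner_stride_one the generated code presumably cannot rely on that and has to treat the stride as a runtime value.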
AVX2 only supports gather, not scatter, so this feature requires AVX512. Gather was reportedly quite slow on AVX2 processors anyway. Even on AVX512, the latency is quite high and the throughput quite low, but it's still better than not vectorizing. SVE also supports scatter/gather, so future ARM processors that implement it should benefit too, and they will probably have better hardware support for higher throughput.
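For readers unfamiliar with the instructions, here is a scalar emulation of what a single gather and a single scatter do. It is a conceptual sketch only, not the code pystencils generates; the `gather` and `scatter` helpers, the lane count of 8 (eight doubles per 512-bit register), and the stride of 19 are chosen purely for illustration.

```python
# Scalar emulation of one gather / scatter instruction (conceptual sketch).

def gather(memory, base, stride, width):
    """Load `width` elements starting at `base`, `stride` elements apart."""
    return [memory[base + lane * stride] for lane in range(width)]


def scatter(memory, base, stride, values):
    """Store `values` starting at `base`, `stride` elements apart."""
    for lane, value in enumerate(values):
        memory[base + lane * stride] = value


# Hypothetical example: 8 lanes and a stride of 19 elements, as for
# x-neighbours of a D3Q19 field whose f index is contiguous.
WIDTH, STRIDE = 8, 19
memory = [float(i) for i in range(WIDTH * STRIDE)]

vec = gather(memory, base=0, stride=STRIDE, width=WIDTH)
vec = [2.0 * v for v in vec]  # stand-in for the element-wise vector work
scatter(memory, base=0, stride=STRIDE, values=vec)
```

A kernel that both reads and writes strided data needs both halves, which is why gather alone is not enough.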
Fixes #34 (closed)
Activity
I came up with a nice little visualization to show the performance gains for different choices of vectorization parameters. When you hover over one of the data points, it shows you the parameters. Unfortunately, the cluster I ran this on is a bit noisy (some measurements have error bars of multiple MLUPS) and won't let me set CPU frequencies manually. You can still see, though, that simulations with assume_inner_stride_one=False or layout="zyxf" benefit significantly. Note that none of the sub-100% outliers even had their code changed by my merge request; it's just pure noise.
Edited by Michael Kuron
mentioned in commit 8f72741d
Here is another plot from a Core i7-7820X, on which I disabled turbo boost and fixed the clock at 3.5 GHz (the maximum AVX512 frequency). Again, there is nothing systematic about the data points below 100%. Most error bars are smaller now.
Edited by Michael Kuron
mentioned in merge request !345 (merged)