Improve non-temporal stores
ARM doesn't have real non-temporal stores like x86 does. It does, however, have a special instruction (`dc zva`) that sets an entire cache line to zero without first reading it from memory, thus saving memory bandwidth. This pull request makes use of it.
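For reference, a minimal sketch of the pattern on AArch64 (this is not the generated kernel; the helper and function names, the 64-byte block size, and the fill value are assumptions for illustration — the real block size is reported by `DCZID_EL0`, and `dc zva` must be enabled for user space). The line is zeroed once, right before the first store into it, and only when the whole line lies inside the destination buffer:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: zero one cache line starting at `ptr` without reading
// its previous contents from memory. Assumes a 64-byte DC ZVA block size and
// that DC ZVA is permitted at EL0 (check DCZID_EL0 on the target core).
inline void zero_cache_line(void *ptr) {
    asm volatile("dc zva, %0" : : "r"(ptr) : "memory");
}

// Illustrative fill loop: zero each line before the first store into it, but
// only when the full 64-byte line (8 doubles) lies inside the buffer.
void fill_nt(double *dst, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        auto addr = reinterpret_cast<std::uintptr_t>(dst + i);
        if ((addr & 63u) == 0 && i + 8 <= n)
            zero_cache_line(dst + i);
        dst[i] = 1.0;   // regular stores then overwrite the zeroed line
    }
}
```

This only compiles for AArch64 targets; on other architectures the line-zeroing branch simply has no equivalent.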
The other important property of non-temporal stores is that they don't pollute the cache. Some architectures, such as PowerPC, can emulate that with a special store instruction that flushes a cache line from the cache. Nothing comparable is available on ARM, but this pull request adds support anyway.
Furthermore, this pull request adds fence instructions for non-temporal stores on x86. They ensure that everything has actually arrived in memory before the kernel returns, so subsequent memory accesses will not see stale data. This was not an issue in practice, since the overhead of exiting and calling a kernel usually left enough time for the data to arrive in memory, but it was not guaranteed.
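As a rough illustration of the x86 side (a sketch, not the code this pull request generates; the function name and the alignment assumptions are made up for the example): a loop of SSE2 streaming stores needs an `sfence` before control returns to the caller, because streaming stores are weakly ordered.

```cpp
#include <immintrin.h>
#include <cstddef>

// Sketch only: copy n doubles using SSE2 non-temporal stores.
// Assumes dst is 16-byte aligned and n is a multiple of 2.
void copy_stream(double *dst, const double *src, std::size_t n) {
    for (std::size_t i = 0; i < n; i += 2) {
        __m128d v = _mm_loadu_pd(src + i);  // ordinary load
        _mm_stream_pd(dst + i, v);          // non-temporal store, bypasses the cache
    }
    // Streaming stores are weakly ordered; the fence makes them globally
    // visible before the kernel returns to the caller.
    _mm_sfence();
}
```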
Note that non-temporal stores are not guaranteed to be faster than regular stores on all processors. For example, on my Apple M1 the non-temporal stores are actually slightly slower, so I did not gain anything by implementing this. This is not due to the overhead of checking whether a vector is the first one in a cache line (which can easily be confirmed by replacing `dc zva` with `nop`), but likely an artifact of Apple implementing `dc zva` differently than the ARM documentation suggests. So on ARM, one should always compare regular and non-temporal stores and use whichever is faster.
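One possible way to make that comparison is a throwaway micro-benchmark along these lines (AArch64 only; the kernels, buffer size, and repetition count are placeholders and not part of this pull request — the `dc zva` variant just repeats the pattern from the sketch above so the example is self-contained):

```cpp
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <vector>

// Plain store loop and a dc zva variant (same pattern as the sketch above).
static void fill_regular(double *dst, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = 1.0;
}

static void fill_nt(double *dst, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        auto addr = reinterpret_cast<std::uintptr_t>(dst + i);
        if ((addr & 63u) == 0 && i + 8 <= n)   // start of a full 64-byte line
            asm volatile("dc zva, %0" : : "r"(dst + i) : "memory");
        dst[i] = 1.0;
    }
}

// Time a kernel over a buffer much larger than the last-level cache,
// so the measurement is dominated by memory traffic.
template <typename Kernel>
static double run(Kernel kernel, std::size_t n, int reps = 20) {
    std::vector<double> buf(n);
    auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; ++r)
        kernel(buf.data(), n);
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count();
}

int main() {
    const std::size_t n = std::size_t(1) << 26;   // 512 MiB of doubles
    std::printf("regular stores: %.3f s\n", run(fill_regular, n));
    std::printf("dc zva variant: %.3f s\n", run(fill_nt, n));
}
```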
Fixes #25 (closed). Supersedes !225 (closed).
Merge request reports
Activity
assigned to @holzer
Non-temporal stores are really complicated, I think. I ran some benchmarks using a homogeneous 27-point stencil on an AMD EPYC 7452 “Rome” CPU (32 cores + SMT). It turns out that using NT stores reduces the single-core performance by more than a factor of 2. Additionally, a simple benchmark kernel consisting only of NT stores performs extremely well on that machine. It performs so well that one explanation for the poor performance of our 27-point stencil might be that the NT stores get completely out of sync with the rest of the kernel, which introduces significant overhead in our benchmark.
However, despite the poor single-core performance, using NT stores reduces the amount of data the kernel has to move, since the destination cache lines no longer need to be read from memory before being overwritten. This means the kernel might still end up with higher performance when scaling up to a full NUMA domain, and thus using NT stores would still make sense. In the end, what we want on a multicore processor is to saturate the bandwidth of a full NUMA domain, because that is as good as we can get. If a kernel with poor single-core performance can still do this while needing less memory traffic overall, it wins.
But all of this is subject to careful benchmarking and thus remains interesting on every architecture.
This means the kernel might still end up with higher performance when scaling up to a full NUMA domain, and thus using NT stores would still make sense.
Agreed. It's about 10 percent slower here on one core, and while it previously saturated between 2 and 3 cores, it no longer saturates even with 4 cores. But this is a CPU specifically designed for laptops, and I would expect HPC CPUs to behave more traditionally. As you said, it's always necessary to benchmark.
added 1 commit
- 03a12c7a - remove stream from instruction sets that don't have it
added 1 commit
- 8f92d147 - remove stream from instruction sets that don't have it
mentioned in commit 147f6901
mentioned in commit c5712bcb
mentioned in merge request !231 (merged)
mentioned in commit 589ca872
mentioned in merge request walberla/walberla!448 (merged)
mentioned in merge request !242 (merged)