Improve non-temporal stores
ARM doesn't have real non-temporal stores like x86 does. It does, however, have a special instruction (`dc zva`) that sets an entire cache line to zero without first reading it from memory, thus saving memory bandwidth. This pull request makes use of it.
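For reference, a minimal sketch of the pattern on AArch64 (this is not the generated kernel; the helper and function names, the 64-byte block size, and the fill value are assumptions for illustration — the real block size is reported by `DCZID_EL0`, and `dc zva` must be enabled for user space). The line is zeroed once, right before the first store into it, and only when the whole line lies inside the destination buffer:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: zero one cache line starting at `ptr` without reading
// its previous contents from memory. Assumes a 64-byte DC ZVA block size and
// that DC ZVA is permitted at EL0 (check DCZID_EL0 on the target core).
inline void zero_cache_line(void *ptr) {
    asm volatile("dc zva, %0" : : "r"(ptr) : "memory");
}

// Illustrative fill loop: zero each line before the first store into it, but
// only when the full 64-byte line (8 doubles) lies inside the buffer.
void fill_nt(double *dst, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        auto addr = reinterpret_cast<std::uintptr_t>(dst + i);
        if ((addr & 63u) == 0 && i + 8 <= n)
            zero_cache_line(dst + i);
        dst[i] = 1.0;   // regular stores then overwrite the zeroed line
    }
}
```

This only compiles for AArch64 targets; on other architectures the line-zeroing branch simply has no equivalent.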
The other important property of non-temporal stores is that they don't pollute the cache. Some architectures, such as PowerPC, can emulate that with a special store instruction that flushes a cache line from the cache. Nothing comparable is available on ARM, but this pull request adds support anyway.
Furthermore, this pull request adds fence instructions for non-temporal stores on x86. They ensure that everything has actually arrived in memory before the kernel returns, so subsequent memory accesses will not see stale data. This was not an issue in practice, since the overhead of exiting and calling a kernel usually left enough time for the data to arrive in memory, but it was not guaranteed.
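As a rough illustration of the x86 side (a sketch, not the code this pull request generates; the function name and the alignment assumptions are made up for the example): a loop of SSE2 streaming stores needs an `sfence` before control returns to the caller, because streaming stores are weakly ordered.

```cpp
#include <immintrin.h>
#include <cstddef>

// Sketch only: copy n doubles using SSE2 non-temporal stores.
// Assumes dst is 16-byte aligned and n is a multiple of 2.
void copy_stream(double *dst, const double *src, std::size_t n) {
    for (std::size_t i = 0; i < n; i += 2) {
        __m128d v = _mm_loadu_pd(src + i);  // ordinary load
        _mm_stream_pd(dst + i, v);          // non-temporal store, bypasses the cache
    }
    // Streaming stores are weakly ordered; the fence makes them globally
    // visible before the kernel returns to the caller.
    _mm_sfence();
}
```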
Note that non-temporal stores are not guaranteed to be faster than regular stores on all processors. For example, on my Apple M1 the non-temporal stores are actually slightly slower, so I did not gain anything by implementing this. This is not due to the overhead of checking whether a vector is the first one in a cache line (which can easily be confirmed by replacing `dc zva` with `nop`), but likely an artifact of Apple implementing `dc zva` differently than the ARM documentation suggests. So on ARM, one should always compare regular and non-temporal stores and use whichever is faster.
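One possible way to make that comparison is a throwaway micro-benchmark along these lines (AArch64 only; the kernels, buffer size, and repetition count are placeholders and not part of this pull request — the `dc zva` variant just repeats the pattern from the sketch above so the example is self-contained):

```cpp
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <vector>

// Plain store loop and a dc zva variant (same pattern as the sketch above).
static void fill_regular(double *dst, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = 1.0;
}

static void fill_nt(double *dst, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        auto addr = reinterpret_cast<std::uintptr_t>(dst + i);
        if ((addr & 63u) == 0 && i + 8 <= n)   // start of a full 64-byte line
            asm volatile("dc zva, %0" : : "r"(dst + i) : "memory");
        dst[i] = 1.0;
    }
}

// Time a kernel over a buffer much larger than the last-level cache,
// so the measurement is dominated by memory traffic.
template <typename Kernel>
static double run(Kernel kernel, std::size_t n, int reps = 20) {
    std::vector<double> buf(n);
    auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; ++r)
        kernel(buf.data(), n);
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count();
}

int main() {
    const std::size_t n = std::size_t(1) << 26;   // 512 MiB of doubles
    std::printf("regular stores: %.3f s\n", run(fill_regular, n));
    std::printf("dc zva variant: %.3f s\n", run(fill_nt, n));
}
```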
Fixes #25 (closed). Supersedes !225 (closed).
Merge request reports
Activity
assigned to @holzer
Non-temporal stores are really complicated, I think. I ran some benchmarks using a homogeneous 27-point stencil on an AMD EPYC 7452 “Rome” CPU (32 cores + SMT). It turns out that using NT stores reduces the single-core performance by more than a factor of 2. Additionally, a simple benchmark kernel consisting only of NT stores performs extremely well on that machine. It performs so well that one explanation for the poor performance of our 27-point stencil might be that the NT stores get completely out of sync with the rest of the kernel, which introduces significant overhead in our benchmark.
However, despite the poor single-core performance, using NT stores reduces the amount of data the kernel has to move, since the destination cache lines no longer need to be read from memory before being overwritten. This means the kernel might still end up with higher performance when scaling up to a full NUMA domain, and thus using NT stores would still make sense. In the end, what we want on a multicore processor is to saturate the bandwidth of a full NUMA domain, because that is as good as we can get. If a kernel with poor single-core performance can still do this while needing less memory traffic overall, it wins.
But all of this is subject to careful benchmarking and thus remains interesting on every architecture.
This means the kernel might still end up with higher performance when scaling up to a full NUMA domain, and thus using NT stores would still make sense.
Agreed. It's about 10 percent slower here on one core, and while it previously saturated between 2 and 3 cores, it no longer saturates even with 4 cores. But this is a CPU specifically designed for laptops, and I would expect HPC CPUs to behave more traditionally. As you said, it's always necessary to benchmark.
added 1 commit
- 03a12c7a - remove stream from instruction sets that don't have it
added 1 commit
- 8f92d147 - remove stream from instruction sets that don't have it
mentioned in commit 147f6901
mentioned in commit c5712bcb
mentioned in merge request !231 (merged)
mentioned in commit 589ca872
mentioned in merge request walberla/walberla!448 (merged)
mentioned in merge request !242 (merged)