Non-temporal stores do not use fences
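When vectorization is enabled, non-temporal store instructions like _mm(|256|512)_stream_p[sd] are generated. However, the corresponding fence, _mm_sfence (or the stronger _mm_mfence), is never generated. This is unlikely to cause problems in practice, as enough time will usually have passed by the time the data is next read. However, non-temporal stores are weakly ordered, so an explicit fence should be emitted after the stores to guarantee that they are visible before the data is read, e.g. by another thread.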
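For illustration, here is a minimal sketch of what a generated kernel could look like with the fence added. It assumes x86-64 with SSE2 and a 16-byte-aligned destination. The function name fill_streaming is hypothetical and not part of the code generator. The sketch uses _mm_sfence because a store fence is sufficient to order non-temporal stores; _mm_mfence would also work, at a higher cost.

```c
#include <emmintrin.h>  /* SSE2: _mm_stream_pd, _mm_set1_pd; also provides _mm_sfence */
#include <stddef.h>

/* Hypothetical kernel: fill dst[0..n) with value using non-temporal stores.
 * Assumptions: dst is 16-byte aligned and n is a multiple of 2. */
static void fill_streaming(double *dst, size_t n, double value)
{
    const __m128d v = _mm_set1_pd(value);

    for (size_t i = 0; i < n; i += 2) {
        /* Non-temporal store: bypasses the cache and is weakly ordered. */
        _mm_stream_pd(&dst[i], v);
    }

    /* Store fence: all streaming stores issued above become globally
     * visible before any store that follows the fence. */
    _mm_sfence();
}
```

A single fence after the loop should be enough; it does not need to follow every individual store.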
Edited by Michael Kuron