AES-NI vectorization improvements
!30 (merged) didn't implement an SSE-vectorized equivalent of _mm_cvtepu64_pd because the Stack Overflow solution didn't work. That turned out to be caused by a bad optimization that GCC 5+ applies in fast-math mode. None of the other compilers (Clang, Intel, MSVC) have that issue, so we just disable fast-math for that function.
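
For reviewers unfamiliar with the trick, here is a minimal sketch of an SSE2-only unsigned 64-bit to double conversion with fast-math switched off for just that function under GCC. It is an illustration, not the code in this MR: the helper name `cvtepu64_pd_sse2`, the `NO_FAST_MATH` macro, and the use of GCC's `optimize` attribute are assumptions. The result is only correct if the final subtraction and addition are evaluated in the order written, which is the kind of ordering fast-math is allowed to change; whether that is the exact miscompilation seen here is also an assumption.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Assumption: disable fast-math per function on GCC only.
 * Clang also defines __GNUC__ but does not support this attribute. */
#if defined(__GNUC__) && !defined(__clang__) && !defined(__INTEL_COMPILER)
#  define NO_FAST_MATH __attribute__((optimize("no-fast-math")))
#else
#  define NO_FAST_MATH
#endif

/* Hypothetical helper: converts two unsigned 64-bit lanes to doubles,
 * rounding once, for the full [0, 2^64) input range. */
NO_FAST_MATH
static __m128d cvtepu64_pd_sse2(__m128i x)
{
    const __m128i lo_mask   = _mm_set1_epi64x(0xFFFFFFFFLL);
    const __m128d magic_lo  = _mm_set1_pd(0x1p52);           /* 2^52 */
    const __m128d magic_hi  = _mm_set1_pd(0x1p84);           /* 2^84 */
    const __m128d magic_all = _mm_set1_pd(0x1p84 + 0x1p52);  /* 2^84 + 2^52 */

    /* Upper 32 bits placed in the mantissa of 2^84: value is 2^84 + hi * 2^32. */
    __m128i hi = _mm_or_si128(_mm_srli_epi64(x, 32),
                              _mm_castpd_si128(magic_hi));
    /* Lower 32 bits placed in the mantissa of 2^52: value is 2^52 + lo. */
    __m128i lo = _mm_or_si128(_mm_and_si128(x, lo_mask),
                              _mm_castpd_si128(magic_lo));

    /* (hi - (2^84 + 2^52)) is exact; the final add rounds exactly once.
     * Reassociating these two operations would break the result. */
    __m128d f = _mm_sub_pd(_mm_castsi128_pd(hi), magic_all);
    return _mm_add_pd(f, _mm_castsi128_pd(lo));
}
```

Because the attribute is scoped to one function, the rest of the translation unit can still be built with fast-math enabled.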
Also, we now use fused multiply-add if available.
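
As a rough sketch of what "if available" can look like, the wrapper below selects the FMA intrinsic at compile time and falls back to a separate multiply and add otherwise. The name `madd_pd` is hypothetical, and keying on the `__FMA__` macro is an assumption (GCC and Clang define it with `-mfma`; MSVC does not define it, so a different check would be needed there). The MR may instead detect FMA at runtime.

```c
#include <immintrin.h>  /* SSE2 and FMA intrinsics */

/* Hypothetical helper: computes a * b + c per lane. */
static inline __m128d madd_pd(__m128d a, __m128d b, __m128d c)
{
#if defined(__FMA__)
    /* Single rounding: the product is not rounded before the add. */
    return _mm_fmadd_pd(a, b, c);
#else
    /* Fallback: product and sum are rounded separately. */
    return _mm_add_pd(_mm_mul_pd(a, b), c);
#endif
}
```

Note that the two paths can differ in the last bit of the result, since the fused form rounds only once.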