A small follow-up to !233 (merged) after thinking about it again:
- Fix the aligned version, which was using maskStore in some instruction sets and maskStoreA in others.
- Make sure the test case is incommensurate with the vector width. Previously the test could not distinguish store from storeMask on 128-bit vector instruction sets.
- Implement a fallback for instruction sets that don't support masked stores natively. This turns out to be really easy using a load-blend-store combination.
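
For reviewers, here is a minimal, portable sketch of the load-blend-store idea and of why the test length matters. It is not the code in this MR: `Vec4f`, `Mask4`, `load`, `store`, `blend`, `maskStoreFallback` and `fillWithTail` are hypothetical names, and the 4-lane struct only stands in for a 128-bit vector register.

```cpp
#include <array>
#include <cstddef>

// Hypothetical 4-lane types standing in for a 128-bit SIMD register and mask.
struct Vec4f { std::array<float, 4> lane; };
struct Mask4 { std::array<bool, 4> lane; };

inline Vec4f load(const float* p) {
    Vec4f v{};
    for (std::size_t i = 0; i < 4; ++i) v.lane[i] = p[i];
    return v;
}

inline void store(float* p, const Vec4f& v) {
    for (std::size_t i = 0; i < 4; ++i) p[i] = v.lane[i];
}

// Per lane: take `a` where the mask is set, otherwise `b`.
inline Vec4f blend(const Mask4& m, const Vec4f& a, const Vec4f& b) {
    Vec4f r{};
    for (std::size_t i = 0; i < 4; ++i) r.lane[i] = m.lane[i] ? a.lane[i] : b.lane[i];
    return r;
}

// Fallback masked store (assumed shape): load the current memory contents,
// blend the new values in where the mask is set, and store the full vector.
inline void maskStoreFallback(float* p, const Mask4& m, const Vec4f& v) {
    const Vec4f old = load(p);
    store(p, blend(m, v, old));
}

// Fill the first n elements of buf with `value`, using full stores for whole
// vectors and one masked store for the tail. buf must have room for the last
// full vector, since the fallback reads and writes all 4 lanes.
inline void fillWithTail(float* buf, std::size_t n, float value) {
    const Vec4f v{{value, value, value, value}};
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) store(buf + i, v);
    if (i < n) {
        Mask4 m{};
        for (std::size_t k = 0; k < 4; ++k) m.lane[k] = (i + k < n);
        maskStoreFallback(buf + i, m, v);
    }
}
```

With a length of 6 and 4 lanes, the tail covers only two lanes, so a plain store used by mistake would overwrite the two elements after the tail. With a length that is a multiple of the lane count there is no tail, and the two variants cannot be told apart. Note that this kind of fallback reads and rewrites the masked-off lanes too, so the whole vector must be addressable.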