
Vector scatter/gather support

Merged. Michael Kuron requested to merge scattergather into master.

Some modern processors support scatter/gather operations in hardware, which makes it possible to vectorize even when the stride between consecutive elements is not 1. Supporting this in pystencils turned out to be surprisingly easy. On a Core i7-7820X, the D3Q19 TRT benchmark shows an appreciable performance benefit for cases with a non-ideal memory layout:

  • 15% for fzyx without assume_inner_stride_one and with split
  • 20% for fzyx without assume_inner_stride_one
  • 30% for zyxf
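To make the stride problem concrete, here is a minimal, hypothetical Python sketch (not pystencils code) of what a hardware gather/scatter does. It assumes a flat array of D3Q19 PDFs (q = 19), 8 doubles per 512-bit AVX-512 register, and layout strings written from the slowest to the fastest index. In fzyx, neighboring x cells of one PDF are adjacent (stride 1, although without assume_inner_stride_one the code generator cannot rely on that). In zyxf, they are q elements apart, so a plain contiguous vector load does not apply. The names `linear_index`, `gather` and `scatter` are illustrative only.

```python
# Hypothetical illustration of gather/scatter semantics; not pystencils code.
VECTOR_WIDTH = 8  # assumption: 8 doubles per 512-bit AVX-512 register


def linear_index(x, y, z, f, nx, ny, nz, q, layout):
    """Flat offset of PDF f at cell (x, y, z); layout lists slowest to fastest index."""
    if layout == "fzyx":  # f slowest, x fastest: neighboring x cells have stride 1
        return ((f * nz + z) * ny + y) * nx + x
    if layout == "zyxf":  # f fastest: neighboring x cells are q elements apart
        return ((z * ny + y) * nx + x) * q + f
    raise ValueError(f"unknown layout {layout!r}")


def gather(memory, base, stride, width=VECTOR_WIDTH):
    """Load memory[base + i*stride] for i in 0..width-1 into one 'vector'."""
    return [memory[base + i * stride] for i in range(width)]


def scatter(memory, base, stride, values):
    """Store values[i] to memory[base + i*stride] (the inverse of gather)."""
    for i, value in enumerate(values):
        memory[base + i * stride] = value


if __name__ == "__main__":
    nx, ny, nz, q = 8, 2, 2, 19
    memory = list(range(nx * ny * nz * q))  # each element holds its own offset
    base = linear_index(0, 0, 0, 3, nx, ny, nz, q, "zyxf")
    stride = linear_index(1, 0, 0, 3, nx, ny, nz, q, "zyxf") - base
    print(stride)                         # 19
    print(gather(memory, base, stride))   # [3, 22, 41, 60, 79, 98, 117, 136]
```

A hardware gather performs the same strided loads as the loop in `gather`, but issued as one vector instruction. It is typically slower than a contiguous load, yet it lets the rest of the kernel stay vectorized.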

AVX2 supports only gather, not scatter, so this feature requires AVX-512. Gather is reportedly slow on AVX2 processors anyway. Even on AVX-512, scatter/gather has high latency and low throughput, but it is still better than not vectorizing at all. ARM's SVE supports scatter/gather as well, so future ARM processors with SVE will benefit too, and they will probably have better hardware support with higher throughput.

Fixes #34 (closed)

Edited by Michael Kuron


Pipeline #31780 passed for a946d58e on scattergather.

Test coverage 88.13% (-0.06%) from 1 job

Merged by Markus Holzer on May 2, 2021, 1:32pm UTC.


Pipeline #31863 passed for 8f72741d on master.

Test coverage 88.05% (-0.06%) from 1 job
