
WIP: Cuda autotune

Closed Stephan Seitz requested to merge seitz/pystencils:cuda-autotune into master

This PR introduces two changes:

  • Rotate the default block size (32, 1, 1) so that the large extent lies along the fastest dimension according to the field strides: (1, 1, 32) for C layout and (32, 1, 1) for Fortran layout. This way pystencils is also fast for C layout. This rotation is always performed; a sketch of the idea follows after this list.
  • Auto-tune the block dimensions to whatever is fastest for a specific kernel on localhost. On the first kernel call, different block configurations are tried, and the kernel is henceforth called with the fastest configuration (disk-cached). This could also be interesting for OpenCL, where we do not know which launch configuration is fastest (on OpenCL the runtime can alternatively give a hint on that). A sketch of the tuning loop follows after the drawback note below.
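
The rotation in the first point can be pictured with a small sketch. This is not the code from this MR; the helper name and the exact permutation rule are assumptions, it only illustrates placing the large block extent on the axis with the smallest stride.

```python
# Hedged sketch of the block-size rotation, not the implementation in this MR.
import numpy as np


def rotate_block_to_layout(block, strides):
    """Permute `block` so its largest extent lies on the fastest (smallest-stride) axis."""
    fastest_to_slowest = np.argsort(strides)   # axes ordered by increasing stride
    extents = sorted(block, reverse=True)      # largest extent first
    rotated = [0, 0, 0]
    for extent, axis in zip(extents, fastest_to_slowest):
        rotated[axis] = extent
    return tuple(rotated)


# Fortran layout (strides grow with the axis index) keeps the default block size,
# C layout (strides shrink with the axis index) gets the rotated one:
assert rotate_block_to_layout((32, 1, 1), strides=(1, 10, 100)) == (32, 1, 1)
assert rotate_block_to_layout((32, 1, 1), strides=(100, 10, 1)) == (1, 1, 32)
```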

One drawback: the trial calls made by the autotuning only produce correct results if input and output fields do not overlap (so no in-place kernels).
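
A minimal sketch of the autotuning idea, assuming a pycuda-compiled kernel. The helper names (`_time_kernel`, `autotuned_call`), the candidate list, and the in-memory cache are illustrative stand-ins, not the code of this MR, which caches the result on disk.

```python
# Hedged sketch of per-kernel block-size autotuning, not the implementation in this MR.
import pycuda.autoinit  # noqa: F401  -- initializes a CUDA context
import pycuda.driver as cuda

_BEST_BLOCK = {}  # in-memory stand-in for the disk cache used by the MR


def _time_kernel(kernel, args, block, grid, repetitions=3):
    """Launch the kernel a few times and return the best runtime in milliseconds."""
    start, end = cuda.Event(), cuda.Event()
    best = float("inf")
    for _ in range(repetitions):
        start.record()
        kernel(*args, block=block, grid=grid)
        end.record()
        end.synchronize()
        best = min(best, start.time_till(end))
    return best


def autotuned_call(kernel, args, domain_size,
                   candidates=((256, 1, 1), (128, 2, 1), (32, 8, 1), (16, 16, 1), (8, 8, 4))):
    """On the first call try all candidate block sizes, then reuse the fastest one."""
    def grid_for(block):
        return tuple(-(-n // b) for n, b in zip(domain_size, block))  # ceil division

    key = (kernel, tuple(domain_size))
    if key not in _BEST_BLOCK:
        # The trial launches really execute the kernel -- hence the restriction
        # above that input and output fields must not overlap.
        timings = {b: _time_kernel(kernel, args, b, grid_for(b)) for b in candidates}
        _BEST_BLOCK[key] = min(timings, key=timings.get)

    block = _BEST_BLOCK[key]
    kernel(*args, block=block, grid=grid_for(block))
```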

Edited by Stephan Seitz

Merge request reports

Pipeline #20527 passed

Pipeline passed for 98291ce5 on seitz:cuda-autotune

Approval is optional

Closed by Stephan Seitz 4 years ago (Oct 7, 2020 11:04am UTC)
