WIP: CUDA autotune
This PR introduces two changes:
- Rotate the (32, 1, 1) block size onto the fastest field dimension, as determined by the field strides: (1, 1, 32) for C layout and (32, 1, 1) for Fortran layout. This way pystencils will also be fast for C layout. This rotation is always performed (see the sketch below).
- Auto-tune the block dimensions to whatever is fastest for a specific kernel on localhost. On the first kernel call, different block sizes are tried, and the kernel is henceforth called with the fastest configuration (disk-cached). This could be interesting for OpenCL, where we don't know which launch configuration is fastest (on OpenCL the runtime can alternatively give a hint on that).
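A minimal sketch of the rotation, assuming a hypothetical helper `rotate_block_size` (not the actual pystencils API; the real logic lives in `pystencils/gpucuda/indexing.py`): the wide block dimension is placed on the coordinate with the smallest stride, i.e. the axis that varies fastest in memory.

```python
import numpy as np

def rotate_block_size(strides, base_block_size=(32, 1, 1)):
    """Hypothetical helper: rotate the wide block dimension onto the
    fastest-varying field dimension (the one with the smallest stride)."""
    fastest = int(np.argmin(strides))       # axis with the smallest stride
    block = [1] * len(strides)
    block[fastest] = max(base_block_size)   # 32 threads along the fastest axis
    return tuple(block)

# C layout: last axis varies fastest -> (1, 1, 32)
assert rotate_block_size(np.zeros((16, 16, 16), order='C').strides) == (1, 1, 32)
# Fortran layout: first axis varies fastest -> (32, 1, 1)
assert rotate_block_size(np.zeros((16, 16, 16), order='F').strides) == (32, 1, 1)
```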
One drawback: the trial calls are only correct if input and output fields do not overlap (so no in-place kernels).
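The tuning step could look roughly like the following sketch; `autotune_block_size`, the candidate list, and the cache location are all assumptions for illustration, not the PR's actual implementation. Real GPU timing would additionally need device synchronization around each trial launch, which is omitted here. The trial launches overwrite the output fields, which is why overlapping input/output fields break the measurement, as noted above.

```python
import json
import os
import time

# Hypothetical on-disk cache location for tuned configurations
_CACHE_FILE = os.path.expanduser('~/.pystencils_autotune.json')

def autotune_block_size(run_kernel, kernel_name, candidates=None):
    """Hypothetical sketch: time each candidate block size once and
    persist the fastest one so later calls skip the tuning step.

    run_kernel: callable that launches the kernel with a given block size
    (in a real GPU setting it must synchronize before returning, so the
    wall-clock measurement is meaningful).
    """
    candidates = candidates or [(32, 1, 1), (1, 1, 32), (16, 16, 1), (8, 8, 4)]

    cache = {}
    if os.path.exists(_CACHE_FILE):
        with open(_CACHE_FILE) as f:
            cache = json.load(f)
    if kernel_name in cache:                 # already tuned on a previous run
        return tuple(cache[kernel_name])

    best, best_time = None, float('inf')
    for block in candidates:
        start = time.perf_counter()
        run_kernel(block)                    # trial call: overwrites output fields!
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best, best_time = block, elapsed

    cache[kernel_name] = list(best)
    with open(_CACHE_FILE, 'w') as f:        # disk-cache the winner
        json.dump(cache, f)
    return best
```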
Activity
Ok, I forgot that this function already exists: https://i10git.cs.fau.de/seitz/pystencils/blob/e25c266c97247f38d1bd3d2146dd230073a72018/pystencils/gpucuda/indexing.py#L219-232
added 1 commit
- 64e0dc69 - Also add AUTOTUNE_BLOCK_SIZES to LineIndexing
added 1 commit
- d04b4789 - Add 'cuda' compiler config (with preferred_block_size and always_autotune)
added 1 commit
- ba7b20ac - Add 'cuda' compiler config (with preferred_block_size and always_autotune)
added 1 commit
- 98291ce5 - Change block_size -> block_and_thread_numbers