(contribution_guide)=
# Contribution Guide
Welcome to the Contributor's Guide to pystencils!
If you are developing pystencils, we recommend you perform an editable install of your
local clone of the repository, with all optional features:
```bash
pip install -e pystencils[alltrafos,interactive,use_cython,doc,tests]
pip install -e pystencils[alltrafos,interactive,use_cython,doc,testsuite]
```
This includes the additional feature groups `doc`, which contains all dependencies required
to build this documentation, and `testsuite`, which adds `flake8` for code style checking,
`mypy` for static type checking, and `pytest` plus plugins for running the test suite.
For more information on developing pystencils, see the [](#contribution_guide).
### For Nvidia GPUs
Pystencils offers code generation for Nvidia GPUs using the CUDA programming model,
as well as just-in-time compilation and execution of CUDA kernels from within Python
based on the [cupy] library.
This section's objective is to give a detailed introduction to the creation of
GPU kernels with pystencils.
For allocating and managing data on the GPU, pystencils relies on [CuPy][cupy],
which operates much in the same way that [NumPy][numpy] works on CPU arrays.
CuPy and NumPy expose nearly the same APIs for array operations;
the difference is that CuPy allocates all its arrays on the GPU
and performs its operations as CUDA kernels.
Also, CuPy exposes a just-in-time compiler for GPU kernels, which internally calls [nvrtc].
In pystencils, we use CuPy both to compile and provide executable kernels on-demand from within Python code,
and to allocate and manage the data these kernels can be executed on.
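The examples below assume that a small kernel and two GPU arrays have been set up.
As a minimal sketch (the field layout, the update rule, and the array shapes here are
illustrative assumptions, not taken from the original guide):
```{code-cell} ipython3
import cupy as cp
import pystencils as ps

# Illustrative example kernel (assumed): compute g <- 2 * f on a 2D domain
f, g = ps.fields("f, g: float64[2D]")
update = ps.Assignment(g.center(), 2 * f.center())

cfg = ps.CreateKernelConfig()
cfg.target = ps.Target.CUDA
kernel = ps.create_kernel(update, cfg)

# Allocate the fields' backing arrays on the GPU via CuPy
f_arr = cp.random.random((32, 32))
g_arr = cp.zeros_like(f_arr)
```
Compiling the kernel and invoking it on the CuPy arrays then looks like this: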
```{code-cell} ipython3
kfunc = kernel.compile()
kfunc(f=f_arr, g=g_arr)
```
(indexing_and_launch_config)=
## Modify the Indexing Scheme and Launch Configuration
There are two key elements to how the work items of a GPU kernel's iteration space
are mapped onto a GPU launch grid:
- The *indexing scheme* defines the relation between thread indices and iteration space points;
it can be modified through the {any}`gpu.indexing_scheme <GpuOptions.indexing_scheme>` option
and is fixed for the entire kernel.
- The *launch configuration* defines the number of threads per block, and the number of blocks on the grid,
with which the kernel should be launched.
It mostly depends on the size of the arrays passed to the kernel,
but parts of it may also be modified.
The launch configuration may change at each kernel invocation.
(linear3d)=
### The Default "Linear3D" Indexing Scheme
By default, *pystencils* will employ a 1:1 mapping between threads and iteration space points
via the global thread indices inside the launch grid, e.g.:
```{code-block} C++
ctr_0 = start_0 + step_0 * (blockDim.x * blockIdx.x + threadIdx.x);
ctr_1 = start_1 + step_1 * (blockDim.y * blockIdx.y + threadIdx.y);
ctr_2 = start_2 + step_2 * (blockDim.z * blockIdx.z + threadIdx.z);
```
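Since this is plain integer arithmetic, the mapping can be emulated in a few lines of
Python; the following sketch uses arbitrary illustrative index values and is not
generated code:
```{code-cell} ipython3
# Emulate the Linear3D mapping for a single thread (illustrative values)
start, step = (0, 0, 0), (1, 1, 1)
block_dim = (256, 2, 1)               # threads per block
block_idx, thread_idx = (1, 0, 0), (3, 0, 0)

ctr = tuple(
    s + st * (bd * bi + ti)
    for s, st, bd, bi, ti in zip(start, step, block_dim, block_idx, thread_idx)
)
ctr  # -> (259, 0, 0): this thread handles point 259 along the x-axis
```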
For most kernels with an at most three-dimensional iteration space,
this behavior is sufficient and desired.
It can be enforced by setting `gpu.indexing_scheme = "Linear3D"`.
If the `Linear3D` indexing scheme is used, you may modify the GPU thread block size in two places.
The default block size for the kernel can be set via the {any}`gpu.block_size <GpuOptions.block_size>`
code generator option;
if none is specified, a default depending on the iteration space's dimensionality will be used.
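For example, both settings can be specified on the code generator configuration;
the block size used here is an assumed illustrative value:
```{code-cell} ipython3
cfg = ps.CreateKernelConfig()
cfg.target = ps.Target.CUDA
cfg.gpu.indexing_scheme = "Linear3D"  # the default scheme
cfg.gpu.block_size = (128, 2, 2)      # illustrative default block size

kernel = ps.create_kernel(update, cfg)
```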
The block size can furthermore be modified at the compiled kernel's wrapper object via the
`launch_config.block_size` attribute:
```{code-cell} ipython3
kfunc = kernel.compile()
kfunc.launch_config.block_size = (256, 2, 1)
# Run the kernel
kfunc(f=f_arr, g=g_arr)
```
For most kernels, setting only the `block_size` is sufficient since pystencils will
automatically compute the number of blocks;
for exceptions to this, see [](#manual_launch_grids).
In any case, pystencils will automatically compute the grid size from the shapes of the kernel's array arguments
and the given thread block size.
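The rule behind this automatic computation can be sketched in plain Python;
this mirrors the general ceiling-division pattern, not pystencils internals,
and the extents are assumed example values:
```{code-cell} ipython3
import math

# Assumed iteration space extent and thread block size
shape = (32, 32, 32)
block_size = (256, 2, 1)

# One block per ceil(extent / threads) along each axis, so the grid
# covers the entire iteration space
grid_size = tuple(
    math.ceil(extent / threads) for extent, threads in zip(shape, block_size)
)
grid_size  # -> (1, 16, 32)
```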
(manual_launch_grids)=
### Manual Launch Grids and Non-Cuboid Iteration Patterns
By default, the above indexing schemes will automatically compute the GPU launch configuration
from array shapes and optional user input.
However, it is also possible to override this behavior and instead specify a launch grid manually.
This will even be unavoidable if the code generator cannot precompute the number of points
in the kernel's iteration space.
To specify a manual launch configuration, set the {any}`gpu.manual_launch_grid <GpuOptions.manual_launch_grid>`
option to `True`.
Then, after compiling the kernel, set its block and grid size via the `launch_config` property:
```{code-cell} ipython3
cfg.gpu.manual_launch_grid = True
kernel = ps.create_kernel(update, cfg)
kfunc = kernel.compile()
kfunc.launch_config.block_size = (64, 2, 1)
kfunc.launch_config.grid_size = (4, 2, 1)
```
An example where this is necessary is the triangular iteration
previously described in the [Kernel Creation Guide](#example_triangular_iteration).
Let's set it up once more:
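The setup itself is elided in this excerpt; the following cell is a minimal
reconstruction in which the field size, the counter symbol, and the `_draw_ispace`
helper are assumptions made for illustration:
```{code-cell} ipython3
import matplotlib.pyplot as plt

# Reconstructed setup (assumed): write 1 to every point of a
# triangular iteration space over a 16x16 field
f = ps.fields("f: float64[2D]")
assignments = [ps.Assignment(f.center(), 1)]

# Counter symbol for the row index; the iteration slice below uses it
# to make each row's column range start at the row number
y = ps.DEFAULTS.spatial_counters[0]

f_arr = cp.zeros((16, 16))

def _draw_ispace(arr):
    # Visualize which points of the iteration space were written
    plt.imshow(arr, origin="upper")
    plt.show()
```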
```{code-cell} ipython3
cfg = ps.CreateKernelConfig()
cfg.target = ps.Target.CUDA
cfg.iteration_slice = ps.make_slice[:, y:]
```
For this kernel, the code generator cannot figure out a launch configuration on its own,
so we need to manually provide one:
```{code-cell} ipython3
cfg.gpu.manual_launch_grid = True
kernel = ps.create_kernel(assignments, cfg).compile()
```
Now, to execute our kernel, we have to manually specify its launch grid:
```{code-cell} ipython3
kernel.launch_config.block_size = (8, 8)
kernel.launch_config.grid_size = (2, 2)
```
```{code-cell} ipython3
kernel(f=f_arr)
_draw_ispace(cp.asnumpy(f_arr))
```
We can also observe the effect of decreasing the launch grid size.
```{code-cell} ipython3
kernel.launch_config.block_size = (4, 4)
```
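To observe it, we run the kernel and draw the result once more:
```{code-cell} ipython3
kernel(f=f_arr)
_draw_ispace(cp.asnumpy(f_arr))
```
Now, only a part of the triangle is being processed.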
[cupy]: https://cupy.dev "CuPy Homepage"
[numpy]: https://numpy.org "NumPy Homepage"
[nvrtc]: https://docs.nvidia.com/cuda/nvrtc/index.html "NVIDIA Runtime Compilation Library"
[cupy-docs]: https://docs.cupy.dev/en/stable/overview.html "CuPy Documentation"