pycodegen / pystencils / Commits

Commit 9ba7bd25, authored 3 months ago by Frederik Hennig

update GPU user guide; fix some other details in docs

Parent: c306be59
Merge request: !449 "GPU Indexing Schemes and Launch Configurations"
Pipeline #74081 passed, 3 months ago (stages: Code Quality, Unit Tests, legacy_test, docs)

Showing 3 changed files
docs/source/contributing/index.md: 1 addition, 0 deletions
docs/source/installation.md: 3 additions, 1 deletion
docs/source/user_manual/gpu_kernels.md: 70 additions, 31 deletions
with 74 additions and 32 deletions
docs/source/contributing/index.md (+1 −0)

(contribution_guide)=
# Contribution Guide

Welcome to the Contributor's Guide to pystencils!

...
docs/source/installation.md (+3 −1)

...
@@ -38,12 +38,14 @@ The following feature sets are available:

If you are developing pystencils, we recommend you perform an editable install of your
local clone of the repository, with all optional features:

```bash
-pip install -e pystencils[alltrafos,interactive,use_cython,doc,tests]
+pip install -e pystencils[alltrafos,interactive,use_cython,doc,testsuite]
```
This includes the additional feature groups `doc`, which contains all dependencies required
to build this documentation, and `tests`, which adds `flake8` for code style checking,
`mypy` for static type checking, and `pytest` plus plugins for running the test suite.
For more information on developing pystencils, see the [](#contribution_guide).
:::

### For Nvidia GPUs

...
docs/source/user_manual/gpu_kernels.md (+70 −31)

...
@@ -28,7 +28,7 @@ import matplotlib.pyplot as plt

Pystencils offers code generation for Nvidia GPUs using the CUDA programming model,
as well as just-in-time compilation and execution of CUDA kernels from within Python
-based on the [cupy] library. w
+based on the [cupy] library.
This section's objective is to give a detailed introduction into the creation of
GPU kernels with pystencils.

...
@@ -62,7 +62,7 @@ which operates much in the same way that [NumPy][numpy] works on CPU arrays.

Cupy and NumPy expose nearly the same APIs for array operations;
the difference being that CuPy allocates all its arrays on the GPU
and performs its operations as CUDA kernels.
-Also, CuPy exposes a just-in-time-compiler for GPU kernels, which internally calls [nvcc].
+Also, CuPy exposes a just-in-time-compiler for GPU kernels, which internally calls [nvrtc].
In pystencils, we use CuPy both to compile and provide executable kernels on-demand from within Python code,
and to allocate and manage the data these kernels can be executed on.

...
@@ -81,35 +81,79 @@ kfunc = kernel.compile()

kfunc(f=f_arr, g=g_arr)
```
-### Modifying the Launch Configuration
+(indexing_and_launch_config)=
+## Modify the Indexing Scheme and Launch Configuration
+
+There are two key elements to how the work items of a GPU kernel's iteration space
+are mapped onto a GPU launch grid:
+
+ - The *indexing scheme* defines the relation between thread indices and iteration space points;
+   it can be modified through the {any}`gpu.indexing_scheme <GpuOptions.indexing_scheme>` option
+   and is fixed for the entire kernel.
+ - The *launch configuration* defines the number of threads per block, and the number of blocks on the grid,
+   with which the kernel should be launched.
+   The launch configuration mostly depends on the size of the arrays passed to the kernel,
+   but parts of it may also be modified.
+   The launch configuration may change at each kernel invocation.
+(linear3d)=
+### The Default "Linear3D" Indexing Scheme
+
+By default, *pystencils* will employ a 1:1-mapping between threads and iteration space points
+via the global thread indices inside the launch grid; e.g.
+
+```{code-block} C++
+ctr_0 = start_0 + step_0 * (blockSize.x * blockIdx.x + threadIdx.x);
+ctr_1 = start_1 + step_1 * (blockSize.y * blockIdx.y + threadIdx.y);
+ctr_2 = start_2 + step_2 * (blockSize.z * blockIdx.z + threadIdx.z);
+```
+
+For most kernels with an at most three-dimensional iteration space,
+this behavior is sufficient and desired.
+It can be enforced by setting `gpu.indexing_scheme = "Linear3D"`.
+
+If the `Linear3D` indexing scheme is used, you may modify the GPU thread block size in two places.
+The default block size for the kernel can be set via the
+{any}`gpu.block_size <GpuOptions.block_size>` code generator option;
+if none is specified, a default depending on the iteration space's dimensionality will be used.
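The `Linear3D` mapping shown above can be emulated in plain Python. A minimal sketch (editor's illustration; `linear3d_point` is a hypothetical helper, not part of the pystencils API):

```python
def linear3d_point(start, step, block_size, block_idx, thread_idx):
    """Compute the iteration-space point (ctr_0, ctr_1, ctr_2) handled by one thread.

    Mirrors the generated code:
        ctr_i = start_i + step_i * (blockSize * blockIdx + threadIdx)
    per dimension (x, y, z).
    """
    return tuple(
        s + st * (bs * bi + ti)
        for s, st, bs, bi, ti in zip(start, step, block_size, block_idx, thread_idx)
    )

# Thread threadIdx=(3, 0, 0) of block blockIdx=(2, 1, 0),
# with 128x2x1 thread blocks, zero offsets and unit steps:
print(linear3d_point((0, 0, 0), (1, 1, 1), (128, 2, 1), (2, 1, 0), (3, 0, 0)))
# -> (259, 2, 0)
```

This makes the 1:1 character of the scheme explicit: every (block, thread) pair addresses exactly one iteration-space point.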
-The `kernel.compile()` invocation in the above code produces a {any}`CupyKernelWrapper` callable object.
-This object holds the kernel's launch grid configuration
-(i.e. the number of thread blocks, and the number of threads per block.)
-Pystencils specifies a default value for the block size and if possible,
-the number of blocks is automatically inferred in order to cover the entire iteration space.
-In addition, the wrapper's interface allows us to customize the GPU launch grid,
-by manually setting both the number of threads per block, and the number of blocks on the grid:
+The block size can furthermore be modified at the compiled kernel's wrapper object
+via the `launch_config.block_size` attribute:
```{code-cell} ipython3
-kfunc.block_size = (16, 8, 8)
-kfunc.num_blocks = (1, 2, 2)
+kfunc = kernel.compile()
+kfunc.launch_config.block_size = (256, 2, 1)

# Run the kernel
kfunc(f=f_arr, g=g_arr)
```
+For most kernels, setting only the `block_size` is sufficient since pystencils will
+automatically compute the number of blocks;
+for exceptions to this, see [](#manual_launch_grids).
-If `num_blocks` is set manually and the launch grid thus specified is too small, only
-a part of the iteration space will be traversed by the kernel;
-similarly, if it is too large, it will cause any threads working outside of the iteration bounds to idle.
-In any case, pystencils will automatically compute the grid size from the shapes of the kernel's array arguments
-and the given thread block size.
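Computing the grid size from the iteration space extent and a given block size boils down to a ceiling division per dimension. A plain-Python sketch (editor's illustration only, not the actual pystencils implementation):

```python
def infer_grid_size(ispace_shape, block_size):
    """Number of blocks per grid dimension needed to cover the iteration space.

    Per dimension: ceil(extent / block_size), written here via negated
    floor division. (Illustrative sketch, not pystencils internals.)
    """
    return tuple(-(-extent // bs) for extent, bs in zip(ispace_shape, block_size))

# A 1022x1022x1 interior iteration space with 256x2x1 thread blocks:
print(infer_grid_size((1022, 1022, 1), (256, 2, 1)))  # -> (4, 511, 1)
```

The last block in each dimension may then contain threads outside the iteration bounds, which the generated kernel must guard against.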
(manual_launch_grids)=
### Manual Launch Grids and Non-Cuboid Iteration Patterns

-In some cases, it will be unavoidable to set the launch grid size manually;
-especially if the code generator is unable to automatically determine the size of the
-iteration space.
-An example for this is the triangular iteration previously described in the
-[Kernel Creation Guide](#example_triangular_iteration).
+By default, the above indexing schemes will automatically compute the GPU launch configuration
+from array shapes and optional user input.
+However, it is also possible to override this behavior and instead specify a launch grid manually.
+This will even be unavoidable if the code generator cannot precompute the number of points
+in the kernel's iteration space.
+
+To specify a manual launch configuration, set the
+{any}`gpu.manual_launch_grid <GpuOptions.manual_launch_grid>` option to `True`.
+Then, after compiling the kernel, set its block and grid size via the `launch_config` property:
+
+```{code-cell} ipython3
+cfg.gpu.manual_launch_grid = True
+
+kernel = ps.create_kernel(update, cfg)
+kfunc = kernel.compile()
+kfunc.launch_config.block_size = (64, 2, 1)
+kfunc.launch_config.grid_size = (4, 2, 1)
+```
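With a manual launch configuration, it is the user's responsibility that block and grid size together cover the iteration space. A small sanity-check sketch (editor's illustration; `covers` is a hypothetical helper, not part of the pystencils API):

```python
def covers(ispace_shape, block_size, grid_size):
    """Check that block_size * grid_size threads per dimension suffice
    to cover the iteration space (hypothetical helper for illustration)."""
    return all(bs * gs >= extent
               for extent, bs, gs in zip(ispace_shape, block_size, grid_size))

# 64x2x1 threads per block on a 2x2x1 grid cover a 128x4x1 space:
print(covers((128, 4, 1), (64, 2, 1), (2, 2, 1)))  # -> True
# Halving the grid in x leaves half the space untraversed:
print(covers((128, 4, 1), (64, 2, 1), (1, 2, 1)))  # -> False
```

An undersized grid silently traverses only part of the iteration space, so a check of this kind is cheap insurance.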
+An example where this is necessary is the triangular iteration
+previously described in the [Kernel Creation Guide](#example_triangular_iteration).
+Let's set it up once more:
```{code-cell} ipython3
...
...
@@ -149,19 +193,14 @@ cfg.target= ps.Target.CUDA
cfg.iteration_slice = ps.make_slice[:, y:]
```
-In this case, it is necessary to set the `gpu.manual_launch_grid` option to `True`;
-otherwise, code generation will fail as the code generator cannot figure out
-a GPU grid size on its own:
+For this kernel, the code generator cannot figure out a launch configuration on its own,
+so we need to manually provide one:
```{code-cell} ipython3
cfg.gpu.manual_launch_grid = True

kernel = ps.create_kernel(assignments, cfg).compile()
```

-Now, to execute our kernel, we have to manually specify its launch grid:

```{code-cell} ipython3
kernel.launch_config.block_size = (8, 8)
kernel.launch_config.grid_size = (2, 2)
```
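With 8×8 blocks on a 2×2 grid, at most 16×16 threads are launched. One can count how many points of a triangular iteration space such a grid actually reaches (plain-Python sketch; the lower-triangular shape is an assumption for illustration):

```python
def triangle_points_covered(n, block_size, grid_size):
    """Count points of a lower-triangular n x n iteration space
    {(x, y) : y <= x} that fall inside the launched thread grid.
    (Illustrative sketch under the assumed triangle shape.)
    """
    max_x = block_size[0] * grid_size[0]  # threads launched in x
    max_y = block_size[1] * grid_size[1]  # threads launched in y
    return sum(1 for x in range(min(n, max_x))
                 for y in range(min(n, max_y)) if y <= x)

# The full triangle over a 32x32 domain has 32 * 33 / 2 = 528 points;
# an 8x8 block on a 2x2 grid reaches only the 16x16 corner:
print(triangle_points_covered(32, (8, 8), (2, 2)))  # -> 136
```

Under these assumptions, only 136 of the 528 triangle points fall inside the 16×16 thread grid, which is why a too-small grid leaves most of the triangle unprocessed.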
...
...
@@ -175,7 +214,7 @@ kernel(f=f_arr)
_draw_ispace(cp.asnumpy(f_arr))
```
-We can also observe the effect of decreasing the launch grid size:
+We can also observe the effect of decreasing the launch grid size.
```{code-cell} ipython3
kernel.launch_config.block_size = (4, 4)
...
...
@@ -203,5 +242,5 @@ only a part of the triangle is being processed.
[cupy]: https://cupy.dev "CuPy Homepage"
[numpy]: https://numpy.org "NumPy Homepage"
-[nvcc]: https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html "NVIDIA CUDA Compiler Driver"
+[nvrtc]: https://docs.nvidia.com/cuda/nvrtc/index.html "NVIDIA Runtime Compilation Library"
[cupy-docs]: https://docs.cupy.dev/en/stable/overview.html "CuPy Documentation"