Commit 2116e907 authored by Frederik Hennig

Add design document on CUDA codegen to the docs

parent 6b3f5288
Merge request !449: GPU Indexing Schemes and Launch Configurations
Pipeline #74195 passed
@@ -337,6 +337,7 @@ build-documentation:
   artifacts:
     paths:
       - docs/build/html
+    when: always

 pages:
# GPU Code Generation
The code generation infrastructure for Nvidia and AMD GPUs using CUDA and HIP comprises the following components:
- The {any}`CudaPlatform` at `backend.platforms`, which performs materialization of a kernel's iteration
  space by mapping GPU block and thread indices to iteration space points. To perform this task,
  it depends on a {any}`ThreadMapping` instance which defines the nature of that mapping.
  The platform also takes care of lowering mathematical functions to their CUDA runtime library implementations.
- The `GpuIndexing` helper class, which ties these components together in the code generation driver.
  It provides both the {any}`ThreadMapping` for the codegen backend and the launch configuration
  for the runtime system.
:::{attention}
Code generation for HIP through the `CudaPlatform` is experimental and currently untested.
:::
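
For orientation, here is a minimal end-to-end sketch of how these components surface in user code,
assuming the CuPy-based JIT workflow described below. The option name `Target.CUDA` and the
`compile()` call are assumptions about the driver API and may differ in detail:

```python
import pystencils as ps

# A 2D Jacobi-style update, compiled for an Nvidia GPU.
src, dst = ps.fields("src, dst: float64[2D]")
update = ps.Assignment(
    dst[0, 0], (src[1, 0] + src[-1, 0] + src[0, 1] + src[0, -1]) / 4
)

# `Target.CUDA` is assumed to select the CudaPlatform in the backend.
cfg = ps.CreateKernelConfig(target=ps.Target.CUDA)
kernel = ps.create_kernel(update, cfg)  # a GpuKernel carrying a launch config factory

# JIT compilation via cupy; the wrapper instantiates the launch configuration.
kfunc = kernel.compile()
```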
## The CUDA Platform and Thread Mappings
```{eval-rst}
.. module:: pystencils.backend.platforms.cuda
.. autosummary::
:toctree: generated
:nosignatures:
:template: autosummary/entire_class.rst
ThreadMapping
Linear3DMapping
Blockwise4DMapping
```
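
As the signature of {any}`ThreadMapping.__call__` shows, a thread mapping is a callable that takes
the kernel's iteration space and returns a dictionary assigning an index expression to each dimension
counter. The following sketch of a hypothetical custom mapping illustrates the interface; the
`BLOCK_IDX`, `BLOCK_DIM`, and `THREAD_IDX` constants are assumed per-dimension expression lists
analogous to the module's `GRID_DIM` list:

```python
from pystencils.backend.platforms.cuda import (
    ThreadMapping,
    BLOCK_IDX,    # assumed module-level constants: per-dimension
    BLOCK_DIM,    # index expressions, analogous to GRID_DIM
    THREAD_IDX,
)
from pystencils.backend.kernelcreation import FullIterationSpace


class LinearXMapping(ThreadMapping):
    """Hypothetical mapping: parallelize only the first dimension,
    one thread per iteration space point along that axis."""

    def __call__(self, ispace):
        if not isinstance(ispace, FullIterationSpace):
            raise ValueError("this sketch handles only full iteration spaces")

        dim = ispace.dimensions[0]
        # Global thread index: blockIdx.x * blockDim.x + threadIdx.x
        global_idx = BLOCK_IDX[0] * BLOCK_DIM[0] + THREAD_IDX[0]
        # Map it onto the iteration space: start + step * index
        return {dim.counter: dim.start + dim.step * global_idx}
```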
## Thread Indexing In The Driver
With regard to GPU thread indexing, the code generation driver has two tasks:
it must provide the `CudaPlatform` object with a valid thread mapping,
and it must provide the runtime system with a [launch configuration](#gpu_launch_config)
which defines the shape of the GPU block grid.
Both of these are produced by the {any}`GpuIndexing` class.
It is instantiated with the GPU indexing scheme and indexing options given by the user.
At this time, the backend and code generation driver support two indexing schemes:
"Linear3D" (see {any}`Linear3DMapping`) and "Blockwise4D" (see {any}`Blockwise4DMapping`).
These largely reimplement the `"block"` and `"line"` indexing options of pystencils 1.3.x.
The GPU indexing system may be extended in the future.
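
Inside the driver, the flow is roughly the following: a `GpuIndexing` instance is created from the
user's options, its thread mapping is handed to the `CudaPlatform`, and its launch configuration
factory is attached to the resulting `GpuKernel`. A condensed sketch, in which the import location
and member names of `GpuIndexingScheme` are assumptions:

```python
from pystencils.backend.kernelcreation import KernelCreationContext
from pystencils.backend.platforms import CudaPlatform
from pystencils.codegen.gpu_indexing import GpuIndexing
from pystencils.codegen.config import GpuIndexingScheme  # import path assumed

ctx = KernelCreationContext()

gpu_indexing = GpuIndexing(
    ctx,
    scheme=GpuIndexingScheme.Linear3D,  # member name assumed
    default_block_size=(128, 2, 2),
)

# Backend side: the platform materializes iteration spaces
# through the thread mapping.
platform = CudaPlatform(ctx, thread_mapping=gpu_indexing.get_thread_mapping())

# Runtime side: the factory travels with the GpuKernel and is
# invoked once per invocation site.
launch_config_factory = gpu_indexing.get_launch_config_factory()
```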
```{eval-rst}
.. module:: pystencils.codegen.gpu_indexing
.. autosummary::
:toctree: generated
:nosignatures:
:template: autosummary/entire_class.rst
GpuIndexing
```
(gpu_launch_config)=
## The Launch Configuration
The launch configuration is attached to the `GpuKernel` and thus returned to the runtime system.
Since a concrete launch configuration is specific not to the kernel itself but to the kernel's
invocation site, the code generator only attaches a *factory function* for launch configurations
to `GpuKernel`. It is up to the runtime system to locally instantiate and configure a launch configuration.
To determine the actual launch grid, the launch configuration must be evaluated at the kernel's call site
by passing the required parameters to `GpuLaunchConfiguration.evaluate`.
The {any}`CupyJit`, for instance, will create the launch configuration object while preparing the JIT-compiled
kernel wrapper object. There, the launch config is exposed to the user, who may modify some of its properties.
Which properties are modifiable depends on the type of the launch configuration:
while the `AutomaticLaunchConfiguration` permits no modification and computes grid and block size directly
from kernel parameters, the `ManualLaunchConfiguration` requires the user to manually specify both grid
and block size.
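
Continuing the CuPy sketch from above, runtime-side usage might look as follows; the
`launch_config` attribute and the property names on it are assumptions about the wrapper object:

```python
# Hypothetical runtime-side sketch; attribute and property
# names on the kernel wrapper are assumptions.
kfunc = kernel.compile()     # CupyJit builds the wrapper and its launch config
lc = kfunc.launch_config     # assumed attribute exposing the launch config

# With a ManualLaunchConfiguration, both sizes must be set explicitly
# before the kernel can be launched:
lc.block_size = (128, 1, 1)
lc.grid_size = (16, 16, 1)

# On each call, the wrapper obtains the concrete launch grid by passing
# the kernel's runtime parameters to GpuLaunchConfiguration.evaluate.
```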
The `evaluate` method can only be used from within a Python runtime environment.
When exporting pystencils CUDA kernels for external use in C++ projects,
equivalent C++ code evaluating the launch config must be generated.
This is the task of, e.g., [pystencils-sfg](https://pycodegen.pages.i10git.cs.fau.de/pystencils-sfg/).
```{eval-rst}
.. autosummary::
:toctree: generated
:nosignatures:
:template: autosummary/entire_class.rst
GpuLaunchConfiguration
AutomaticLaunchConfiguration
ManualLaunchConfiguration
DynamicBlockSizeLaunchConfiguration
```
@@ -17,6 +17,7 @@ who wish to customize or extend the behaviour of the code generator in their app
    translation
    platforms
    transformations
+   gpu_codegen
    errors
    extensions
# Platforms
All target-specific code generation in the pystencils backend is facilitated
through the *platform classes*.
This includes:
- Materialization of the iteration space, meaning the mapping of iteration space points to some indexing structure
- Lowering of mathematical functions to their implementation in some runtime environment
- Selection of vector intrinsics for SIMD-capable CPU targets
Encapsulation of hardware- and environment-specific details into platform objects allows
us to implement most of the code generator in a generic and hardware-agnostic way.
It also makes it easier to extend pystencils with support for additional code generation
targets in the future.
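
To illustrate the shape of this interface, here is a sketch of a hypothetical platform for a new
target. Only the two responsibilities discussed above are spelled out; the real {any}`Platform`
base class declares further abstract methods (e.g. for lowering functions), so this class is
deliberately incomplete:

```python
from pystencils.backend.platforms import Platform
from pystencils.backend.ast.structural import PsBlock
from pystencils.backend.kernelcreation import IterationSpace  # import path assumed


class MyAcceleratorPlatform(Platform):
    """Hypothetical platform sketch; the real Platform ABC
    declares further abstract methods not shown here."""

    @property
    def required_headers(self) -> set[str]:
        # Headers to be included wherever a kernel for this target is defined.
        return {'"my_accelerator.h"'}

    def materialize_iteration_space(
        self, body: PsBlock, ispace: IterationSpace
    ) -> PsBlock:
        # Embed the kernel body into this target's indexing structure
        # (a loop nest, hardware thread indices, ...).
        raise NotImplementedError("omitted in this sketch")
```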
## Base Classes
```{eval-rst}
.. module:: pystencils.backend.platforms
.. autosummary::
:toctree: generated
:nosignatures:
:template: autosummary/entire_class.rst
Platform
GenericCpu
GenericVectorCpu
GenericGpu
```
## CPU Platforms
```{eval-rst}
.. autosummary::
:toctree: generated
:nosignatures:
:template: autosummary/entire_class.rst
X86VectorCpu
X86VectorArch
```
## GPU Platforms
```{eval-rst}
.. autosummary::
:toctree: generated
:nosignatures:
:template: autosummary/entire_class.rst
CudaPlatform
SyclPlatform
```
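
As a usage sketch, the code generation driver constructs the CUDA platform roughly as follows,
with `ctx` being the active `KernelCreationContext` (constructor arguments as in the
`CudaPlatform` documentation):

```python
from pystencils.backend.platforms import CudaPlatform
from pystencils.backend.platforms.cuda import Linear3DMapping

platform = CudaPlatform(
    ctx,                               # the active KernelCreationContext
    omit_range_check=False,            # keep the out-of-range guard in generated code
    thread_mapping=Linear3DMapping(),  # 3D globally linearized indexing
)

# The platform then embeds the kernel body into the GPU indexing structure:
# body = platform.materialize_iteration_space(body, ispace)
```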
-*********
-Platforms
-*********
-.. automodule:: pystencils.backend.platforms
-    :members:
\ No newline at end of file
@@ -46,7 +46,7 @@ GRID_DIM = [
 ]


-class ThreadToIndexMapping(ABC):
+class ThreadMapping(ABC):

     @abstractmethod
     def __call__(self, ispace: IterationSpace) -> dict[PsSymbol, PsExpression]:
@@ -57,7 +57,7 @@ class ThreadToIndexMapping(ABC):
         """


-class Linear3DMapping(ThreadToIndexMapping):
+class Linear3DMapping(ThreadMapping):
     """3D globally linearized mapping, where each thread is assigned a work item according to
     its location in the global launch grid."""
@@ -109,7 +109,7 @@ class Linear3DMapping(ThreadToIndexMapping):
         return block_idx * block_size + thread_idx


-class Blockwise4DMapping(ThreadToIndexMapping):
+class Blockwise4DMapping(ThreadMapping):
     """Blockwise index mapping for up to 4D iteration spaces, where the outer three dimensions
     are mapped to block indices."""
@@ -162,13 +162,20 @@ class Blockwise4DMapping(ThreadToIndexMapping):

 class CudaPlatform(GenericGpu):
-    """Platform for CUDA-based GPUs."""
+    """Platform for CUDA-based GPUs.
+
+    Args:
+        ctx: The kernel creation context
+        omit_range_check: If `True`, generated index translation code will not check if the point identified
+            by block and thread indices is actually contained in the iteration space
+        thread_mapping: Callback object which defines the mapping of thread indices onto iteration space points
+    """

     def __init__(
         self,
         ctx: KernelCreationContext,
         omit_range_check: bool = False,
-        thread_mapping: ThreadToIndexMapping | None = None,
+        thread_mapping: ThreadMapping | None = None,
     ) -> None:
         super().__init__(ctx)
 from __future__ import annotations
-from abc import abstractmethod

-from ..ast.structural import PsBlock
-from ..kernelcreation.iteration_space import IterationSpace
 from .platform import Platform


 class GenericGpu(Platform):
-    @abstractmethod
-    def materialize_iteration_space(
-        self, body: PsBlock, ispace: IterationSpace
-    ) -> PsBlock:
-        pass
+    """Base class for GPU platforms."""
@@ -11,9 +11,9 @@ class Platform(ABC):
     """Abstract base class for all supported platforms.

     The platform performs all target-dependent tasks during code generation:

-    - Translation of the iteration space to an index source (loop nest, GPU indexing, ...)
-    - Platform-specific optimizations (e.g. vectorization, OpenMP)
+      - Translation of the iteration space to an index source (loop nest, GPU indexing, ...)
+      - Platform-specific optimizations (e.g. vectorization, OpenMP)
     """

     def __init__(self, ctx: KernelCreationContext) -> None:
@@ -22,12 +22,16 @@ class Platform(ABC):

     @property
     @abstractmethod
     def required_headers(self) -> set[str]:
+        """Set of header files that must be included at the point of definition of a kernel
+        running on this platform."""
         pass

     @abstractmethod
     def materialize_iteration_space(
         self, body: PsBlock, ispace: IterationSpace
     ) -> PsBlock:
+        """Materialize the given iteration space as an indexing structure and embed the given
+        kernel body into that structure."""
         pass

     @abstractmethod
@@ -14,6 +14,7 @@ from ..backend.kernelcreation import (
     FullIterationSpace,
     SparseIterationSpace,
 )

+from ..backend.platforms.cuda import ThreadMapping
 from ..backend.ast.expressions import PsExpression
@@ -198,24 +199,41 @@ class DynamicBlockSizeLaunchConfiguration(GpuLaunchConfiguration):
         return self._block_size


-class GpuIndexing(ABC):
+class GpuIndexing:
+    """Factory for GPU indexing objects required during code generation.
+
+    This class acts as a helper class for the code generation driver.
+    It produces both the `ThreadMapping` required by the backend,
+    as well as factories for the launch configuration required later by the runtime system.
+
+    Args:
+        ctx: The kernel creation context
+        scheme: The desired GPU indexing scheme
+        default_block_size: A user-defined default block size, required only if the indexing scheme permits
+            modification of the block size
+        manual_launch_grid: If `True`, always emit a `ManualLaunchConfiguration` to force the runtime system
+            to set the launch configuration explicitly
+    """

     def __init__(
         self,
         ctx: KernelCreationContext,
         scheme: GpuIndexingScheme,
-        block_size: dim3 | _AUTO_TYPE,
-        manual_launch_grid: bool,
+        default_block_size: dim3 | _AUTO_TYPE | None = None,
+        manual_launch_grid: bool = False,
     ) -> None:
         self._ctx = ctx
         self._scheme = scheme
-        self._block_size = block_size
+        self._default_block_size = default_block_size
         self._manual_launch_grid = manual_launch_grid

         from ..backend.kernelcreation import AstFactory

         self._factory = AstFactory(self._ctx)

-    def get_thread_mapping(self):
+    def get_thread_mapping(self) -> ThreadMapping:
+        """Retrieve a thread mapping object for use by the backend"""
         from ..backend.platforms.cuda import Linear3DMapping, Blockwise4DMapping

         match self._scheme:
@@ -225,6 +243,7 @@ class GpuIndexing(ABC):
             return Blockwise4DMapping()

     def get_launch_config_factory(self) -> Callable[[], GpuLaunchConfiguration]:
+        """Retrieve a factory for the launch configuration for later consumption by the runtime system"""
         if self._manual_launch_grid:
             return ManualLaunchConfiguration
@@ -254,7 +273,10 @@ class GpuIndexing(ABC):
         return factory

     def _get_default_block_size(self, rank: int) -> dim3:
-        if isinstance(self._block_size, _AUTO_TYPE):
+        if self._default_block_size is None:
+            raise CodegenError("The default block size option was not set")
+
+        if isinstance(self._default_block_size, _AUTO_TYPE):
             match rank:
                 case 1:
                     return (256, 1, 1)
@@ -265,7 +287,7 @@ class GpuIndexing(ABC):
                 case _:
                     assert False, "unreachable code"
         else:
-            return self._block_size
+            return self._default_block_size

     def _get_blockwise4d_config_factory(
         self,