diff --git a/docs/source/contributing/index.md b/docs/source/contributing/index.md
index 56c97509cbc4c0e3f312fade9fd08af90c4c9c3d..8be86cdd436bf4e7079a3a4cae3ac17748ceb9e6 100644
--- a/docs/source/contributing/index.md
+++ b/docs/source/contributing/index.md
@@ -1,3 +1,4 @@
+(contribution_guide)=
 # Contribution Guide
 
 Welcome to the Contributor's Guide to pystencils!
diff --git a/docs/source/installation.md b/docs/source/installation.md
index cea0acd2feba1c45f2e73b0d52292a92aaa28d07..deb2b0613564f98468f623544acf3cc1ca9d279e 100644
--- a/docs/source/installation.md
+++ b/docs/source/installation.md
@@ -38,12 +38,14 @@ The following feature sets are available:
 
 If you are developing pystencils, we recommend you perform an editable install of your
 local clone of the repository, with all optional features:
 
 ```bash
-pip install -e pystencils[alltrafos,interactive,use_cython,doc,tests]
+pip install -e pystencils[alltrafos,interactive,use_cython,doc,testsuite]
 ```
 
 This includes the additional feature groups `doc`, which contains all dependencies required
-to build this documentation, and `tests`, which adds `flake8` for code style checking,
+to build this documentation, and `testsuite`, which adds `flake8` for code style checking,
 `mypy` for static type checking, and `pytest` plus plugins for running the test suite.
+
+For more information on developing pystencils, see the [](#contribution_guide).
 :::
 
 ### For Nvidia GPUs
diff --git a/docs/source/user_manual/gpu_kernels.md b/docs/source/user_manual/gpu_kernels.md
index d3a49170779838f4c3097ff56316895444cae552..2fa7cd0560e4557fd6471ea9c4a4c11160157786 100644
--- a/docs/source/user_manual/gpu_kernels.md
+++ b/docs/source/user_manual/gpu_kernels.md
@@ -28,7 +28,7 @@ import matplotlib.pyplot as plt
 
 Pystencils offers code generation for Nvidia GPUs using the CUDA programming model,
 as well as just-in-time compilation and execution of CUDA kernels from within Python
-based on the [cupy] library.w
+based on the [cupy] library.
 This section's objective is to give a detailed introduction into the creation of
 GPU kernels with pystencils.
 
@@ -62,7 +62,7 @@ which operates much in the same way that [NumPy][numpy] works on CPU arrays.
 Cupy and NumPy expose nearly the same APIs for array operations;
 the difference being that CuPy allocates all its arrays on the GPU
 and performs its operations as CUDA kernels.
-Also, CuPy exposes a just-in-time-compiler for GPU kernels, which internally calls [nvcc].
+Also, CuPy exposes a just-in-time compiler for GPU kernels, which internally calls [nvrtc].
 In pystencils, we use CuPy both to compile and provide executable kernels on-demand
 from within Python code, and to allocate and manage the data these kernels can be
 executed on.
 
@@ -81,35 +81,79 @@ kfunc = kernel.compile()
 kfunc(f=f_arr, g=g_arr)
 ```
 
-### Modifying the Launch Configuration
+(indexing_and_launch_config)=
+## Modifying the Indexing Scheme and Launch Configuration
+
+There are two key elements to how the work items of a GPU kernel's iteration space
+are mapped onto a GPU launch grid:
+ - The *indexing scheme* defines the relation between thread indices and iteration space points;
+   it can be modified through the {any}`gpu.indexing_scheme <GpuOptions.indexing_scheme>` option
+   and is fixed for the entire kernel.
+ - The *launch configuration* defines the number of threads per block and the number of blocks
+   on the grid with which the kernel is launched.
+   The launch configuration mostly depends on the sizes of the arrays passed to the kernel,
+   but parts of it may also be set manually,
+   and it may change at each kernel invocation (see the sketch below).
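+
+As a brief sketch of where each of these two elements is configured
+(reusing the `update` assignment and the arrays `f_arr` and `g_arr` from the example above),
+the indexing scheme is chosen as a code generator option,
+while the launch configuration lives on the compiled kernel's wrapper object:
+
+```{code-cell} ipython3
+cfg = ps.CreateKernelConfig()
+cfg.target = ps.Target.CUDA
+# The indexing scheme is fixed when the kernel is generated
+cfg.gpu.indexing_scheme = "Linear3D"
+
+kernel = ps.create_kernel(update, cfg)
+kfunc = kernel.compile()
+
+# The launch configuration may be adjusted anew before each invocation
+kfunc.launch_config.block_size = (128, 2, 2)
+kfunc(f=f_arr, g=g_arr)
+```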
+
+(linear3d)=
+### The Default "Linear3D" Indexing Scheme
+
+By default, *pystencils* will employ a 1:1 mapping between threads and iteration space points
+via the global thread indices inside the launch grid; e.g.
+
+```{code-block} C++
+ctr_0 = start_0 + step_0 * (blockDim.x * blockIdx.x + threadIdx.x);
+ctr_1 = start_1 + step_1 * (blockDim.y * blockIdx.y + threadIdx.y);
+ctr_2 = start_2 + step_2 * (blockDim.z * blockIdx.z + threadIdx.z);
+```
+
+For most kernels with an at most three-dimensional iteration space,
+this behavior is sufficient and desired.
+It can be enforced by setting `gpu.indexing_scheme = "Linear3D"`.
+
+If the `Linear3D` indexing scheme is used, you may modify the GPU thread block size in two places.
+The default block size for the kernel can be set via the {any}`gpu.block_size <GpuOptions.block_size>`
+code generator option;
+if none is specified, a default suited to the iteration space's dimensionality will be used.
 
-The `kernel.compile()` invocation in the above code produces a {any}`CupyKernelWrapper` callable object.
-This object holds the kernel's launch grid configuration
-(i.e. the number of thread blocks, and the number of threads per block.)
-Pystencils specifies a default value for the block size and if possible,
-the number of blocks is automatically inferred in order to cover the entire iteration space.
-In addition, the wrapper's interface allows us to customize the GPU launch grid,
-by manually setting both the number of threads per block, and the number of blocks on the grid:
+The block size can furthermore be modified on the compiled kernel's wrapper object via the
+`launch_config.block_size` attribute:
 
 ```{code-cell} ipython3
-kfunc.block_size = (16, 8, 8)
-kfunc.num_blocks = (1, 2, 2)
+kfunc = kernel.compile()
+kfunc.launch_config.block_size = (256, 2, 1)
+
+# Run the kernel
+kfunc(f=f_arr, g=g_arr)
 ```
 
-For most kernels, setting only the `block_size` is sufficient since pystencils will
-automatically compute the number of blocks;
-for exceptions to this, see [](#manual_launch_grids).
-If `num_blocks` is set manually and the launch grid thus specified is too small, only
-a part of the iteration space will be traversed by the kernel;
-similarily, if it is too large, it will cause any threads working outside of the iteration bounds to idle.
+In either case, pystencils will automatically compute the grid size from the shapes
+of the kernel's array arguments and the given thread block size.
 
 (manual_launch_grids)=
 ### Manual Launch Grids and Non-Cuboid Iteration Patterns
 
-In some cases, it will be unavoidable to set the launch grid size manually;
-especially if the code generator is unable to automatically determine the size of the
-iteration space.
-An example for this is the triangular iteration previously described in the [Kernel Creation Guide](#example_triangular_iteration).
+By default, the indexing schemes described above will automatically compute the GPU launch configuration
+from array shapes and optional user input.
+However, it is also possible to override this behavior and instead specify a launch grid manually.
+This is even unavoidable if the code generator cannot precompute the number of points
+in the kernel's iteration space.
+
+To specify a manual launch configuration, set the {any}`gpu.manual_launch_grid <GpuOptions.manual_launch_grid>`
+option to `True`.
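+
+A manually specified grid must be large enough to cover the kernel's entire iteration space;
+if it is too small, only part of the iteration space will be processed,
+as demonstrated at the end of this section.
+As a plain-Python sketch (the helper `covering_grid` below is hypothetical,
+not part of the pystencils API), a covering grid size can be obtained by
+ceil-dividing the number of work items in each dimension by the block size:
+
+```{code-cell} ipython3
+import math
+
+def covering_grid(num_work_items, block_size):
+    # Hypothetical helper (not pystencils API): ceil division per dimension
+    # yields the smallest grid that covers all work items
+    return tuple(math.ceil(n / b) for n, b in zip(num_work_items, block_size))
+
+# E.g. 128 x 4 x 2 work items with 64 x 2 x 1 threads per block
+covering_grid((128, 4, 2), (64, 2, 1))
+```
+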
+After compiling the kernel, set its block and grid size via the `launch_config` property:
+
+```{code-cell} ipython3
+cfg.gpu.manual_launch_grid = True
+
+kernel = ps.create_kernel(update, cfg)
+kfunc = kernel.compile()
+kfunc.launch_config.block_size = (64, 2, 1)
+kfunc.launch_config.grid_size = (4, 2, 1)
+```
+
+An example where this is necessary is the triangular iteration
+previously described in the [Kernel Creation Guide](#example_triangular_iteration).
 Let's set it up once more:
 
 ```{code-cell} ipython3
@@ -149,19 +193,14 @@ cfg.target= ps.Target.CUDA
 cfg.iteration_slice = ps.make_slice[:, y:]
 ```
 
-In this case, it is necessary to set the `gpu.manual_launch_grid` option to `True`;
-otherwise, code generation will fail as the code generator cannot figure out
-a GPU grid size on its own:
+For this kernel, the code generator cannot figure out a launch configuration on its own,
+so we need to manually provide one:
 
 ```{code-cell} ipython3
 cfg.gpu.manual_launch_grid = True
 
 kernel = ps.create_kernel(assignments, cfg).compile()
-```
-
-Now, to execute our kernel, we have to manually specify its launch grid:
-
-```{code-cell} ipython3
 kernel.launch_config.block_size = (8, 8)
 kernel.launch_config.grid_size = (2, 2)
 ```
 
@@ -175,7 +214,7 @@ kernel(f=f_arr)
 _draw_ispace(cp.asnumpy(f_arr))
 ```
 
-We can also observe the effect of decreasing the launch grid size:
+We can also observe the effect of decreasing the launch grid size.
 
 ```{code-cell} ipython3
 kernel.launch_config.block_size = (4, 4)
@@ -203,5 +242,5 @@ only a part of the triangle is being processed.
 
 [cupy]: https://cupy.dev "CuPy Homepage"
 [numpy]: https://numpy.org "NumPy Homepage"
-[nvcc]: https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html "NVIDIA CUDA Compiler Driver"
+[nvrtc]: https://docs.nvidia.com/cuda/nvrtc/index.html "NVIDIA Runtime Compilation Library"
 [cupy-docs]: https://docs.cupy.dev/en/stable/overview.html "CuPy Documentation"