diff --git a/docs/source/contributing/index.md b/docs/source/contributing/index.md
index 56c97509cbc4c0e3f312fade9fd08af90c4c9c3d..8be86cdd436bf4e7079a3a4cae3ac17748ceb9e6 100644
--- a/docs/source/contributing/index.md
+++ b/docs/source/contributing/index.md
@@ -1,3 +1,4 @@
+(contribution_guide)=
 # Contribution Guide
 
 Welcome to the Contributor's Guide to pystencils!
diff --git a/docs/source/installation.md b/docs/source/installation.md
index cea0acd2feba1c45f2e73b0d52292a92aaa28d07..deb2b0613564f98468f623544acf3cc1ca9d279e 100644
--- a/docs/source/installation.md
+++ b/docs/source/installation.md
@@ -38,12 +38,14 @@ The following feature sets are available:
 If you are developing pystencils, we recommend you perform an editable install of your
 local clone of the repository, with all optional features:
 ```bash
-pip install -e pystencils[alltrafos,interactive,use_cython,doc,tests]
+pip install -e pystencils[alltrafos,interactive,use_cython,doc,testsuite]
 ```
 
 This includes the additional feature groups `doc`, which contains all dependencies required
-to build this documentation, and `tests`, which adds `flake8` for code style checking,
+to build this documentation, and `testsuite`, which adds `flake8` for code style checking,
 `mypy` for static type checking, and `pytest` plus plugins for running the test suite.
+
+For more information on developing pystencils, see the [](#contribution_guide).
 :::
 
 ### For Nvidia GPUs
diff --git a/docs/source/user_manual/gpu_kernels.md b/docs/source/user_manual/gpu_kernels.md
index d3a49170779838f4c3097ff56316895444cae552..2fa7cd0560e4557fd6471ea9c4a4c11160157786 100644
--- a/docs/source/user_manual/gpu_kernels.md
+++ b/docs/source/user_manual/gpu_kernels.md
@@ -28,7 +28,7 @@ import matplotlib.pyplot as plt
 
 Pystencils offers code generation for Nvidia GPUs using the CUDA programming model,
 as well as just-in-time compilation and execution of CUDA kernels from within Python
-based on the [cupy] library.w
+based on the [cupy] library.
 This section's objective is to give a detailed introduction into the creation of
 GPU kernels with pystencils.
 
@@ -62,7 +62,7 @@ which operates much in the same way that [NumPy][numpy] works on CPU arrays.
-Cupy and NumPy expose nearly the same APIs for array operations;
+CuPy and NumPy expose nearly the same APIs for array operations;
 the difference being that CuPy allocates all its arrays on the GPU
 and performs its operations as CUDA kernels.
-Also, CuPy exposes a just-in-time-compiler for GPU kernels, which internally calls [nvcc].
+Also, CuPy exposes a just-in-time compiler for GPU kernels, which internally calls [nvrtc].
 In pystencils, we use CuPy both to compile and provide executable kernels on-demand from within Python code,
 and to allocate and manage the data these kernels can be executed on.
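+
+As a small illustrative sketch (assuming a CUDA device is available),
+allocating and computing with CuPy arrays looks just like NumPy:
+
+```{code-cell} ipython3
+import cupy as cp
+
+x = cp.ones((16, 16))  # allocated in GPU memory
+y = 2 * x + 1          # each operation runs as a CUDA kernel on the device
+cp.asnumpy(y)[0, :4]   # copy the result back to the host for inspection
+```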
 
@@ -81,35 +81,79 @@ kfunc = kernel.compile()
 kfunc(f=f_arr, g=g_arr)
 ```
 
-### Modifying the Launch Configuration
+(indexing_and_launch_config)=
+## Modifying the Indexing Scheme and Launch Configuration
+
+There are two key elements to how the work items of a GPU kernel's iteration space
+are mapped onto a GPU launch grid:
+ - The *indexing scheme* defines the relation between thread indices and iteration space points;
+   it can be modified through the {any}`gpu.indexing_scheme <GpuOptions.indexing_scheme>` option
+   and is fixed for the entire kernel.
+ - The *launch configuration* defines the number of threads per block and the number of blocks on the grid
+   with which the kernel is launched.
+   It is determined mostly by the sizes of the arrays passed to the kernel,
+   but parts of it may also be customized by the user.
+   Unlike the indexing scheme, the launch configuration may change at each kernel invocation.
+
+(linear3d)=
+### The Default "Linear3D" Indexing Scheme
+
+By default, *pystencils* will employ a 1:1 mapping between threads and iteration space points
+via the global thread indices inside the launch grid; e.g.:
+
+```{code-block} C++
+ctr_0 = start_0 + step_0 * (blockDim.x * blockIdx.x + threadIdx.x);
+ctr_1 = start_1 + step_1 * (blockDim.y * blockIdx.y + threadIdx.y);
+ctr_2 = start_2 + step_2 * (blockDim.z * blockIdx.z + threadIdx.z);
+```
+
+For most kernels with an iteration space of at most three dimensions,
+this behavior is sufficient and desired.
+It can be enforced by setting `gpu.indexing_scheme = "Linear3D"`.
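+
+As a minimal sketch, the scheme can be selected explicitly on a configuration object
+(assuming `cfg` is then passed on to `create_kernel` as usual):
+
+```{code-cell} ipython3
+cfg = ps.CreateKernelConfig()
+cfg.target = ps.Target.CUDA
+cfg.gpu.indexing_scheme = "Linear3D"
+```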
+
+If the `Linear3D` indexing scheme is used, you may modify the GPU thread block size in two places.
+The default block size for the kernel can be set via the {any}`gpu.block_size <GpuOptions.block_size>`
+code generator option;
+if none is specified, a default depending on the iteration space's dimensionality will be used.
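+
+For instance, a sketch setting a default block size at code generation time;
+the concrete value here is only illustrative:
+
+```{code-cell} ipython3
+cfg.gpu.block_size = (128, 2, 1)
+```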
 
-The `kernel.compile()` invocation in the above code produces a {any}`CupyKernelWrapper` callable object.
-This object holds the kernel's launch grid configuration
-(i.e. the number of thread blocks, and the number of threads per block.)
-Pystencils specifies a default value for the block size and if possible, 
-the number of blocks is automatically inferred in order to cover the entire iteration space.
-In addition, the wrapper's interface allows us to customize the GPU launch grid,
-by manually setting both the number of threads per block, and the number of blocks on the grid:
+The block size can furthermore be modified on the compiled kernel's wrapper object via the
+`launch_config.block_size` attribute:
 
 ```{code-cell} ipython3
-kfunc.block_size = (16, 8, 8)
-kfunc.num_blocks = (1, 2, 2)
+kfunc = kernel.compile()
+kfunc.launch_config.block_size = (256, 2, 1)
+
+# Run the kernel
+kfunc(f=f_arr, g=g_arr)
 ```
 
-For most kernels, setting only the `block_size` is sufficient since pystencils will
-automatically compute the number of blocks;
-for exceptions to this, see [](#manual_launch_grids).
-If `num_blocks` is set manually and the launch grid thus specified is too small, only
-a part of the iteration space will be traversed by the kernel;
-similarily, if it is too large, it will cause any threads working outside of the iteration bounds to idle.
+In either case, pystencils will automatically compute the grid size from the shapes of the kernel's array arguments
+and the given thread block size.
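+
+Conceptually, the grid size is obtained by ceil-dividing the extent of the iteration space
+by the block size in each dimension; a sketch of that computation
+(`grid_size_for` is a hypothetical helper, not part of the pystencils API):
+
+```{code-cell} ipython3
+def grid_size_for(iter_shape, block_size):
+    # hypothetical helper: one block per fully or partially covered chunk
+    return tuple((n + b - 1) // b for n, b in zip(iter_shape, block_size))
+
+grid_size_for((32, 32, 32), (256, 2, 1))
+```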
 
 (manual_launch_grids)=
 ### Manual Launch Grids and Non-Cuboid Iteration Patterns
 
-In some cases, it will be unavoidable to set the launch grid size manually;
-especially if the code generator is unable to automatically determine the size of the
-iteration space.
-An example for this is the triangular iteration previously described in the [Kernel Creation Guide](#example_triangular_iteration).
+By default, the above indexing schemes will automatically compute the GPU launch configuration
+from array shapes and optional user input.
+However, it is also possible to override this behavior and specify a launch grid manually.
+This is in fact unavoidable if the code generator cannot precompute the number of points
+in the kernel's iteration space.
+
+To specify a manual launch configuration, set the {any}`gpu.manual_launch_grid <GpuOptions.manual_launch_grid>`
+option to `True`.
+Then, after compiling the kernel, set its block and grid size via the `launch_config` property:
+
+```{code-cell} ipython3
+cfg.gpu.manual_launch_grid = True
+
+kernel = ps.create_kernel(update, cfg)
+kfunc = kernel.compile()
+kfunc.launch_config.block_size = (64, 2, 1)
+kfunc.launch_config.grid_size = (4, 2, 1)
+```
+
+An example where this is necessary is the triangular iteration
+previously described in the [Kernel Creation Guide](#example_triangular_iteration).
 Let's set it up once more:
 
 ```{code-cell} ipython3
@@ -149,19 +193,14 @@ cfg.target= ps.Target.CUDA
 cfg.iteration_slice = ps.make_slice[:, y:]
 ```
 
-In this case, it is necessary to set the `gpu.manual_launch_grid` option to `True`;
-otherwise, code generation will fail as the code generator cannot figure out
-a GPU grid size on its own:
+For this kernel, the code generator cannot figure out a launch configuration on its own,
+so we need to manually provide one:
 
 ```{code-cell} ipython3
 cfg.gpu.manual_launch_grid = True
     
 kernel = ps.create_kernel(assignments, cfg).compile()
-```
 
-Now, to execute our kernel, we have to manually specify its launch grid:
-
-```{code-cell} ipython3
 kernel.launch_config.block_size = (8, 8)
 kernel.launch_config.grid_size = (2, 2)
 ```
@@ -175,7 +214,7 @@ kernel(f=f_arr)
 _draw_ispace(cp.asnumpy(f_arr))
 ```
 
 We can also observe the effect of decreasing the launch grid size:
 
 ```{code-cell} ipython3
 kernel.launch_config.block_size = (4, 4)
@@ -203,5 +242,5 @@ only a part of the triangle is being processed.
 
 [cupy]: https://cupy.dev "CuPy Homepage"
 [numpy]: https://numpy.org "NumPy Homepage"
-[nvcc]: https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html "NVIDIA CUDA Compiler Driver"
+[nvrtc]: https://docs.nvidia.com/cuda/nvrtc/index.html "NVIDIA Runtime Compilation Library"
 [cupy-docs]: https://docs.cupy.dev/en/stable/overview.html "CuPy Documentation"