diff --git a/docs/source/user_manual/gpu_kernels.md b/docs/source/user_manual/gpu_kernels.md
index 2fa7cd0560e4557fd6471ea9c4a4c11160157786..610c61ddf647331d7b77b06968e489b4dcc76293 100644
--- a/docs/source/user_manual/gpu_kernels.md
+++ b/docs/source/user_manual/gpu_kernels.md
@@ -54,7 +54,19 @@ It extends {py:class}`Kernel` with some GPU-specific information.
 
 If a GPU is available and [CuPy][cupy] is installed in the current environment,
 the kernel can be compiled and run immediately.
-To execute the kernel, a {any}`cupy.ndarray` has to be passed for each field.
+To execute the kernel, a {any}`cupy.ndarray` has to be passed for each field:
+
+```{code-cell} ipython3
+:tags: [raises-exception]
+import cupy as cp
+
+rng = cp.random.default_rng(seed=42)
+f_arr = rng.random((16, 16, 16))
+g_arr = cp.zeros_like(f_arr)
+
+kfunc = kernel.compile()
+kfunc(f=f_arr, g=g_arr)
+```
 
 :::{note}
 [CuPy][cupy] is a Python library for numerical computations on GPU arrays,
@@ -69,18 +81,6 @@ and to allocate and manage the data these kernels can be executed on.
 For more information on CuPy, refer to [their documentation][cupy-docs].
 :::
 
-```{code-cell} ipython3
-:tags: [raises-exception]
-import cupy as cp
-
-rng = cp.random.default_rng(seed=42)
-f_arr = rng.random((16, 16, 16))
-g_arr = cp.zeros_like(f_arr)
-
-kfunc = kernel.compile()
-kfunc(f=f_arr, g=g_arr)
-```
-
 (indexing_and_launch_config)=
 ## Modify the Indexing Scheme and Launch Configuration
 
@@ -130,6 +130,25 @@ kfunc(f=f_arr, g=g_arr)
 In any case. pystencils will automatically compute the grid size from the shapes of the kernel's array arguments
 and the given thread block size.
 
+:::{attention}
+
+Because of the way GPU architectures split thread blocks into warps,
+pystencils will map the kernel's *fastest* spatial coordinate onto the `x` block and thread
+indices, the second-fastest to `y`, and the slowest coordinate to `z`.
+
+This can mean that, when using `cupy` arrays with the default memory layout
+(corresponding to the `"numpy"` field layout specifier),
+the *thread coordinates* and the *spatial coordinates*
+map to each other in *opposite order*; e.g.
+
+| Spatial Coordinate | Thread Index  |
+|--------------------|---------------|
+| `x` (slowest)      | `threadIdx.z` |
+| `y`                | `threadIdx.y` |
+| `z` (fastest)      | `threadIdx.x` |
+
+:::
+
 (manual_launch_grids)=
 ### Manual Launch Grids and Non-Cuboid Iteration Patterns
 
diff --git a/tests/kernelcreation/test_gpu.py b/tests/kernelcreation/test_gpu.py
index d80647fb6a620182c008a9cb9a65adb8f16b0b4e..75239c9b10c404c6acf88c61908293c578623ba6 100644
--- a/tests/kernelcreation/test_gpu.py
+++ b/tests/kernelcreation/test_gpu.py
@@ -67,11 +67,7 @@ def test_indexing_options(
                 kernel.launch_config.grid_size = (32, 16, 1)
 
     elif indexing_scheme == "linear3d":
-        kernel.launch_config.block_size = (
-            10,
-            8,
-            8,
-        )  # must fit the src_arr shape (without ghost layers)
+        kernel.launch_config.block_size = (10, 8, 8)
 
     kernel(src=src_arr, dst=dst_arr)