diff --git a/docs/source/backend/gpu_codegen.md b/docs/source/backend/gpu_codegen.md
index 3fe00840ed13faeb837d91711bca0be1a390aa4e..a95c365665bf803ce3cc683b40319c72dbf857ca 100644
--- a/docs/source/backend/gpu_codegen.md
+++ b/docs/source/backend/gpu_codegen.md
@@ -1,3 +1,4 @@
+(gpu_codegen)=
 # GPU Code Generation
 
 The code generation infrastructure for Nvidia and AMD GPUs using CUDA and HIP comprises the following components:
diff --git a/docs/source/index.rst b/docs/source/index.rst
index 6dba50af184ee95c7378a2e923bd76a6d97883a2..4e1070979fb7e02f3c6be799761f3999fec10476 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -81,6 +81,7 @@ Topics
 
   user_manual/symbolic_language
   user_manual/kernelcreation
+  user_manual/reductions
   user_manual/gpu_kernels
   user_manual/WorkingWithTypes
 
diff --git a/docs/source/user_manual/reductions.md b/docs/source/user_manual/reductions.md
new file mode 100644
index 0000000000000000000000000000000000000000..5b45a921cc9cdb61a369733b007573e61f70c0eb
--- /dev/null
+++ b/docs/source/user_manual/reductions.md
@@ -0,0 +1,202 @@
+---
+jupytext:
+  formats: md:myst
+  text_representation:
+    extension: .md
+    format_name: myst
+    format_version: 0.13
+    jupytext_version: 1.16.4
+kernelspec:
+  display_name: Python 3 (ipykernel)
+  language: python
+  name: python3
+mystnb:
+  execution_mode: cache
+---
+
+```{code-cell} ipython3
+:tags: [remove-cell, raises-exception]
+
+import sympy as sp
+import pystencils as ps
+import numpy as np
+import cupy as cp
+
+from enum import Enum
+```
+
+(guide_reductions)=
+# Reductions in Pystencils
+
+Reductions play a vital role in numerical simulations as they allow aggregating data across multiple elements, 
+such as computing sums, products over an array or finding its minima or maxima.
+
+## Specifying Assignments with Reductions
+
+In pystencils, reductions are made available via specialized assignments, namely `ReductionAssignment`.
+Here is a snippet creating a reduction assignment for adding up all elements of a field:
+
+```{code-cell} ipython3
+r = ps.TypedSymbol("r", "double")
+x = ps.fields(f"x: double[3D]", layout="fzyx")
+
+assign_sum = ps.AddReductionAssignment(r, x.center())
+```
+
+For each point in the iteration space, the left-hand side symbol `r` accumulates the contents of the 
+right-hand side `x.center()`. In our case, the `AddReductionAssignment` denotes an accumulation via additions.
+
+**Pystencils requires type information about the reduction symbols and thus requires `r` to be a `TypedSymbol`.**
+
+The following reduction assignment classes are available in pystencils:    
+* `AddReductionAssignment`: Builds sum over elements
+* `SubReductionAssignment`: Builds difference over elements
+* `MulReductionAssignment`: Builds product over elements
+* `MinReductionAssignment`: Finds minimum element
+* `MaxReductionAssignment`: Finds maximum element
+
+:::{note}
+Alternatívely, you can also make use of the `reduction_assignment` function
+to specify reduction assignments:
+:::
+
+```{code-cell} ipython3
+from pystencils.sympyextensions import reduction_assignment
+from pystencils.sympyextensions.reduction import ReductionOp
+
+assign_sum = reduction_assignment(r, ReductionOp.Add, x.center())
+```
+
+For other reduction operations, the following enums can be passed to `reduction_assignment`.
+
+```{code-cell} python3
+class ReductionOp(Enum):
+    Add = "+"
+    Sub = "-"
+    Mul = "*"
+    Min = "min"
+    Max = "max"
+```
+
+## Generating and Running Reduction Kernels
+
+With the assignments being fully assembled, we can finally invoke the code generator and 
+create the kernel object via the {any}`create_kernel` function.
+
+### CPU Platforms
+
+For this example, we assume a kernel configuration for CPU platforms with no optimizations explicitly enabled.
+
+```{code-cell} ipython3
+cfg = ps.CreateKernelConfig(target=ps.Target.CurrentCPU)
+kernel = ps.create_kernel(assign_sum, cfg)
+
+ps.inspect(kernel)
+```
+
+:::{note}
+The generated reduction kernels may vary vastly for different platforms and optimizations.
+You can find a  detailed description of configuration choices and their impact on the generated code below.
+:::
+
+The kernel can be compiled and run immediately.
+
+To execute the kernel on CPUs, not only a {any}`numpy.ndarray` has to be passed for each field
+but also one for exporting reduction results. 
+The export mechanism can be seen in the previously generated code snippet. 
+Here, the kernel obtains a pointer with the name of the reduction symbol (here: `r`).
+This pointer is used for exporting the reduction result back from the kernel.
+Please note that the **values passed via pointer will not be overwritten** 
+but will be incorporated in the reduction computation.
+Since our reduction result is a single scalar value, it is sufficient to set up an array comprising a singular value.
+
+```{code-cell} ipython3
+    kernel_func = kernel.compile()
+
+    x_array = np.ones((4, 4, 4), dtype="float64")
+    reduction_result = np.zeros((1,), dtype="float64")
+
+    kernel_func(x=x_array, r=reduction_result)
+    
+    reduction_result[0]
+```
+
+### GPU Platforms
+
+Please note that **reductions are currently only supported for CUDA platforms**.
+Similar to the CPU section, a base variant for NVIDIA GPUs without 
+explicitly employing any optimizations is shown:
+
+```{code-cell} ipython3
+    cfg = ps.CreateKernelConfig(target=ps.Target.CUDA)
+
+    kernel_gpu = ps.create_kernel(assign_sum, cfg)
+
+    ps.inspect(kernel_gpu)
+```
+
+The steps for running the generated code on NVIDIA GPUs are identical but the fields and the write-back pointer 
+now require device memory, i.e. instances of {any}`cupy.ndarray`.
+
+## Optimizations for Reductions
+
+Going beyond the aforementioned basic kernel configurations,
+we now demonstrate optimization strategies for different platforms 
+that can be applied to reduction kernels and show what impact they have.
+
+### CPU Platforms
+
+For CPU platforms, standard optimizations are employing SIMD vectorization and shared-memory parallelism using OpenMP.
+The supported SIMD instruction sets for reductions are:
+* SSE3
+* AVX/AVX2
+* AVX512
+
+Below you can see that an AVX vectorization was employed by using the target `Target.X86_AVX`.
+**Note that reductions require `assume_inner_stride_one` to be enabled.**
+This is due to the fact that other inner strides would require masked SIMD operations 
+which are not supported yet.
+
+```{code-cell} ipython3
+# configure SIMD vectorization
+cfg = ps.CreateKernelConfig(
+  target=ps.Target.X86_AVX,
+)
+cfg.cpu.vectorize.enable = True
+cfg.cpu.vectorize.assume_inner_stride_one = True
+
+# configure OpenMP parallelization
+cfg.cpu.openmp.enable = True
+cfg.cpu.openmp.num_threads = 8
+
+kernel_cpu_opt = ps.create_kernel(assign_sum, cfg)
+
+ps.inspect(kernel_cpu_opt)
+```
+
+### GPU Platforms
+
+As evident from the generated kernel for the base variant, atomic operations are employed 
+for updating the pointer holding the reduction result.
+Using the *explicit warp-level instructions* provided by CUDA allows us to achieve higher performance compared to
+only using atomic operations.
+To generate kernels with warp-level reductions, the generator expects that CUDA block sizes are divisible by 
+the hardware's warp size.
+**Similar to the SIMD configuration, we assure the code generator that the configured block size fulfills this
+criterion by enabling `assume_warp_aligned_block_size`.**
+While the default block sizes provided by the code generator already fulfill this criterion,
+we employ a block fitting algorithm to obtain a block size that is also optimized for the kernel's iteration space.
+
+You can find more detailed information about warp size alignment in {ref}`gpu_codegen`.
+
+```{code-cell} ipython3
+    cfg = ps.CreateKernelConfig(target=ps.Target.CUDA)
+    cfg.gpu.assume_warp_aligned_block_size = True
+
+    kernel_gpu_opt = ps.create_kernel(assign_sum, cfg)
+    
+    kernel_func = kernel_gpu_opt.compile()
+    kernel_func.launch_config.fit_block_size((32, 1, 1))
+
+    ps.inspect(kernel_gpu_opt)
+```
diff --git a/src/pystencils/__init__.py b/src/pystencils/__init__.py
index 07283d5294bc08c8e68e6a40af0f956b36a0129a..329f61d32087df392406d17b5db6bcb520798317 100644
--- a/src/pystencils/__init__.py
+++ b/src/pystencils/__init__.py
@@ -1,10 +1,6 @@
 """Module to generate stencil kernels in C or CUDA using sympy expressions and call them as Python functions"""
 
-from .codegen import (
-    Target,
-    CreateKernelConfig,
-    AUTO
-)
+from .codegen import Target, CreateKernelConfig, AUTO
 from .defaults import DEFAULTS
 from . import fd
 from . import stencil as stencil
@@ -34,6 +30,13 @@ from .simp import AssignmentCollection
 from .sympyextensions.typed_sympy import TypedSymbol, DynamicType
 from .sympyextensions import SymbolCreator, tcast
 from .datahandling import create_data_handling
+from .sympyextensions.reduction import (
+    AddReductionAssignment,
+    SubReductionAssignment,
+    MulReductionAssignment,
+    MinReductionAssignment,
+    MaxReductionAssignment,
+)
 
 __all__ = [
     "Field",
@@ -61,6 +64,11 @@ __all__ = [
     "AssignmentCollection",
     "Assignment",
     "AddAugmentedAssignment",
+    "AddReductionAssignment",
+    "SubReductionAssignment",
+    "MulReductionAssignment",
+    "MinReductionAssignment",
+    "MaxReductionAssignment",
     "assignment_from_stencil",
     "SymbolCreator",
     "create_data_handling",
@@ -81,4 +89,5 @@ __all__ = [
 ]
 
 from . import _version
-__version__ = _version.get_versions()['version']
+
+__version__ = _version.get_versions()["version"]
diff --git a/src/pystencils/backend/ast/analysis.py b/src/pystencils/backend/ast/analysis.py
index edeba04f2b8e5d8727abe1150b9e574808e6811a..7032690a03dcf851a5b5227abd70c98301b907e7 100644
--- a/src/pystencils/backend/ast/analysis.py
+++ b/src/pystencils/backend/ast/analysis.py
@@ -62,7 +62,7 @@ class UndefinedSymbolsCollector:
 
             case PsAssignment(lhs, rhs):
                 undefined_vars = self(lhs) | self(rhs)
-                if isinstance(lhs, PsSymbolExpr):
+                if isinstance(node, PsDeclaration) and isinstance(lhs, PsSymbolExpr):
                     undefined_vars.remove(lhs.symbol)
                 return undefined_vars
 
diff --git a/src/pystencils/backend/ast/vector.py b/src/pystencils/backend/ast/vector.py
index 705d250949f3662695d506feeff30c20649eb1c5..d7ae8d6a94ddf747eeb8da9b8e97d350691fb2a9 100644
--- a/src/pystencils/backend/ast/vector.py
+++ b/src/pystencils/backend/ast/vector.py
@@ -3,8 +3,9 @@ from __future__ import annotations
 from typing import cast
 
 from .astnode import PsAstNode
-from .expressions import PsExpression, PsLvalue, PsUnOp
+from .expressions import PsExpression, PsLvalue, PsUnOp, PsBinOp
 from .util import failing_cast
+from ...sympyextensions import ReductionOp
 
 from ...types import PsVectorType
 
@@ -17,7 +18,7 @@ class PsVecBroadcast(PsUnOp, PsVectorOp):
     """Broadcast a scalar value to N vector lanes."""
 
     __match_args__ = ("lanes", "operand")
-    
+
     def __init__(self, lanes: int, operand: PsExpression):
         super().__init__(operand)
         self._lanes = lanes
@@ -25,26 +26,83 @@ class PsVecBroadcast(PsUnOp, PsVectorOp):
     @property
     def lanes(self) -> int:
         return self._lanes
-    
+
     @lanes.setter
     def lanes(self, n: int):
         self._lanes = n
 
     def _clone_expr(self) -> PsVecBroadcast:
         return PsVecBroadcast(self._lanes, self._operand.clone())
-    
+
     def structurally_equal(self, other: PsAstNode) -> bool:
         if not isinstance(other, PsVecBroadcast):
             return False
+        return super().structurally_equal(other) and self._lanes == other._lanes
+
+
+class PsVecHorizontal(PsBinOp, PsVectorOp):
+    """Perform a horizontal reduction across a vector onto a scalar base value.
+
+    **Example:** vec_horizontal_add(s, v)` will compute `s + v[0] + v[1] + ... + v[n-1]`.
+
+    Args:
+        scalar_operand: Scalar operand
+        vector_operand: Vector operand to be converted to a scalar value
+        reduction_op: Binary operation that is also used for the horizontal reduction
+    """
+
+    __match_args__ = ("scalar_operand", "vector_operand", "reduction_op")
+
+    def __init__(
+        self,
+        scalar_operand: PsExpression,
+        vector_operand: PsExpression,
+        reduction_op: ReductionOp,
+    ):
+        super().__init__(scalar_operand, vector_operand)
+        self._reduction_op = reduction_op
+
+    @property
+    def scalar_operand(self) -> PsExpression:
+        return self._op1
+
+    @scalar_operand.setter
+    def scalar_operand(self, op: PsExpression):
+        self._op1 = op
+
+    @property
+    def vector_operand(self) -> PsExpression:
+        return self._op2
+
+    @vector_operand.setter
+    def vector_operand(self, op: PsExpression):
+        self._op2 = op
+
+    @property
+    def reduction_op(self) -> ReductionOp:
+        return self._reduction_op
+
+    @reduction_op.setter
+    def reduction_op(self, op: ReductionOp):
+        self._reduction_op = op
+
+    def _clone_expr(self) -> PsVecHorizontal:
+        return PsVecHorizontal(
+            self._op1.clone(), self._op2.clone(), self._reduction_op
+        )
+
+    def structurally_equal(self, other: PsAstNode) -> bool:
+        if not isinstance(other, PsVecHorizontal):
+            return False
         return (
             super().structurally_equal(other)
-            and self._lanes == other._lanes
+            and self._reduction_op == other._reduction_op
         )
 
 
 class PsVecMemAcc(PsExpression, PsLvalue, PsVectorOp):
     """Pointer-based vectorized memory access.
-    
+
     Args:
         base_ptr: Pointer identifying the accessed memory region
         offset: Offset inside the memory region
@@ -95,7 +153,7 @@ class PsVecMemAcc(PsExpression, PsLvalue, PsVectorOp):
     @property
     def stride(self) -> PsExpression | None:
         return self._stride
-    
+
     @stride.setter
     def stride(self, expr: PsExpression | None):
         self._stride = expr
@@ -106,10 +164,12 @@ class PsVecMemAcc(PsExpression, PsLvalue, PsVectorOp):
 
     def get_vector_type(self) -> PsVectorType:
         return cast(PsVectorType, self._dtype)
-    
+
     def get_children(self) -> tuple[PsAstNode, ...]:
-        return (self._ptr, self._offset) + (() if self._stride is None else (self._stride,))
-    
+        return (self._ptr, self._offset) + (
+            () if self._stride is None else (self._stride,)
+        )
+
     def set_child(self, idx: int, c: PsAstNode):
         idx = [0, 1, 2][idx]
         match idx:
@@ -138,7 +198,7 @@ class PsVecMemAcc(PsExpression, PsLvalue, PsVectorOp):
             and self._vector_entries == other._vector_entries
             and self._aligned == other._aligned
         )
-    
+
     def __repr__(self) -> str:
         return (
             f"PsVecMemAcc({repr(self._ptr)}, {repr(self._offset)}, {repr(self._vector_entries)}, "
diff --git a/src/pystencils/backend/emission/ir_printer.py b/src/pystencils/backend/emission/ir_printer.py
index ffb65181ccd71ff95dffd6d006617dadc6809eea..5a3836d50db4a090b5088a0e6e3a7d895f9cd6ec 100644
--- a/src/pystencils/backend/emission/ir_printer.py
+++ b/src/pystencils/backend/emission/ir_printer.py
@@ -10,7 +10,7 @@ from .base_printer import BasePrinter, Ops, LR
 
 from ..ast import PsAstNode
 from ..ast.expressions import PsBufferAcc
-from ..ast.vector import PsVecMemAcc, PsVecBroadcast
+from ..ast.vector import PsVecMemAcc, PsVecBroadcast, PsVecHorizontal
 
 if TYPE_CHECKING:
     from ...codegen import Kernel
@@ -24,7 +24,7 @@ def emit_ir(ir: PsAstNode | Kernel):
 
 class IRAstPrinter(BasePrinter):
     """Print the IR AST as pseudo-code.
-    
+
     This printer produces a complete pseudocode representation of a pystencils AST.
     Other than the `CAstPrinter`, the `IRAstPrinter` is capable of emitting code for
     each node defined in `ast <pystencils.backend.ast>`.
@@ -77,6 +77,17 @@ class IRAstPrinter(BasePrinter):
                     f"vec_broadcast<{lanes}>({operand_code})", Ops.Weakest
                 )
 
+            case PsVecHorizontal(scalar_operand, vector_operand, reduction_op):
+                pc.push_op(Ops.Weakest, LR.Middle)
+                scalar_operand_code = self.visit(scalar_operand, pc)
+                vector_operand_code = self.visit(vector_operand, pc)
+                pc.pop_op()
+
+                return pc.parenthesize(
+                    f"vec_horizontal_{reduction_op.name.lower()}({scalar_operand_code, vector_operand_code})",
+                    Ops.Weakest,
+                )
+
             case _:
                 return super().visit(node, pc)
 
diff --git a/src/pystencils/backend/functions.py b/src/pystencils/backend/functions.py
index e815cbf99afff180e9c987ea053026aa18cd90cc..a3da9a1de1aa0639c5b04acfa3020bc551497b51 100644
--- a/src/pystencils/backend/functions.py
+++ b/src/pystencils/backend/functions.py
@@ -3,9 +3,9 @@ from typing import Any, Sequence, TYPE_CHECKING
 from abc import ABC
 from enum import Enum
 
-from ..types import PsType, PsNumericType
+from ..sympyextensions import ReductionOp
+from ..types import PsType, PsNumericType, PsTypeError
 from .exceptions import PsInternalCompilerError
-from ..types import PsTypeError
 
 if TYPE_CHECKING:
     from .ast.expressions import PsExpression
@@ -70,7 +70,7 @@ class MathFunctions(Enum):
 
 
 class PsMathFunction(PsFunction):
-    """Homogenously typed mathematical functions."""
+    """Homogeneously typed mathematical functions."""
 
     __match_args__ = ("func",)
 
@@ -95,6 +95,33 @@ class PsMathFunction(PsFunction):
         return hash(self._func)
 
 
+class PsReductionWriteBack(PsFunction):
+    """Function representing a reduction kernel's write-back step supported by the backend.
+
+    Each platform has to materialize this function to a concrete implementation.
+    """
+
+    def __init__(self, reduction_op: ReductionOp) -> None:
+        super().__init__("WriteBackToPtr", 2)
+        self._reduction_op = reduction_op
+
+    @property
+    def reduction_op(self) -> ReductionOp:
+        return self._reduction_op
+
+    def __str__(self) -> str:
+        return f"{super().name}"
+
+    def __eq__(self, other: object) -> bool:
+        if not isinstance(other, PsReductionWriteBack):
+            return False
+
+        return self._reduction_op == other._reduction_op
+
+    def __hash__(self) -> int:
+        return hash(self._reduction_op)
+
+
 class ConstantFunctions(Enum):
     """Numerical constant functions.
 
@@ -122,7 +149,7 @@ class PsConstantFunction(PsFunction):
     and will be broadcast by the vectorizer.
     """
 
-    __match_args__ = ("func,")
+    __match_args__ = ("func",)
 
     def __init__(
         self, func: ConstantFunctions, dtype: PsNumericType | None = None
diff --git a/src/pystencils/backend/kernelcreation/analysis.py b/src/pystencils/backend/kernelcreation/analysis.py
index 1365e1ef327afd756e17eac282efb7c0f8e65052..e5f8b921e78d703d52f507e5d577a9385adb2393 100644
--- a/src/pystencils/backend/kernelcreation/analysis.py
+++ b/src/pystencils/backend/kernelcreation/analysis.py
@@ -13,6 +13,8 @@ from ...simp import AssignmentCollection
 from sympy.codegen.ast import AssignmentBase
 
 from ..exceptions import PsInternalCompilerError, KernelConstraintsError
+from ...sympyextensions.reduction import ReductionAssignment
+from ...sympyextensions.typed_sympy import TypedSymbol
 
 
 class KernelAnalysis:
@@ -54,6 +56,8 @@ class KernelAnalysis:
         self._check_access_independence = check_access_independence
         self._check_double_writes = check_double_writes
 
+        self._reduction_symbols: set[TypedSymbol] = set()
+
         #   Map pairs of fields and indices to offsets
         self._field_writes: dict[KernelAnalysis.FieldAndIndex, set[Any]] = defaultdict(
             set
@@ -88,6 +92,14 @@ class KernelAnalysis:
                 for asm in asms:
                     self._visit(asm)
 
+            case ReductionAssignment():
+                assert isinstance(obj.lhs, TypedSymbol)
+
+                self._reduction_symbols.add(obj.lhs)
+
+                self._handle_rhs(obj.rhs)
+                self._handle_lhs(obj.lhs)
+
             case AssignmentBase():
                 self._handle_rhs(obj.rhs)
                 self._handle_lhs(obj.lhs)
@@ -152,6 +164,11 @@ class KernelAnalysis:
                                     f"{field} is read at {offsets} and written at {write_offset}"
                                 )
                 case sp.Symbol():
+                    if expr in self._reduction_symbols:
+                        raise KernelConstraintsError(
+                            f"Illegal access to reduction symbol {expr.name} outside of ReductionAssignment. "
+                        )
+
                     self._scopes.access_symbol(expr)
 
             for arg in expr.args:
diff --git a/src/pystencils/backend/kernelcreation/context.py b/src/pystencils/backend/kernelcreation/context.py
index 68da893ff6204c73d61dcebddb1da602f37520c7..319d6061cfc9ff1baf6068b62b7cc0699a75ff82 100644
--- a/src/pystencils/backend/kernelcreation/context.py
+++ b/src/pystencils/backend/kernelcreation/context.py
@@ -1,12 +1,16 @@
 from __future__ import annotations
 
+from dataclasses import dataclass
 from typing import Iterable, Iterator, Any
 from itertools import chain, count
 from collections import namedtuple, defaultdict
 import re
 
+from ..ast.expressions import PsExpression, PsConstantExpr, PsCall
+from ..functions import PsConstantFunction, ConstantFunctions
 from ...defaults import DEFAULTS
 from ...field import Field, FieldType
+from ...sympyextensions import ReductionOp
 from ...sympyextensions.typed_sympy import TypedSymbol, DynamicType
 
 from ..memory import PsSymbol, PsBuffer
@@ -44,6 +48,26 @@ class FieldsInKernel:
 FieldArrayPair = namedtuple("FieldArrayPair", ("field", "array"))
 
 
+@dataclass(frozen=True)
+class ReductionInfo:
+    """Information about a reduction operation, its neutral element in form of an initial value
+    and the pointer used by the kernel as write-back argument.
+
+    Attributes:
+    ===========
+
+    reduction_op : Reduction operation being performed
+    init_val : Initial value used to initialize local symbol
+    local_symbol : Kernel-local symbol used to accumulate intermediate reduction result
+    writeback_ptr_symbol : Symbol that is used to export the final reduction result
+    """
+
+    op: ReductionOp
+    init_val: PsExpression
+    local_symbol: PsSymbol
+    writeback_ptr_symbol: PsSymbol
+
+
 class KernelCreationContext:
     """Manages the translation process from the SymPy frontend to the backend AST, and collects
     all necessary information for the translation:
@@ -75,6 +99,8 @@ class KernelCreationContext:
         self._symbol_ctr_pattern = re.compile(r"__[0-9]+$")
         self._symbol_dup_table: defaultdict[str, int] = defaultdict(lambda: 0)
 
+        self._reduction_data: dict[str, ReductionInfo] = dict()
+
         self._fields_and_arrays: dict[str, FieldArrayPair] = dict()
         self._fields_collection = FieldsInKernel()
 
@@ -93,7 +119,7 @@ class KernelCreationContext:
     def index_dtype(self) -> PsIntegerType:
         """Data type used by default for index expressions"""
         return self._index_dtype
-    
+
     def resolve_dynamic_type(self, dtype: DynamicType | PsType) -> PsType:
         """Selects the appropriate data type for `DynamicType` instances, and returns all other types as they are."""
         match dtype:
@@ -178,6 +204,61 @@ class KernelCreationContext:
 
         self._symbols[old.name] = new
 
+    def add_reduction_info(
+        self,
+        lhs_name: str,
+        lhs_dtype: PsNumericType,
+        reduction_op: ReductionOp,
+    ):
+        """Create ReductionInfo instance and add to its corresponding lookup table for a given symbol name."""
+
+        # make sure that lhs symbol never occurred before ReductionAssignment
+        if self.find_symbol(lhs_name):
+            raise KernelConstraintsError(
+                f"Cannot create reduction with symbol {lhs_name}: "
+                "Another symbol with the same name already exist."
+            )
+
+        # add symbol for lhs with pointer datatype for write-back mechanism
+        pointer_symb = self.get_symbol(lhs_name, PsPointerType(lhs_dtype))
+
+        # create kernel-local copy of lhs symbol
+        local_symb = self.get_new_symbol(f"{lhs_name}_local", lhs_dtype)
+
+        # match for reduction operation and set neutral init_val
+        init_val: PsExpression
+        match reduction_op:
+            case ReductionOp.Add:
+                init_val = PsConstantExpr(PsConstant(0))
+            case ReductionOp.Sub:
+                init_val = PsConstantExpr(PsConstant(0))
+            case ReductionOp.Mul:
+                init_val = PsConstantExpr(PsConstant(1))
+            case ReductionOp.Min:
+                init_val = PsCall(PsConstantFunction(ConstantFunctions.PosInfinity), [])
+            case ReductionOp.Max:
+                init_val = PsCall(PsConstantFunction(ConstantFunctions.NegInfinity), [])
+            case _:
+                raise PsInternalCompilerError(
+                    f"Unsupported kind of reduction assignment: {reduction_op}."
+                )
+
+        # create reduction info and add to set
+        reduction_info = ReductionInfo(
+            reduction_op, init_val, local_symb, pointer_symb
+        )
+        self._reduction_data[lhs_name] = reduction_info
+
+        return reduction_info
+
+    def find_reduction_info(self, name: str) -> ReductionInfo | None:
+        """Find a ReductionInfo with the given name in the lookup table, if it exists.
+
+        Returns:
+            The ReductionInfo with the given name, or `None` if it does not exist.
+        """
+        return self._reduction_data.get(name, None)
+
     def duplicate_symbol(
         self, symb: PsSymbol, new_dtype: PsType | None = None
     ) -> PsSymbol:
@@ -213,6 +294,11 @@ class KernelCreationContext:
         """Return an iterable of all symbols listed in the symbol table."""
         return self._symbols.values()
 
+    @property
+    def reduction_data(self) -> dict[str, ReductionInfo]:
+        """Return a dictionary holding kernel-local reduction information for given symbol names."""
+        return self._reduction_data
+
     #   Fields and Arrays
 
     @property
diff --git a/src/pystencils/backend/kernelcreation/freeze.py b/src/pystencils/backend/kernelcreation/freeze.py
index 0737d4c740a4371997ad8963012300bd67116dc2..0fb1bd9865bca1718336b9cf3c6cde4d91397af2 100644
--- a/src/pystencils/backend/kernelcreation/freeze.py
+++ b/src/pystencils/backend/kernelcreation/freeze.py
@@ -1,9 +1,8 @@
 from typing import overload, cast, Any
 from functools import reduce
-from operator import add, mul, sub, truediv
+from operator import add, mul, sub
 
 import sympy as sp
-import sympy.core.relational
 import sympy.logic.boolalg
 from sympy.codegen.ast import AssignmentBase, AugmentedAssignment
 
@@ -13,8 +12,10 @@ from ...sympyextensions import (
     integer_functions,
     ConditionalFieldAccess,
 )
+from ..reduction_op_mapping import reduction_op_to_expr
 from ...sympyextensions.typed_sympy import TypedSymbol, TypeCast, DynamicType
 from ...sympyextensions.pointers import AddressOf, mem_acc
+from ...sympyextensions.reduction import ReductionAssignment, ReductionOp
 from ...sympyextensions.bit_masks import bit_conditional
 from ...field import Field, FieldType
 
@@ -178,19 +179,46 @@ class FreezeExpressions:
         assert isinstance(lhs, PsExpression)
         assert isinstance(rhs, PsExpression)
 
-        match expr.op:
-            case "+=":
-                op = add
-            case "-=":
-                op = sub
-            case "*=":
-                op = mul
-            case "/=":
-                op = truediv
-            case _:
-                raise FreezeError(f"Unsupported augmented assignment: {expr.op}.")
+        # transform augmented assignment to reduction op
+        str_to_reduction_op: dict[str, ReductionOp] = {
+            "+=": ReductionOp.Add,
+            "-=": ReductionOp.Sub,
+            "*=": ReductionOp.Mul,
+            "/=": ReductionOp.Div,
+        }
+
+        # reuse existing handling for transforming reduction ops to expressions
+        return PsAssignment(
+            lhs, reduction_op_to_expr(str_to_reduction_op[expr.op], lhs.clone(), rhs)
+        )
 
-        return PsAssignment(lhs, op(lhs.clone(), rhs))
+    def map_ReductionAssignment(self, expr: ReductionAssignment):
+        assert isinstance(expr.lhs, TypedSymbol)
+
+        rhs = self.visit(expr.rhs)
+
+        assert isinstance(rhs, PsExpression)
+
+        reduction_op = expr.reduction_op
+        lhs_symbol = expr.lhs
+        lhs_dtype = lhs_symbol.dtype
+        lhs_name = lhs_symbol.name
+
+        if not isinstance(lhs_dtype, PsNumericType):
+            raise FreezeError("Reduction symbol must have a numeric data type.")
+
+        # get reduction info from context
+        reduction_info = self._ctx.add_reduction_info(
+            lhs_name, lhs_dtype, reduction_op
+        )
+
+        # create new lhs from newly created local lhs symbol
+        new_lhs = PsSymbolExpr(reduction_info.local_symbol)
+
+        # get new rhs from augmented assignment
+        new_rhs: PsExpression = reduction_op_to_expr(reduction_op, new_lhs, rhs)
+
+        return PsAssignment(new_lhs, new_rhs)
 
     def map_Symbol(self, spsym: sp.Symbol) -> PsSymbolExpr:
         symb = self._ctx.get_symbol(spsym.name)
@@ -563,7 +591,7 @@ class FreezeExpressions:
     def map_Not(self, neg: sympy.logic.Not) -> PsNot:
         arg = self.visit_expr(neg.args[0])
         return PsNot(arg)
-    
+
     def map_bit_conditional(self, conditional: bit_conditional):
         args = [self.visit_expr(arg) for arg in conditional.args]
         bitpos, mask, then_expr = args[:3]
diff --git a/src/pystencils/backend/kernelcreation/typification.py b/src/pystencils/backend/kernelcreation/typification.py
index e7a19bb50a6e11cbcbb45761a1aa35d42c40225d..c966262eaa5026682cfc98cd9a4d6955403668c8 100644
--- a/src/pystencils/backend/kernelcreation/typification.py
+++ b/src/pystencils/backend/kernelcreation/typification.py
@@ -49,7 +49,7 @@ from ..ast.expressions import (
     PsNeg,
     PsNot,
 )
-from ..ast.vector import PsVecBroadcast, PsVecMemAcc
+from ..ast.vector import PsVecBroadcast, PsVecMemAcc, PsVecHorizontal
 from ..functions import PsMathFunction, CFunction, PsConstantFunction
 from ..ast.util import determine_memory_object
 from ..exceptions import TypificationError
@@ -245,7 +245,7 @@ class TypeContext:
                         f"    Expression: {expr}"
                         f"  Type Context: {self._target_type}"
                     )
-                
+
                 case PsCast(cast_target, _) if cast_target is None:
                     expr.target_type = self._target_type
         # endif
@@ -604,6 +604,38 @@ class Typifier:
                 else:
                     tc.apply_dtype(PsBoolType(), expr)
 
+            case PsVecHorizontal():
+                # bin op consisting of a scalar and a vector that is converted to a scalar
+                # -> whole expression should be treated as scalar
+
+                self.visit_expr(expr.scalar_operand, tc)
+
+                vector_op_tc = TypeContext()
+                self.visit_expr(expr.vector_operand, vector_op_tc)
+
+                if tc.target_type is None or vector_op_tc.target_type is None:
+                    raise TypificationError(
+                        f"Unable to determine type of argument to vector horizontal: {expr}"
+                    )
+
+                if not isinstance(tc.target_type, PsScalarType):
+                    raise TypificationError(
+                        f"Illegal type in scalar operand (op1) to vector horizontal: {tc.target_type}"
+                    )
+
+                if not isinstance(vector_op_tc.target_type, PsVectorType):
+                    raise TypificationError(
+                        f"Illegal type in vector operand (op2) to vector horizontal: {vector_op_tc.target_type}"
+                    )
+
+                if vector_op_tc.target_type.scalar_type is not tc.target_type:
+                    raise TypificationError(
+                        f"Scalar type of vector operand {vector_op_tc.target_type} "
+                        f"does not correspond to type of scalar operand {tc.target_type}"
+                    )
+
+                tc.apply_dtype(tc.target_type, expr)
+
             case PsBinOp(op1, op2):
                 self.visit_expr(op1, tc)
                 self.visit_expr(op2, tc)
diff --git a/src/pystencils/backend/platforms/cuda.py b/src/pystencils/backend/platforms/cuda.py
index 4e5c5f275dd7cc931899c5a5f72ad3199522d711..6e6488ee1b461aac3694573835488d88760af361 100644
--- a/src/pystencils/backend/platforms/cuda.py
+++ b/src/pystencils/backend/platforms/cuda.py
@@ -1,11 +1,119 @@
 from __future__ import annotations
 
+import math
+
 from .generic_gpu import GenericGpu
+from ..ast import PsAstNode
+from ..ast.expressions import (
+    PsExpression,
+    PsLiteralExpr,
+    PsCall,
+    PsAnd,
+    PsConstantExpr,
+    PsSymbolExpr,
+)
+from ..ast.structural import (
+    PsConditional,
+    PsStatement,
+    PsAssignment,
+    PsBlock,
+    PsStructuralNode,
+)
+from ..constants import PsConstant
+from ..functions import CFunction
+from ..literals import PsLiteral
+from ..reduction_op_mapping import reduction_op_to_expr
+from ...sympyextensions import ReductionOp
+from ...types import PsIeeeFloatType, PsCustomType, PsPointerType, PsScalarType
+from ...types.quick import SInt, UInt
 
 
 class CudaPlatform(GenericGpu):
-    """Platform for the CUDA GPU taret."""
+    """Platform for the CUDA GPU target."""
 
     @property
     def required_headers(self) -> set[str]:
         return super().required_headers | {'"pystencils_runtime/cuda.cuh"'}
+
+    def resolve_reduction(
+        self,
+        ptr_expr: PsExpression,
+        symbol_expr: PsExpression,
+        reduction_op: ReductionOp,
+    ) -> tuple[tuple[PsStructuralNode, ...], PsAstNode]:
+        stype = symbol_expr.dtype
+        ptrtype = ptr_expr.dtype
+
+        assert isinstance(ptr_expr, PsSymbolExpr) and isinstance(ptrtype, PsPointerType)
+        assert isinstance(symbol_expr, PsSymbolExpr) and isinstance(stype, PsScalarType)
+
+        if not isinstance(stype, PsIeeeFloatType) or stype.width not in (32, 64):
+            NotImplementedError(
+                "atomic operations are only available for float32/64 datatypes"
+            )
+
+        # workaround for subtractions -> use additions for reducing intermediate results
+        # similar to OpenMP reductions: local copies (negative sign) are added at the end
+        match reduction_op:
+            case ReductionOp.Sub:
+                actual_reduction_op = ReductionOp.Add
+            case _:
+                actual_reduction_op = reduction_op
+
+        # check if thread is valid for performing reduction
+        ispace = self._ctx.get_iteration_space()
+        is_valid_thread = self._get_condition_for_translation(ispace)
+
+        cond: PsExpression
+        shuffles: tuple[PsAssignment, ...]
+        if self._warp_size and self._assume_warp_aligned_block_size:
+            # perform local warp reductions
+            def gen_shuffle_instr(offset: int):
+                full_mask = PsLiteralExpr(PsLiteral("0xffffffff", UInt(32)))
+                return PsCall(
+                    CFunction("__shfl_xor_sync", [UInt(32), stype, SInt(32)], stype),
+                    [
+                        full_mask,
+                        symbol_expr,
+                        PsConstantExpr(PsConstant(offset, SInt(32))),
+                    ],
+                )
+
+            # set up shuffle instructions for warp-level reduction
+            num_shuffles = math.frexp(self._warp_size)[1]
+            shuffles = tuple(
+                PsAssignment(
+                    symbol_expr,
+                    reduction_op_to_expr(
+                        actual_reduction_op,
+                        symbol_expr,
+                        gen_shuffle_instr(pow(2, i - 1)),
+                    ),
+                )
+                for i in reversed(range(1, num_shuffles))
+            )
+
+            # find first thread in warp
+            first_thread_in_warp = self._first_thread_in_warp(ispace)
+
+            # set condition to only execute atomic operation on first valid thread in warp
+            cond = (
+                PsAnd(is_valid_thread, first_thread_in_warp)
+                if is_valid_thread
+                else first_thread_in_warp
+            )
+        else:
+            # no optimization: only execute atomic add on valid thread
+            shuffles = ()
+            cond = is_valid_thread
+
+        # use atomic operation
+        func = CFunction(
+            f"atomic{actual_reduction_op.name}", [ptrtype, stype], PsCustomType("void")
+        )
+        func_args = (ptr_expr, symbol_expr)
+
+        # assemble warp reduction
+        return shuffles, PsConditional(
+            cond, PsBlock([PsStatement(PsCall(func, func_args))])
+        )
diff --git a/src/pystencils/backend/platforms/generic_cpu.py b/src/pystencils/backend/platforms/generic_cpu.py
index 20167ffd6c7ae926e26fbb266e1d5bb300f76cc9..cbde419d4eff508915b607df8a699632c1ba4416 100644
--- a/src/pystencils/backend/platforms/generic_cpu.py
+++ b/src/pystencils/backend/platforms/generic_cpu.py
@@ -2,16 +2,20 @@ from abc import ABC, abstractmethod
 from typing import Sequence
 import numpy as np
 
-from pystencils.backend.ast.expressions import PsCall
+from ..ast.expressions import PsCall, PsMemAcc, PsConstantExpr
 
+from ..ast import PsAstNode
 from ..functions import (
     CFunction,
-    PsMathFunction,
     MathFunctions,
+    PsMathFunction,
     PsConstantFunction,
     ConstantFunctions,
+    PsReductionWriteBack,
 )
-from ...types import PsIntegerType, PsIeeeFloatType
+from ..reduction_op_mapping import reduction_op_to_expr
+from ...sympyextensions import ReductionOp
+from ...types import PsIntegerType, PsIeeeFloatType, PsScalarType, PsPointerType
 
 from .platform import Platform
 from ..exceptions import MaterializationError
@@ -24,7 +28,7 @@ from ..kernelcreation.iteration_space import (
 )
 
 from ..constants import PsConstant
-from ..ast.structural import PsDeclaration, PsLoop, PsBlock
+from ..ast.structural import PsDeclaration, PsLoop, PsBlock, PsStructuralNode
 from ..ast.expressions import (
     PsSymbolExpr,
     PsExpression,
@@ -60,11 +64,40 @@ class GenericCpu(Platform):
         else:
             raise MaterializationError(f"Unknown type of iteration space: {ispace}")
 
-    def select_function(self, call: PsCall) -> PsExpression:
-        assert isinstance(call.function, (PsMathFunction | PsConstantFunction))
+    def select_function(
+        self, call: PsCall
+    ) -> PsExpression | tuple[tuple[PsStructuralNode, ...], PsAstNode]:
+        call_func = call.function
+        assert isinstance(call_func, (PsReductionWriteBack | PsMathFunction | PsConstantFunction))
+
+        if isinstance(call_func, PsReductionWriteBack):
+            ptr_expr, symbol_expr = call.args
+            op = call_func.reduction_op
+
+            assert isinstance(ptr_expr, PsSymbolExpr) and isinstance(
+                ptr_expr.dtype, PsPointerType
+            )
+            assert isinstance(symbol_expr, PsSymbolExpr) and isinstance(
+                symbol_expr.dtype, PsScalarType
+            )
+
+            ptr_access = PsMemAcc(
+                ptr_expr, PsConstantExpr(PsConstant(0, self._ctx.index_dtype))
+            )
+
+            # inspired by OpenMP: local reduction variable (negative sign) is added at the end
+            actual_op = ReductionOp.Add if op is ReductionOp.Sub else op
+
+            # create binop and potentially select corresponding function for e.g. min or max
+            potential_call = reduction_op_to_expr(actual_op, ptr_access, symbol_expr)
+            if isinstance(potential_call, PsCall):
+                potential_call.dtype = symbol_expr.dtype
+                return self.select_function(potential_call)
+
+            return potential_call
 
-        func = call.function.func
         dtype = call.get_dtype()
+        func = call_func.func
         arg_types = (dtype,) * call.function.arg_count
 
         expr: PsExpression | None = None
diff --git a/src/pystencils/backend/platforms/generic_gpu.py b/src/pystencils/backend/platforms/generic_gpu.py
index 60019c4cb731c1f4c0f2e6da30e4d64006c54f67..876da4c912c68fed993b7d864e69748c47345277 100644
--- a/src/pystencils/backend/platforms/generic_gpu.py
+++ b/src/pystencils/backend/platforms/generic_gpu.py
@@ -1,14 +1,23 @@
 from __future__ import annotations
+
+import operator
 from abc import ABC, abstractmethod
+from functools import reduce
 import numpy as np
 
-from ...types import constify, deconstify, PsIntegerType
+from ..ast import PsAstNode
+from ...sympyextensions.reduction import ReductionOp
+from ...types import (
+    constify,
+    deconstify,
+    PsIntegerType,
+)
+from ...types.quick import SInt
 from ..exceptions import MaterializationError
 from .platform import Platform
 
 from ..memory import PsSymbol
 from ..kernelcreation import (
-    KernelCreationContext,
     Typifier,
     IterationSpace,
     FullIterationSpace,
@@ -17,7 +26,13 @@ from ..kernelcreation import (
 )
 
 from ..constants import PsConstant
-from ..ast.structural import PsBlock, PsConditional, PsDeclaration
+from ..kernelcreation.context import KernelCreationContext
+from ..ast.structural import (
+    PsBlock,
+    PsConditional,
+    PsDeclaration,
+    PsStructuralNode,
+)
 from ..ast.expressions import (
     PsExpression,
     PsLiteralExpr,
@@ -25,19 +40,24 @@ from ..ast.expressions import (
     PsCall,
     PsLookup,
     PsBufferAcc,
+    PsConstantExpr,
+    PsAdd,
+    PsRem,
+    PsEq,
 )
 from ..ast.expressions import PsLt, PsAnd
 from ...types import PsSignedIntegerType, PsIeeeFloatType
 from ..literals import PsLiteral
+
 from ..functions import (
-    PsMathFunction,
     MathFunctions,
     CFunction,
+    PsReductionWriteBack,
+    PsMathFunction,
     PsConstantFunction,
     ConstantFunctions,
 )
 
-
 int32 = PsSignedIntegerType(width=32, const=False)
 
 BLOCK_IDX = [
@@ -179,23 +199,38 @@ class GenericGpu(Platform):
         thread_mapping: Callback object which defines the mapping of thread indices onto iteration space points
     """
 
+    @property
+    @abstractmethod
+    def required_headers(self) -> set[str]:
+        return set()
+
+    @abstractmethod
+    def resolve_reduction(
+        self,
+        ptr_expr: PsExpression,
+        symbol_expr: PsExpression,
+        reduction_op: ReductionOp,
+    ) -> tuple[tuple[PsStructuralNode, ...], PsAstNode]:
+        pass
+
     def __init__(
         self,
         ctx: KernelCreationContext,
+        assume_warp_aligned_block_size: bool,
+        warp_size: int | None,
         thread_mapping: ThreadMapping | None = None,
     ) -> None:
         super().__init__(ctx)
 
+        self._assume_warp_aligned_block_size = assume_warp_aligned_block_size
+        self._warp_size = warp_size
+
         self._thread_mapping = (
             thread_mapping if thread_mapping is not None else Linear3DMapping()
         )
 
         self._typify = Typifier(ctx)
 
-    @property
-    def required_headers(self) -> set[str]:
-        return set()
-
     def materialize_iteration_space(
         self, body: PsBlock, ispace: IterationSpace
     ) -> PsBlock:
@@ -206,11 +241,70 @@ class GenericGpu(Platform):
         else:
             raise MaterializationError(f"Unknown type of iteration space: {ispace}")
 
-    def select_function(self, call: PsCall) -> PsExpression:
-        assert isinstance(call.function, (PsMathFunction | PsConstantFunction))
+    @staticmethod
+    def _get_condition_for_translation(ispace: IterationSpace):
+
+        if isinstance(ispace, FullIterationSpace):
+            conds = []
+
+            dimensions = ispace.dimensions_in_loop_order()
+
+            for dim in dimensions:
+                ctr_expr = PsExpression.make(dim.counter)
+                conds.append(PsLt(ctr_expr, dim.stop))
+
+            condition: PsExpression = conds[0]
+            for cond in conds[1:]:
+                condition = PsAnd(condition, cond)
+
+            return condition
+        elif isinstance(ispace, SparseIterationSpace):
+            sparse_ctr_expr = PsExpression.make(ispace.sparse_counter)
+            stop = PsExpression.make(ispace.index_list.shape[0])
+
+            return PsLt(sparse_ctr_expr.clone(), stop)
+        else:
+            raise MaterializationError(f"Unknown type of iteration space: {ispace}")
+
+    @staticmethod
+    def _thread_index_per_dim(ispace: IterationSpace) -> tuple[PsExpression, ...]:
+        """Returns thread indices multiplied with block dimension strides per dimension."""
+
+        return tuple(
+            idx
+            * PsConstantExpr(
+                PsConstant(reduce(operator.mul, BLOCK_DIM[:i], 1), SInt(32))
+            )
+            for i, idx in enumerate(THREAD_IDX[: ispace.rank])
+        )
+
+    def _first_thread_in_warp(self, ispace: IterationSpace) -> PsExpression:
+        """Returns expression that determines whether a thread is the first within a warp."""
+
+        tids_per_dim = GenericGpu._thread_index_per_dim(ispace)
+        tid: PsExpression = tids_per_dim[0]
+        for t in tids_per_dim[1:]:
+            tid = PsAdd(tid, t)
+
+        return PsEq(
+            PsRem(tid, PsConstantExpr(PsConstant(self._warp_size, SInt(32)))),
+            PsConstantExpr(PsConstant(0, SInt(32))),
+        )
+
+    def select_function(
+        self, call: PsCall
+    ) -> PsExpression | tuple[tuple[PsStructuralNode, ...], PsAstNode]:
+        call_func = call.function
+        assert isinstance(call_func, (PsReductionWriteBack | PsMathFunction | PsConstantFunction))
+
+        if isinstance(call_func, PsReductionWriteBack):
+            ptr_expr, symbol_expr = call.args
+            op = call_func.reduction_op
+
+            return self.resolve_reduction(ptr_expr, symbol_expr, op)
 
-        func = call.function.func
         dtype = call.get_dtype()
+        func = call_func.func
         arg_types = (dtype,) * call.function.arg_count
         expr: PsExpression | None = None
 
@@ -273,7 +367,7 @@ class GenericGpu(Platform):
 
                 case ConstantFunctions.PosInfinity:
                     expr = PsExpression.make(PsLiteral(f"PS_FP{dtype.width}_INFINITY", dtype))
-                    
+
                 case ConstantFunctions.NegInfinity:
                     expr = PsExpression.make(PsLiteral(f"PS_FP{dtype.width}_NEG_INFINITY", dtype))
 
@@ -285,7 +379,7 @@ class GenericGpu(Platform):
         if isinstance(dtype, PsIntegerType):
             expr = self._select_integer_function(call)
 
-        if expr is not None:    
+        if expr is not None:
             if expr.dtype is None:
                 typify = Typifier(self._ctx)
                 typify(expr)
@@ -304,7 +398,6 @@ class GenericGpu(Platform):
         ctr_mapping = self._thread_mapping(ispace)
 
         indexing_decls = []
-        conds = []
 
         dimensions = ispace.dimensions_in_loop_order()
 
@@ -318,14 +411,9 @@ class GenericGpu(Platform):
             indexing_decls.append(
                 self._typify(PsDeclaration(ctr_expr, ctr_mapping[dim.counter]))
             )
-            conds.append(PsLt(ctr_expr, dim.stop))
 
-        condition: PsExpression = conds[0]
-        for cond in conds[1:]:
-            condition = PsAnd(condition, cond)
-        ast = PsBlock(indexing_decls + [PsConditional(condition, body)])
-
-        return ast
+        cond = self._get_condition_for_translation(ispace)
+        return PsBlock(indexing_decls + [PsConditional(cond, body)])
 
     def _prepend_sparse_translation(
         self, body: PsBlock, ispace: SparseIterationSpace
@@ -355,8 +443,5 @@ class GenericGpu(Platform):
         ]
         body.statements = mappings + body.statements
 
-        stop = PsExpression.make(ispace.index_list.shape[0])
-        condition = PsLt(sparse_ctr_expr.clone(), stop)
-        ast = PsBlock([sparse_idx_decl, PsConditional(condition, body)])
-
-        return ast
+        cond = self._get_condition_for_translation(ispace)
+        return PsBlock([sparse_idx_decl, PsConditional(cond, body)])
diff --git a/src/pystencils/backend/platforms/hip.py b/src/pystencils/backend/platforms/hip.py
index 6c2c7979813843f8f0fd8a8538decde64bd53c87..b712ac69951639957f2d946549d4dafc53700c99 100644
--- a/src/pystencils/backend/platforms/hip.py
+++ b/src/pystencils/backend/platforms/hip.py
@@ -1,11 +1,25 @@
 from __future__ import annotations
 
 from .generic_gpu import GenericGpu
+from ..ast import PsAstNode
+from ..ast.expressions import PsExpression
+from ..ast.structural import PsStructuralNode
+from ..exceptions import MaterializationError
+from ...sympyextensions import ReductionOp
 
 
 class HipPlatform(GenericGpu):
-    """Platform for the HIP GPU taret."""
+    """Platform for the HIP GPU target."""
 
     @property
     def required_headers(self) -> set[str]:
         return super().required_headers | {'"pystencils_runtime/hip.h"'}
+
+    def resolve_reduction(
+        self,
+        ptr_expr: PsExpression,
+        symbol_expr: PsExpression,
+        reduction_op: ReductionOp,
+    ) -> tuple[tuple[PsStructuralNode, ...], PsAstNode]:
+
+        raise MaterializationError("Reductions are yet not supported in HIP backend.")
diff --git a/src/pystencils/backend/platforms/platform.py b/src/pystencils/backend/platforms/platform.py
index a3e0b63289a2d4f7a459b8a6ad90f505ed99eb33..f66b15741407ccbc6d14ee168d1ea0bc92334b39 100644
--- a/src/pystencils/backend/platforms/platform.py
+++ b/src/pystencils/backend/platforms/platform.py
@@ -1,7 +1,8 @@
 from abc import ABC, abstractmethod
 
+from ..ast import PsAstNode
 from ...types import PsIntegerType
-from ..ast.structural import PsBlock
+from ..ast.structural import PsBlock, PsStructuralNode
 from ..ast.expressions import PsCall, PsExpression, PsTernary, PsGe, PsLe
 from ..functions import PsMathFunction, MathFunctions
 from ..constants import PsConstant
@@ -38,7 +39,9 @@ class Platform(ABC):
         pass
 
     @abstractmethod
-    def select_function(self, call: PsCall) -> PsExpression:
+    def select_function(
+        self, call: PsCall
+    ) -> PsExpression | tuple[tuple[PsStructuralNode, ...], PsAstNode]:
         """Select an implementation for the given function on the given data type.
 
         If no viable implementation exists, raise a `MaterializationError`.
diff --git a/src/pystencils/backend/platforms/sycl.py b/src/pystencils/backend/platforms/sycl.py
index 4f018a21b034cdd40ef0fb5cfff0761b72fe2919..dd043f06be79a645cc2d80ec37233f659b846273 100644
--- a/src/pystencils/backend/platforms/sycl.py
+++ b/src/pystencils/backend/platforms/sycl.py
@@ -1,12 +1,13 @@
 from __future__ import annotations
 
+from ..ast import PsAstNode
 from ..functions import CFunction, PsMathFunction, MathFunctions
 from ..kernelcreation.iteration_space import (
     IterationSpace,
     FullIterationSpace,
     SparseIterationSpace,
 )
-from ..ast.structural import PsDeclaration, PsBlock, PsConditional
+from ..ast.structural import PsDeclaration, PsBlock, PsConditional, PsStructuralNode
 from ..ast.expressions import (
     PsExpression,
     PsSymbolExpr,
@@ -52,7 +53,9 @@ class SyclPlatform(Platform):
         else:
             raise MaterializationError(f"Unknown type of iteration space: {ispace}")
 
-    def select_function(self, call: PsCall) -> PsExpression:
+    def select_function(
+        self, call: PsCall
+    ) -> PsExpression | tuple[tuple[PsStructuralNode, ...], PsAstNode]:
         assert isinstance(call.function, PsMathFunction)
 
         func = call.function.func
diff --git a/src/pystencils/backend/platforms/x86.py b/src/pystencils/backend/platforms/x86.py
index 3f0285f0c9d75e1df00d1b662ade818a8b50a9cc..16dbc14c4c077cd16a566ca1f90c42fdafb358d8 100644
--- a/src/pystencils/backend/platforms/x86.py
+++ b/src/pystencils/backend/platforms/x86.py
@@ -1,5 +1,5 @@
 from __future__ import annotations
-from typing import Sequence
+from typing import Sequence, Tuple
 from enum import Enum
 from functools import cache
 
@@ -17,8 +17,9 @@ from ..ast.expressions import (
     PsCast,
     PsCall,
 )
-from ..ast.vector import PsVecMemAcc, PsVecBroadcast
-from ...types import PsCustomType, PsVectorType, PsPointerType
+from ..ast.vector import PsVecMemAcc, PsVecBroadcast, PsVecHorizontal
+from ...sympyextensions import ReductionOp
+from ...types import PsCustomType, PsVectorType, PsPointerType, PsType
 from ..constants import PsConstant
 
 from ..exceptions import MaterializationError
@@ -132,6 +133,8 @@ class X86VectorCpu(GenericVectorCpu):
         else:
             headers = {"<immintrin.h>"}
 
+        headers.update({'"pystencils_runtime/simd_horizontal_helpers.h"'})
+
         return super().required_headers | headers
 
     def type_intrinsic(self, vector_type: PsVectorType) -> PsCustomType:
@@ -160,7 +163,14 @@ class X86VectorCpu(GenericVectorCpu):
     ) -> PsExpression:
         match expr:
             case PsUnOp() | PsBinOp():
-                func = _x86_op_intrin(self._vector_arch, expr, expr.get_dtype())
+                vtype: PsType
+                if isinstance(expr, PsVecHorizontal):
+                    # return type of expression itself is scalar, but input argument to intrinsic is a vector
+                    vtype = expr.vector_operand.get_dtype()
+                else:
+                    vtype = expr.get_dtype()
+
+                func = _x86_op_intrin(self._vector_arch, expr, vtype)
                 intrinsic = func(*operands)
                 intrinsic.dtype = func.return_type
                 return intrinsic
@@ -339,6 +349,7 @@ def _x86_op_intrin(
     prefix = varch.intrin_prefix(vtype)
     suffix = varch.intrin_suffix(vtype)
     rtype = atype = varch.intrin_type(vtype)
+    atypes: Tuple[PsType, ...] = ()
 
     match op:
         case PsVecBroadcast():
@@ -346,6 +357,16 @@ def _x86_op_intrin(
             if vtype.scalar_type == SInt(64) and vtype.vector_entries <= 4:
                 suffix += "x"
             atype = vtype.scalar_type
+        case PsVecHorizontal():
+            # horizontal add instead of sub avoids double inversion of sign
+            actual_op = (
+                ReductionOp.Add
+                if op.reduction_op == ReductionOp.Sub
+                else op.reduction_op
+            )
+            opstr = f"horizontal_{actual_op.name.lower()}"
+            rtype = vtype.scalar_type
+            atypes = (vtype.scalar_type, vtype)
         case PsAdd():
             opstr = "add"
         case PsSub():
@@ -392,7 +413,9 @@ def _x86_op_intrin(
                 case (SInt(64), Fp()) | (
                     Fp(),
                     SInt(64),
-                ) if varch < X86VectorArch.AVX512:
+                ) if (
+                    varch < X86VectorArch.AVX512
+                ):
                     panic()
                 # AVX512 only: cvtepiA_epiT if A > T
                 case (SInt(a), SInt(t)) if a > t and varch < X86VectorArch.AVX512:
@@ -408,4 +431,7 @@ def _x86_op_intrin(
             )
 
     num_args = 1 if isinstance(op, PsUnOp) else 2
-    return CFunction(f"{prefix}_{opstr}_{suffix}", (atype,) * num_args, rtype)
+    if not atypes:
+        atypes = (atype,) * num_args
+
+    return CFunction(f"{prefix}_{opstr}_{suffix}", atypes, rtype)
diff --git a/src/pystencils/backend/reduction_op_mapping.py b/src/pystencils/backend/reduction_op_mapping.py
new file mode 100644
index 0000000000000000000000000000000000000000..a97a496e04e78f4a76be12cae679027a7da56f86
--- /dev/null
+++ b/src/pystencils/backend/reduction_op_mapping.py
@@ -0,0 +1,24 @@
+from .ast.expressions import PsExpression, PsCall, PsAdd, PsSub, PsMul, PsDiv
+from .exceptions import PsInternalCompilerError
+from .functions import PsMathFunction, MathFunctions
+from ..sympyextensions.reduction import ReductionOp
+
+
+def reduction_op_to_expr(op: ReductionOp, op1, op2) -> PsExpression:
+    match op:
+        case ReductionOp.Add:
+            return PsAdd(op1, op2)
+        case ReductionOp.Sub:
+            return PsSub(op1, op2)
+        case ReductionOp.Mul:
+            return PsMul(op1, op2)
+        case ReductionOp.Div:
+            return PsDiv(op1, op2)
+        case ReductionOp.Min:
+            return PsCall(PsMathFunction(MathFunctions.Min), [op1, op2])
+        case ReductionOp.Max:
+            return PsCall(PsMathFunction(MathFunctions.Max), [op1, op2])
+        case _:
+            raise PsInternalCompilerError(
+                f"Found unsupported operation type for reduction assignments: {op}."
+            )
diff --git a/src/pystencils/backend/transformations/add_pragmas.py b/src/pystencils/backend/transformations/add_pragmas.py
index bd782422f1fa80b96ec7cf69473fda2b1f45c3d6..434903da013649c48bb293964f5ad308f2d7d2c7 100644
--- a/src/pystencils/backend/transformations/add_pragmas.py
+++ b/src/pystencils/backend/transformations/add_pragmas.py
@@ -8,7 +8,9 @@ from ..kernelcreation import KernelCreationContext
 from ..ast import PsAstNode
 from ..ast.structural import PsBlock, PsLoop, PsPragma, PsStructuralNode
 from ..ast.expressions import PsExpression
+from ..kernelcreation.context import ReductionInfo
 
+from ...types import PsScalarType
 
 __all__ = ["InsertPragmasAtLoops", "LoopPragma", "AddOpenMP"]
 
@@ -103,12 +105,16 @@ class AddOpenMP:
     def __init__(
         self,
         ctx: KernelCreationContext,
+        reductions: Sequence[ReductionInfo] = (),
         nesting_depth: int = 0,
         num_threads: int | None = None,
         schedule: str | None = None,
         collapse: int | None = None,
         omit_parallel: bool = False,
     ) -> None:
+
+        self._reductions = reductions
+
         pragma_text = "omp"
 
         if not omit_parallel:
@@ -122,6 +128,16 @@ class AddOpenMP:
         if num_threads is not None:
             pragma_text += f" num_threads({str(num_threads)})"
 
+        for reduction_info in self._reductions:
+            if isinstance(reduction_info.local_symbol.dtype, PsScalarType):
+                pragma_text += (
+                    f" reduction({reduction_info.op.value}: {reduction_info.local_symbol.name})"
+                )
+            else:
+                NotImplementedError(
+                    "OMP: Reductions for non-scalar data types are not supported yet."
+                )
+
         if collapse is not None:
             if collapse <= 0:
                 raise ValueError(
diff --git a/src/pystencils/backend/transformations/loop_vectorizer.py b/src/pystencils/backend/transformations/loop_vectorizer.py
index e1e4fea502c08de86e13de5e3c251f1b7a7d0ee6..b029b95a74bf014e47b5214435cb7ce82be1b623 100644
--- a/src/pystencils/backend/transformations/loop_vectorizer.py
+++ b/src/pystencils/backend/transformations/loop_vectorizer.py
@@ -1,15 +1,22 @@
 import numpy as np
 from enum import Enum, auto
-from typing import cast, Callable, overload
+from typing import cast, Callable, overload, Sequence
 
+from ..kernelcreation.context import ReductionInfo
 from ...types import PsVectorType, PsScalarType
 
 from ..kernelcreation import KernelCreationContext
 from ..constants import PsConstant
 from ..ast import PsAstNode
-from ..ast.structural import PsLoop, PsBlock, PsDeclaration
-from ..ast.expressions import PsExpression, PsTernary, PsGt
-from ..ast.vector import PsVecBroadcast
+from ..ast.structural import (
+    PsLoop,
+    PsBlock,
+    PsDeclaration,
+    PsAssignment,
+    PsStructuralNode,
+)
+from ..ast.expressions import PsExpression, PsTernary, PsGt, PsSymbolExpr
+from ..ast.vector import PsVecBroadcast, PsVecHorizontal
 from ..ast.analysis import collect_undefined_symbols
 
 from .ast_vectorizer import VectorizationAxis, VectorizationContext, AstVectorizer
@@ -48,10 +55,12 @@ class LoopVectorizer:
         self,
         ctx: KernelCreationContext,
         lanes: int,
+        reductions: Sequence[ReductionInfo] = (),
         trailing_iters: TrailingItersTreatment = TrailingItersTreatment.SCALAR_LOOP,
     ):
         self._ctx = ctx
         self._lanes = lanes
+        self._reductions = reductions
         self._trailing_iters = trailing_iters
 
         from ..kernelcreation import Typifier
@@ -134,6 +143,45 @@ class LoopVectorizer:
         #   Prepare vectorization context
         vc = VectorizationContext(self._ctx, self._lanes, axis)
 
+        #   Prepare reductions found in loop body
+        simd_init_local_reduction_vars: list[PsStructuralNode] = []
+        simd_writeback_local_reduction_vars: list[PsStructuralNode] = []
+        for stmt in loop.body.statements:
+            for reduction_info in self._reductions:
+                match stmt:
+                    case PsAssignment(lhs, _) if (
+                        isinstance(lhs, PsSymbolExpr)
+                        and lhs.symbol is reduction_info.local_symbol
+                    ):
+                        # Vectorize symbol for local copy
+                        local_symbol = lhs.symbol
+                        vector_symb = vc.vectorize_symbol(local_symbol)
+
+                        # Declare and init vector
+                        simd_init_local_reduction_vars += [
+                            self._type_fold(
+                                PsDeclaration(
+                                    PsSymbolExpr(vector_symb),
+                                    PsVecBroadcast(
+                                        self._lanes,
+                                        reduction_info.init_val.clone(),
+                                    ),
+                                )
+                            )
+                        ]
+
+                        # Write back vectorization result
+                        simd_writeback_local_reduction_vars += [
+                            PsAssignment(
+                                PsSymbolExpr(local_symbol),
+                                PsVecHorizontal(
+                                    PsSymbolExpr(local_symbol),
+                                    PsSymbolExpr(vector_symb),
+                                    reduction_info.op,
+                                ),
+                            )
+                        ]
+
         #   Generate vectorized loop body
         simd_body = self._vectorize_ast(loop.body, vc)
 
@@ -224,10 +272,10 @@ class LoopVectorizer:
                 )
 
                 return PsBlock(
-                    [
-                        simd_stop_decl,
-                        simd_step_decl,
-                        simd_loop,
+                    simd_init_local_reduction_vars
+                    + [simd_stop_decl, simd_step_decl, simd_loop]
+                    + simd_writeback_local_reduction_vars
+                    + [
                         trailing_start_decl,
                         trailing_loop,
                     ]
@@ -238,11 +286,13 @@ class LoopVectorizer:
 
             case LoopVectorizer.TrailingItersTreatment.NONE:
                 return PsBlock(
-                    [
+                    simd_init_local_reduction_vars
+                    + [
                         simd_stop_decl,
                         simd_step_decl,
                         simd_loop,
                     ]
+                    + simd_writeback_local_reduction_vars
                 )
 
     @overload
diff --git a/src/pystencils/backend/transformations/select_functions.py b/src/pystencils/backend/transformations/select_functions.py
index 1580f9d39eb8cbba0e7ace925a0631d34f5f1137..5953bd47db53e716fffef9b2cd1be1d2897fcf52 100644
--- a/src/pystencils/backend/transformations/select_functions.py
+++ b/src/pystencils/backend/transformations/select_functions.py
@@ -1,7 +1,9 @@
+from ..ast.structural import PsAssignment, PsBlock, PsStructuralNode
+from ..exceptions import MaterializationError
 from ..platforms import Platform
 from ..ast import PsAstNode
-from ..ast.expressions import PsCall
-from ..functions import PsMathFunction, PsConstantFunction
+from ..ast.expressions import PsCall, PsExpression
+from ..functions import PsMathFunction, PsConstantFunction, PsReductionWriteBack
 
 
 class SelectFunctions:
@@ -17,9 +19,41 @@ class SelectFunctions:
     def visit(self, node: PsAstNode) -> PsAstNode:
         node.children = [self.visit(c) for c in node.children]
 
-        if isinstance(node, PsCall) and isinstance(
-            node.function, (PsMathFunction | PsConstantFunction)
+        if isinstance(node, PsAssignment):
+            rhs = node.rhs
+            if isinstance(rhs, PsCall) and isinstance(
+                rhs.function, PsReductionWriteBack
+            ):
+                resolved_func = self._platform.select_function(rhs)
+
+                match resolved_func:
+                    case (prepend, new_rhs):
+                        assert isinstance(prepend, tuple)
+
+                        match new_rhs:
+                            case PsExpression():
+                                return PsBlock(
+                                    prepend + (PsAssignment(node.lhs, new_rhs),)
+                                )
+                            case PsStructuralNode():
+                                # special case: produces structural with atomic operation writing value back to ptr
+                                return PsBlock(prepend + (new_rhs,))
+                            case _:
+                                assert False, "Unexpected output from SelectFunctions."
+                    case PsExpression():
+                        return PsAssignment(node.lhs, resolved_func)
+                    case _:
+                        raise MaterializationError(
+                            f"Wrong return type for resolved function {rhs.function.name} in SelectFunctions."
+                        )
+            else:
+                return node
+        elif isinstance(node, PsCall) and isinstance(
+                node.function, (PsMathFunction | PsConstantFunction)
         ):
-            return self._platform.select_function(node)
+            resolved_func = self._platform.select_function(node)
+            assert isinstance(resolved_func, PsExpression)
+
+            return resolved_func
         else:
             return node
diff --git a/src/pystencils/backend/transformations/select_intrinsics.py b/src/pystencils/backend/transformations/select_intrinsics.py
index 060192810a7ccb9ab9ed13f64dd7948791078ea4..b20614393b439a8e5e70d682362dbec6769df9c2 100644
--- a/src/pystencils/backend/transformations/select_intrinsics.py
+++ b/src/pystencils/backend/transformations/select_intrinsics.py
@@ -7,7 +7,7 @@ from ..ast.structural import PsAstNode, PsDeclaration, PsAssignment, PsStatement
 from ..ast.expressions import PsExpression, PsCall, PsCast, PsLiteral
 from ...types import PsCustomType, PsVectorType, constify, deconstify
 from ..ast.expressions import PsSymbolExpr, PsConstantExpr, PsUnOp, PsBinOp
-from ..ast.vector import PsVecMemAcc
+from ..ast.vector import PsVecMemAcc, PsVecHorizontal
 from ..exceptions import MaterializationError
 from ..functions import CFunction, PsMathFunction
 
@@ -86,6 +86,10 @@ class SelectIntrinsics:
                 new_rhs = self.visit_expr(rhs, sc)
                 return PsStatement(self._platform.vector_store(lhs, new_rhs))
 
+            case PsAssignment(lhs, rhs) if isinstance(rhs, PsVecHorizontal):
+                new_rhs = self.visit_expr(rhs, sc)
+                return PsAssignment(lhs, new_rhs)
+
             case _:
                 node.children = [self.visit(c, sc) for c in node.children]
 
@@ -93,7 +97,15 @@ class SelectIntrinsics:
 
     def visit_expr(self, expr: PsExpression, sc: SelectionContext) -> PsExpression:
         if not isinstance(expr.dtype, PsVectorType):
-            return expr
+            # special case: result type of horizontal reduction is scalar
+            if isinstance(expr, PsVecHorizontal):
+                scalar_op = expr.scalar_operand
+                vector_op_to_scalar = self.visit_expr(expr.vector_operand, sc)
+                return self._platform.op_intrinsic(
+                    expr, [scalar_op, vector_op_to_scalar]
+                )
+            else:
+                return expr
 
         match expr:
             case PsSymbolExpr(symb):
diff --git a/src/pystencils/codegen/driver.py b/src/pystencils/codegen/driver.py
index e9fc69b76b3d88024f9cbde880617d6a3a3696ff..d9531b33c1fd6e09d27bf9ab4ba5493c5eb1b17a 100644
--- a/src/pystencils/codegen/driver.py
+++ b/src/pystencils/codegen/driver.py
@@ -19,14 +19,22 @@ from .properties import PsSymbolProperty, FieldBasePtr
 from .parameters import Parameter
 from .functions import Lambda
 from .gpu_indexing import GpuIndexing, GpuLaunchConfiguration
+from ..backend.kernelcreation.context import ReductionInfo
 
 from ..field import Field
 from ..types import PsIntegerType, PsScalarType
 
 from ..backend.memory import PsSymbol
 from ..backend.ast import PsAstNode
-from ..backend.ast.structural import PsBlock, PsLoop
-from ..backend.ast.expressions import PsExpression
+from ..backend.functions import PsReductionWriteBack
+from ..backend.ast.expressions import (
+    PsExpression,
+    PsSymbolExpr,
+    PsCall,
+    PsMemAcc,
+    PsConstantExpr,
+)
+from ..backend.ast.structural import PsBlock, PsLoop, PsDeclaration, PsAssignment, PsStructuralNode
 from ..backend.ast.analysis import collect_undefined_symbols, collect_required_headers
 from ..backend.kernelcreation import (
     KernelCreationContext,
@@ -183,6 +191,10 @@ class DefaultKernelCreationDriver:
         if self._intermediates is not None:
             self._intermediates.constants_eliminated = kernel_ast.clone()
 
+        #   Extensions for reductions
+        for _, reduction_info in self._ctx.reduction_data.items():
+            self._modify_kernel_ast_for_reductions(reduction_info, kernel_ast)
+
         #   Target-Specific optimizations
         if self._target.is_cpu():
             kernel_ast = self._transform_for_cpu(kernel_ast)
@@ -283,6 +295,31 @@ class DefaultKernelCreationDriver:
 
         return kernel_body
 
+    def _modify_kernel_ast_for_reductions(self,
+                                          reduction_info: ReductionInfo,
+                                          kernel_ast: PsBlock):
+        # typify local symbol and write-back pointer expressions and initial value
+        typify = Typifier(self._ctx)
+        symbol_expr = typify(PsSymbolExpr(reduction_info.local_symbol))
+        ptr_symbol_expr = typify(PsSymbolExpr(reduction_info.writeback_ptr_symbol))
+        init_val = typify(reduction_info.init_val)
+
+        ptr_access = PsMemAcc(
+            ptr_symbol_expr, PsConstantExpr(PsConstant(0, self._ctx.index_dtype))
+        )
+        write_back_ptr = PsCall(
+            PsReductionWriteBack(reduction_info.op),
+            [ptr_symbol_expr, symbol_expr],
+        )
+
+        # declare and init local copy with neutral element
+        prepend_ast: list[PsStructuralNode] = [PsDeclaration(symbol_expr, init_val)]
+        # write back result to reduction target variable
+        append_ast: list[PsStructuralNode] = [PsAssignment(ptr_access, write_back_ptr)]
+
+        # modify AST
+        kernel_ast.statements = prepend_ast + kernel_ast.statements + append_ast
+
     def _transform_for_cpu(self, kernel_ast: PsBlock) -> PsBlock:
         canonicalize = CanonicalizeSymbols(self._ctx, True)
         kernel_ast = cast(PsBlock, canonicalize(kernel_ast))
@@ -321,6 +358,7 @@ class DefaultKernelCreationDriver:
 
             add_omp = AddOpenMP(
                 self._ctx,
+                reductions=list(self._ctx.reduction_data.values()),
                 nesting_depth=omp_options.get_option("nesting_depth"),
                 num_threads=omp_options.get_option("num_threads"),
                 schedule=omp_options.get_option("schedule"),
@@ -381,7 +419,8 @@ class DefaultKernelCreationDriver:
                 self._target, cast(PsScalarType, self._ctx.default_dtype)
             )
 
-        vectorizer = LoopVectorizer(self._ctx, num_lanes)
+        vectorizer = LoopVectorizer(self._ctx, num_lanes,
+                                    list(self._ctx.reduction_data.values()))
 
         def loop_predicate(loop: PsLoop):
             return loop.counter.symbol == inner_loop_dim.counter
@@ -405,14 +444,18 @@ class DefaultKernelCreationDriver:
 
         idx_scheme: GpuIndexingScheme = self._cfg.gpu.get_option("indexing_scheme")
         manual_launch_grid: bool = self._cfg.gpu.get_option("manual_launch_grid")
-        assume_warp_aligned_block_size: bool = self._cfg.gpu.get_option("assume_warp_aligned_block_size")
+        assume_warp_aligned_block_size: bool = self._cfg.gpu.get_option(
+            "assume_warp_aligned_block_size"
+        )
         warp_size: int | None = self._cfg.gpu.get_option("warp_size")
 
         if warp_size is None:
             warp_size = GpuOptions.default_warp_size(self._target)
 
         if warp_size is None and assume_warp_aligned_block_size:
-            warn("GPU warp size is unknown - ignoring assumption `assume_warp_aligned_block_size`.")
+            warn(
+                "GPU warp size is unknown - ignoring assumption `assume_warp_aligned_block_size`."
+            )
 
         return GpuIndexing(
             self._ctx,
@@ -457,6 +500,11 @@ class DefaultKernelCreationDriver:
                 else None
             )
 
+            assume_warp_aligned_block_size: bool = self._cfg.gpu.get_option(
+                "assume_warp_aligned_block_size"
+            )
+            warp_size: int | None = self._cfg.gpu.get_option("warp_size")
+
             GpuPlatform: type
             match self._target:
                 case Target.CUDA:
@@ -468,6 +516,8 @@ class DefaultKernelCreationDriver:
 
             return GpuPlatform(
                 self._ctx,
+                assume_warp_aligned_block_size,
+                warp_size,
                 thread_mapping=thread_mapping,
             )
 
diff --git a/src/pystencils/codegen/gpu_indexing.py b/src/pystencils/codegen/gpu_indexing.py
index 43b612bd77f535ef45c09c38489784dd12a53ce0..f5606da0256039ba187d62583f6838e825002f63 100644
--- a/src/pystencils/codegen/gpu_indexing.py
+++ b/src/pystencils/codegen/gpu_indexing.py
@@ -260,6 +260,7 @@ class DynamicBlockSizeLaunchConfiguration(GpuLaunchConfiguration):
 
     def __init__(
         self,
+        rank: int,
         num_work_items: _Dim3Lambda,
         hw_props: HardwareProperties,
         assume_warp_aligned_block_size: bool,
@@ -270,7 +271,7 @@ class DynamicBlockSizeLaunchConfiguration(GpuLaunchConfiguration):
 
         self._assume_warp_aligned_block_size = assume_warp_aligned_block_size
 
-        default_bs = GpuLaunchConfiguration.get_default_block_size(len(num_work_items))
+        default_bs = GpuLaunchConfiguration.get_default_block_size(rank)
         self._default_block_size = default_bs
         self._init_block_size: dim3 = default_bs
         self._compute_block_size: (
@@ -598,6 +599,7 @@ class GpuIndexing:
 
         def factory():
             return DynamicBlockSizeLaunchConfiguration(
+                rank,
                 num_work_items,
                 self._hw_props,
                 self._assume_warp_aligned_block_size,
diff --git a/src/pystencils/include/pystencils_runtime/bits/gpu_atomics.h b/src/pystencils/include/pystencils_runtime/bits/gpu_atomics.h
new file mode 100644
index 0000000000000000000000000000000000000000..6de5c3321e1cff84e00cdf7b3c551638ecb2a99e
--- /dev/null
+++ b/src/pystencils/include/pystencils_runtime/bits/gpu_atomics.h
@@ -0,0 +1,90 @@
+#pragma once
+
+// No direct implementation for all atomic operations available
+// -> add support by custom implementations using a CAS mechanism
+
+#if defined(__CUDA_ARCH__) || defined(__HIPCC_RTC__)
+
+// - atomicMul (double/float)
+//   see https://stackoverflow.com/questions/43354798/atomic-multiplication-and-division
+__device__ double atomicMul(double* address, double val) {
+    unsigned long long int* address_as_ull = (unsigned long long int*)address;
+    unsigned long long int oldValue = *address_as_ull, assumed;
+    do {
+      assumed = oldValue;
+      oldValue = atomicCAS(address_as_ull, assumed, __double_as_longlong(val *
+                           __longlong_as_double(assumed)));
+    } while (assumed != oldValue);
+
+    return __longlong_as_double(oldValue);
+}
+
+__device__ float atomicMul(float* address, float val) {
+    int* address_as_int = (int*)address;
+    int old = *address_as_int;
+    int assumed;
+    do {
+        assumed = old;
+        old = atomicCAS(address_as_int, assumed, __float_as_int(val * __int_as_float(assumed)));
+    } while (assumed != old);
+
+    return __int_as_float(old);
+}
+
+#endif
+
+#ifdef __CUDA_ARCH__
+
+// - atomicMin (double/float)
+//   see https://stackoverflow.com/questions/17399119/how-do-i-use-atomicmax-on-floating-point-values-in-cuda
+__device__ __forceinline__ double atomicMin(double *address, double val)
+{
+    unsigned long long ret = __double_as_longlong(*address);
+    while(val < __longlong_as_double(ret))
+    {
+        unsigned long long old = ret;
+        if((ret = atomicCAS((unsigned long long *)address, old, __double_as_longlong(val))) == old)
+            break;
+    }
+    return __longlong_as_double(ret);
+}
+
+__device__ __forceinline__ float atomicMin(float *address, float val)
+{
+    int ret = __float_as_int(*address);
+    while(val < __int_as_float(ret))
+    {
+        int old = ret;
+        if((ret = atomicCAS((int *)address, old, __float_as_int(val))) == old)
+            break;
+    }
+    return __int_as_float(ret);
+}
+
+// - atomicMax (double/float)
+//   see https://stackoverflow.com/questions/17399119/how-do-i-use-atomicmax-on-floating-point-values-in-cuda
+__device__ __forceinline__ double atomicMax(double *address, double val)
+{
+    unsigned long long ret = __double_as_longlong(*address);
+    while(val > __longlong_as_double(ret))
+    {
+        unsigned long long old = ret;
+        if((ret = atomicCAS((unsigned long long *)address, old, __double_as_longlong(val))) == old)
+            break;
+    }
+    return __longlong_as_double(ret);
+}
+
+__device__ __forceinline__ float atomicMax(float *address, float val)
+{
+    int ret = __float_as_int(*address);
+    while(val > __int_as_float(ret))
+    {
+        int old = ret;
+        if((ret = atomicCAS((int *)address, old, __float_as_int(val))) == old)
+            break;
+    }
+    return __int_as_float(ret);
+}
+
+#endif
\ No newline at end of file
diff --git a/src/pystencils/include/pystencils_runtime/cuda.cuh b/src/pystencils/include/pystencils_runtime/cuda.cuh
index 6a22e0b9034d224a4fda52233faab73cabd8a01d..691519a0144ac1a50094fac73d165d8dc31adba3 100644
--- a/src/pystencils/include/pystencils_runtime/cuda.cuh
+++ b/src/pystencils/include/pystencils_runtime/cuda.cuh
@@ -3,3 +3,4 @@
 #include <cuda_fp16.h>
 
 #include "./bits/gpu_infinities.h"
+#include "./bits/gpu_atomics.h"
diff --git a/src/pystencils/include/pystencils_runtime/hip.h b/src/pystencils/include/pystencils_runtime/hip.h
index 10084103af058d80aa68cc8e2bc58e7c31246181..32d5e84c67f4014e88a873123a4dfd90b7bcfee3 100644
--- a/src/pystencils/include/pystencils_runtime/hip.h
+++ b/src/pystencils/include/pystencils_runtime/hip.h
@@ -3,6 +3,7 @@
 #include <hip/hip_fp16.h>
 
 #include "./bits/gpu_infinities.h"
+#include "./bits/gpu_atomics.h"
 
 #ifdef __HIPCC_RTC__
 typedef __hip_uint8_t uint8_t;
diff --git a/src/pystencils/include/pystencils_runtime/simd_horizontal_helpers.h b/src/pystencils/include/pystencils_runtime/simd_horizontal_helpers.h
new file mode 100644
index 0000000000000000000000000000000000000000..bd1889153a56876032781b77dcb66a413b3fe07d
--- /dev/null
+++ b/src/pystencils/include/pystencils_runtime/simd_horizontal_helpers.h
@@ -0,0 +1,231 @@
+#pragma once
+
+#include <cmath>
+
+#if defined(__SSE3__)
+#include <immintrin.h>
+
+inline double _mm_horizontal_add_pd(double dst, __m128d src) { 
+	__m128d _v = src;
+	return dst + _mm_cvtsd_f64(_mm_hadd_pd(_v, _v));
+}
+
+inline float _mm_horizontal_add_ps(float dst, __m128 src) { 
+	__m128 _v = src;
+	__m128 _h = _mm_hadd_ps(_v, _v);
+	return dst + _mm_cvtss_f32(_mm_add_ps(_h, _mm_movehdup_ps(_h)));
+}
+
+inline double _mm_horizontal_mul_pd(double dst, __m128d src) { 
+	__m128d _v = src;
+	double _r = _mm_cvtsd_f64(_mm_mul_pd(_v, _mm_shuffle_pd(_v, _v, 1)));
+	return dst * _r;
+}
+
+inline float _mm_horizontal_mul_ps(float dst, __m128 src) { 
+	__m128 _v = src;
+	__m128 _h = _mm_mul_ps(_v, _mm_shuffle_ps(_v, _v, 177));
+	float _r = _mm_cvtss_f32(_mm_mul_ps(_h, _mm_shuffle_ps(_h, _h, 10)));
+	return dst * _r;
+}
+
+inline double _mm_horizontal_min_pd(double dst, __m128d src) { 
+	__m128d _v = src;
+	double _r = _mm_cvtsd_f64(_mm_min_pd(_v, _mm_shuffle_pd(_v, _v, 1)));
+	return fmin(_r, dst);
+}
+
+inline float _mm_horizontal_min_ps(float dst, __m128 src) { 
+	__m128 _v = src;
+	__m128 _h = _mm_min_ps(_v, _mm_shuffle_ps(_v, _v, 177));
+	float _r = _mm_cvtss_f32(_mm_min_ps(_h, _mm_shuffle_ps(_h, _h, 10)));
+	return fmin(_r, dst);
+}
+
+inline double _mm_horizontal_max_pd(double dst, __m128d src) { 
+	__m128d _v = src;
+	double _r = _mm_cvtsd_f64(_mm_max_pd(_v, _mm_shuffle_pd(_v, _v, 1)));
+	return fmax(_r, dst);
+}
+
+inline float _mm_horizontal_max_ps(float dst, __m128 src) { 
+	__m128 _v = src;
+	__m128 _h = _mm_max_ps(_v, _mm_shuffle_ps(_v, _v, 177));
+	float _r = _mm_cvtss_f32(_mm_max_ps(_h, _mm_shuffle_ps(_h, _h, 10)));
+	return fmax(_r, dst);
+}
+
+#endif
+
+#if defined(__AVX__)
+#include <immintrin.h>
+
+inline double _mm256_horizontal_add_pd(double dst, __m256d src) { 
+	__m256d _v = src;
+	__m256d _h = _mm256_hadd_pd(_v, _v);
+	return dst + _mm_cvtsd_f64(_mm_add_pd(_mm256_extractf128_pd(_h,1), _mm256_castpd256_pd128(_h)));
+}
+
+inline float _mm256_horizontal_add_ps(float dst, __m256 src) { 
+	__m256 _v = src;
+	__m256 _h = _mm256_hadd_ps(_v, _v);
+	__m128  _i = _mm_add_ps(_mm256_extractf128_ps(_h,1), _mm256_castps256_ps128(_h));
+	return dst + _mm_cvtss_f32(_mm_hadd_ps(_i,_i));
+}
+
+inline double _mm256_horizontal_mul_pd(double dst, __m256d src) { 
+	__m256d _v = src;
+	__m128d _w = _mm_mul_pd(_mm256_extractf128_pd(_v,1), _mm256_castpd256_pd128(_v));
+	double _r = _mm_cvtsd_f64(_mm_mul_pd(_w, _mm_permute_pd(_w,1))); 
+	return dst * _r;
+}
+
+inline float _mm256_horizontal_mul_ps(float dst, __m256 src) { 
+	__m256 _v = src;
+	__m128 _w = _mm_mul_ps(_mm256_extractf128_ps(_v,1), _mm256_castps256_ps128(_v));
+	__m128 _h = _mm_mul_ps(_w, _mm_shuffle_ps(_w, _w, 177));
+	float _r = _mm_cvtss_f32(_mm_mul_ps(_h, _mm_shuffle_ps(_h, _h, 10)));
+	return dst * _r;
+}
+
+inline double _mm256_horizontal_min_pd(double dst, __m256d src) { 
+	__m256d _v = src;
+	__m128d _w = _mm_min_pd(_mm256_extractf128_pd(_v,1), _mm256_castpd256_pd128(_v));
+	double _r = _mm_cvtsd_f64(_mm_min_pd(_w, _mm_permute_pd(_w,1))); 
+	return fmin(_r, dst);
+}
+
+inline float _mm256_horizontal_min_ps(float dst, __m256 src) { 
+	__m256 _v = src;
+	__m128 _w = _mm_min_ps(_mm256_extractf128_ps(_v,1), _mm256_castps256_ps128(_v));
+	__m128 _h = _mm_min_ps(_w, _mm_shuffle_ps(_w, _w, 177));
+	float _r = _mm_cvtss_f32(_mm_min_ps(_h, _mm_shuffle_ps(_h, _h, 10)));
+	return fmin(_r, dst);
+}
+
+inline double _mm256_horizontal_max_pd(double dst, __m256d src) { 
+	__m256d _v = src;
+	__m128d _w = _mm_max_pd(_mm256_extractf128_pd(_v,1), _mm256_castpd256_pd128(_v));
+	double _r = _mm_cvtsd_f64(_mm_max_pd(_w, _mm_permute_pd(_w,1))); 
+	return fmax(_r, dst);
+}
+
+inline float _mm256_horizontal_max_ps(float dst, __m256 src) { 
+	__m256 _v = src;
+	__m128 _w = _mm_max_ps(_mm256_extractf128_ps(_v,1), _mm256_castps256_ps128(_v));
+	__m128 _h = _mm_max_ps(_w, _mm_shuffle_ps(_w, _w, 177));
+	float _r = _mm_cvtss_f32(_mm_max_ps(_h, _mm_shuffle_ps(_h, _h, 10)));
+	return fmax(_r, dst);
+}
+
+#endif
+
+#if defined(__AVX512F__)
+#include <immintrin.h>
+
+inline double _mm512_horizontal_add_pd(double dst, __m512d src) { 
+	double _r = _mm512_reduce_add_pd(src);
+	return dst + _r;
+}
+
+inline float _mm512_horizontal_add_ps(float dst, __m512 src) { 
+	float _r = _mm512_reduce_add_ps(src);
+	return dst + _r;
+}
+
+inline double _mm512_horizontal_mul_pd(double dst, __m512d src) { 
+	double _r = _mm512_reduce_mul_pd(src);
+	return dst * _r;
+}
+
+inline float _mm512_horizontal_mul_ps(float dst, __m512 src) { 
+	float _r = _mm512_reduce_mul_ps(src);
+	return dst * _r;
+}
+
+inline double _mm512_horizontal_min_pd(double dst, __m512d src) { 
+	double _r = _mm512_reduce_min_pd(src);
+	return fmin(_r, dst);
+}
+
+inline float _mm512_horizontal_min_ps(float dst, __m512 src) { 
+	float _r = _mm512_reduce_min_ps(src);
+	return fmin(_r, dst);
+}
+
+inline double _mm512_horizontal_max_pd(double dst, __m512d src) { 
+	double _r = _mm512_reduce_max_pd(src);
+	return fmax(_r, dst);
+}
+
+inline float _mm512_horizontal_max_ps(float dst, __m512 src) { 
+	float _r = _mm512_reduce_max_ps(src);
+	return fmax(_r, dst);
+}
+
+#endif
+
+#if defined(_M_ARM64)
+#include <arm_neon.h>
+
+inline double vgetq_horizontal_add_f64(double dst, float64x2_t src) { 
+	float64x2_t _v = src;
+	double _r = vgetq_lane_f64(_v,0);
+	_r += vgetq_lane_f64(_v,1);
+	return dst + _r;
+}
+
+inline float vget_horizontal_add_f32(float dst, float32x4_t src) { 
+	float32x4_t _v = src;
+	float32x2_t _w = vadd_f32(vget_high_f32(_v), vget_low_f32(_v));
+	float _r = vgetq_lane_f32(_w,0);
+	_r += vget_lane_f32(_w,1);
+	return dst + _r;
+}
+
+inline double vgetq_horizontal_mul_f64(double dst, float64x2_t src) { 
+	float64x2_t _v = src;
+	double _r = vgetq_lane_f64(_v,0);
+	_r *= vgetq_lane_f64(_v,1);
+	return dst * _r;
+}
+
+inline float vget_horizontal_mul_f32(float dst, float32x4_t src) { 
+	float32x4_t _v = src;
+	float32x2_t _w = vmul_f32(vget_high_f32(_v), vget_low_f32(_v));
+	float _r = vgetq_lane_f32(_w,0);
+	_r *= vget_lane_f32(_w,1);
+	return dst * _r;
+}
+
+inline double vgetq_horizontal_min_f64(double dst, float64x2_t src) { 
+	float64x2_t _v = src;
+	double _r = vgetq_lane_f64(_v,0);
+	_r = fmin(_r, vgetq_lane_f64(_v,1));
+	return fmin(_r, dst);
+}
+
+inline float vget_horizontal_min_f32(float dst, float32x4_t src) { 
+	float32x4_t _v = src;
+	float32x2_t _w = vmin_f32(vget_high_f32(_v), vget_low_f32(_v));
+	float _r = vgetq_lane_f32(_w,0);
+	_r = fmin(_r, vget_lane_f32(_w,1));
+	return fmin(_r, dst);
+}
+
+inline double vgetq_horizontal_max_f64(double dst, float64x2_t src) { 
+	float64x2_t _v = src;
+	double _r = vgetq_lane_f64(_v,0);
+	_r = fmax(_r, vgetq_lane_f64(_v,1));
+	return fmax(_r, dst);
+}
+
+inline float vget_horizontal_max_f32(float dst, float32x4_t src) { 
+	float32x4_t _v = src;
+	float32x2_t _w = vmax_f32(vget_high_f32(_v), vget_low_f32(_v));
+	float _r = vgetq_lane_f32(_w,0);
+	_r = fmax(_r, vget_lane_f32(_w,1));
+	return fmax(_r, dst);
+}
+
+#endif
\ No newline at end of file
diff --git a/src/pystencils/jit/cpu_extension_module.py b/src/pystencils/jit/cpu_extension_module.py
index fca043db90725a441cb8b0ed9b99765c53eecb8d..4d76ea9ca1d75bc2b2ca2d77976b42c5f16b240c 100644
--- a/src/pystencils/jit/cpu_extension_module.py
+++ b/src/pystencils/jit/cpu_extension_module.py
@@ -92,6 +92,7 @@ class PsKernelExtensioNModule:
 
         #   Kernels and call wrappers
         from ..backend.emission import CAstPrinter
+
         printer = CAstPrinter(func_prefix="FUNC_PREFIX")
 
         for name, kernel in self._kernels.items():
@@ -200,13 +201,15 @@ if( !kwargs || !PyDict_Check(kwargs) ) {{
 """
 
     def __init__(self) -> None:
-        self._buffer_types: dict[Field, PsType] = dict()
-        self._array_extractions: dict[Field, str] = dict()
-        self._array_frees: dict[Field, str] = dict()
+        self._buffer_types: dict[Any, PsType] = dict()
+        self._array_extractions: dict[Any, str] = dict()
+        self._array_frees: dict[Any, str] = dict()
 
         self._array_assoc_var_extractions: dict[Parameter, str] = dict()
         self._scalar_extractions: dict[Parameter, str] = dict()
 
+        self._pointer_extractions: dict[Parameter, str] = dict()
+
         self._constraint_checks: list[str] = []
 
         self._call: str | None = None
@@ -232,38 +235,42 @@ if( !kwargs || !PyDict_Check(kwargs) ) {{
         else:
             return None
 
+    def get_buffer(self, buffer_name: str) -> str:
+        """Get the Python buffer object for a given buffer name."""
+        return f"buffer_{buffer_name}"
+
     def get_field_buffer(self, field: Field) -> str:
         """Get the Python buffer object for the given field."""
-        return f"buffer_{field.name}"
+        return self.get_buffer(field.name)
 
-    def extract_field(self, field: Field) -> None:
-        """Add the necessary code to extract the NumPy array for a given field"""
-        if field not in self._array_extractions:
-            extraction_code = self.TMPL_EXTRACT_ARRAY.format(name=field.name)
-            actual_dtype = self._buffer_types[field]
+    def extract_buffer(self, buffer: Any, name: str) -> None:
+        """Add the necessary code to extract the NumPy array for a given buffer"""
+        if buffer not in self._array_extractions:
+            extraction_code = self.TMPL_EXTRACT_ARRAY.format(name=name)
+            actual_dtype = self._buffer_types[buffer]
 
             #   Check array type
             type_char = self._type_char(actual_dtype)
             if type_char is not None:
-                dtype_cond = f"buffer_{field.name}.format[0] == '{type_char}'"
+                dtype_cond = f"buffer_{name}.format[0] == '{type_char}'"
                 extraction_code += self.TMPL_CHECK_ARRAY_TYPE.format(
                     cond=dtype_cond,
                     what="data type",
-                    name=field.name,
+                    name=name,
                     expected=str(actual_dtype),
                 )
 
             #   Check item size
             itemsize = actual_dtype.itemsize
-            item_size_cond = f"buffer_{field.name}.itemsize == {itemsize}"
+            item_size_cond = f"buffer_{name}.itemsize == {itemsize}"
             extraction_code += self.TMPL_CHECK_ARRAY_TYPE.format(
-                cond=item_size_cond, what="itemsize", name=field.name, expected=itemsize
+                cond=item_size_cond, what="itemsize", name=name, expected=itemsize
             )
 
-            self._array_extractions[field] = extraction_code
+            self._array_extractions[buffer] = extraction_code
 
-            release_code = f"PyBuffer_Release(&buffer_{field.name});"
-            self._array_frees[field] = release_code
+            release_code = f"PyBuffer_Release(&buffer_{name});"
+            self._array_frees[buffer] = release_code
 
     def extract_scalar(self, param: Parameter) -> str:
         if param not in self._scalar_extractions:
@@ -277,6 +284,26 @@ if( !kwargs || !PyDict_Check(kwargs) ) {{
 
         return param.name
 
+    def extract_ptr(self, param: Parameter) -> str:
+        if param not in self._pointer_extractions:
+            ptr = param.symbol
+            ptr_dtype = ptr.dtype
+
+            assert isinstance(ptr_dtype, PsPointerType)
+
+            self._buffer_types[ptr] = ptr_dtype.base_type
+            self.extract_buffer(ptr, param.name)
+            buffer = self.get_buffer(param.name)
+            code = (
+                f"{param.dtype.c_string()} {param.name} = ({param.dtype}) {buffer}.buf;"
+            )
+
+            assert code is not None
+
+            self._array_assoc_var_extractions[param] = code
+
+        return param.name
+
     def extract_array_assoc_var(self, param: Parameter) -> str:
         if param not in self._array_assoc_var_extractions:
             field = param.fields[0]
@@ -321,11 +348,13 @@ if( !kwargs || !PyDict_Check(kwargs) ) {{
                     actual_field_type = field.dtype
 
                 self._buffer_types[prop.field] = actual_field_type
-                self.extract_field(prop.field)
+                self.extract_buffer(prop.field, field.name)
 
         for param in params:
             if param.is_field_parameter:
                 self.extract_array_assoc_var(param)
+            elif isinstance(param.dtype, PsPointerType):
+                self.extract_ptr(param)
             else:
                 self.extract_scalar(param)
 
diff --git a/src/pystencils/jit/gpu_cupy.py b/src/pystencils/jit/gpu_cupy.py
index 7d2fe09915caad80545f6466c5c739a518b39419..893589a011e30fb184ee3e845c32ba2aea68eeb9 100644
--- a/src/pystencils/jit/gpu_cupy.py
+++ b/src/pystencils/jit/gpu_cupy.py
@@ -11,7 +11,6 @@ except ImportError:
 from ..codegen import Target
 from ..field import FieldType
 
-from ..types import PsType
 from .jit import JitBase, JitError, KernelWrapper
 from ..codegen import (
     Kernel,
@@ -20,7 +19,7 @@ from ..codegen import (
 )
 from ..codegen.gpu_indexing import GpuLaunchConfiguration
 from ..codegen.properties import FieldShape, FieldStride, FieldBasePtr
-from ..types import PsStructType, PsPointerType
+from ..types import PsType, PsStructType, PsPointerType
 
 from ..include import get_pystencils_include_path
 
@@ -186,10 +185,13 @@ class CupyKernelWrapper(KernelWrapper):
                                 arr.strides[coord] // arr.dtype.itemsize,
                             )
                             break
+            elif isinstance(kparam.dtype, PsPointerType):
+                val = kwargs[kparam.name]
+                kernel_args.append(val)
             else:
                 #   scalar parameter
-                val: Any = kwargs[param.name]
-                adder(param, val)
+                val = kwargs[kparam.name]
+                adder(kparam, val)
 
         #   Process Arguments
 
diff --git a/src/pystencils/simp/assignment_collection.py b/src/pystencils/simp/assignment_collection.py
index f1ba8715431d96fb2a09a01e45872def421fe94f..03b4edccfdd402d3013e0e82b924f5e5e87458c7 100644
--- a/src/pystencils/simp/assignment_collection.py
+++ b/src/pystencils/simp/assignment_collection.py
@@ -1,5 +1,6 @@
 import itertools
 from copy import copy
+
 from typing import Any, Dict, Iterable, Iterator, List, Optional, Sequence, Set, Union
 
 import sympy as sp
diff --git a/src/pystencils/sympyextensions/__init__.py b/src/pystencils/sympyextensions/__init__.py
index 3933fd9b3f9af5835d56be3c7f21096d540fde63..8c7d10a806a06a42bed5f6510304c517da6fae28 100644
--- a/src/pystencils/sympyextensions/__init__.py
+++ b/src/pystencils/sympyextensions/__init__.py
@@ -1,6 +1,7 @@
 from .astnodes import ConditionalFieldAccess
 from .typed_sympy import TypedSymbol, CastFunc, tcast, DynamicType
 from .pointers import mem_acc
+from .reduction import reduction_assignment, ReductionOp
 from .bit_masks import bit_conditional
 
 from .math import (
@@ -28,12 +29,14 @@ from .math import (
     count_operations_in_ast,
     common_denominator,
     get_symmetric_part,
-    SymbolCreator
+    SymbolCreator,
 )
 
 
 __all__ = [
     "ConditionalFieldAccess",
+    "reduction_assignment",
+    "ReductionOp",
     "TypedSymbol",
     "CastFunc",
     "tcast",
diff --git a/src/pystencils/sympyextensions/reduction.py b/src/pystencils/sympyextensions/reduction.py
new file mode 100644
index 0000000000000000000000000000000000000000..a1a9a026c4abc04f61358c18c3b7fb061a2a8d6e
--- /dev/null
+++ b/src/pystencils/sympyextensions/reduction.py
@@ -0,0 +1,82 @@
+from enum import Enum
+
+from sympy.codegen.ast import AssignmentBase
+
+from . import TypedSymbol
+
+
+class ReductionOp(Enum):
+    Add = "+"
+    Sub = "-"
+    Mul = "*"
+    Div = "/"
+    Min = "min"
+    Max = "max"
+
+
+class ReductionAssignment(AssignmentBase):
+    """
+    Base class for reduced assignments.
+
+    Attributes:
+    ===========
+
+    reduction_op : ReductionOp
+       Enum for binary operation being applied in the assignment, such as "Add" for "+", "Sub" for "-", etc.
+    """
+
+    _reduction_op = None  # type: ReductionOp
+
+    @property
+    def reduction_op(self):
+        return self._reduction_op
+
+    @reduction_op.setter
+    def reduction_op(self, op):
+        self._reduction_op = op
+
+    @classmethod
+    def _check_args(cls, lhs, rhs):
+        super()._check_args(lhs, rhs)
+
+        if not isinstance(lhs, TypedSymbol):
+            raise TypeError(f"lhs of needs to be a TypedSymbol. Got {type(lhs)} instead.")
+
+
+class AddReductionAssignment(ReductionAssignment):
+    reduction_op = ReductionOp.Add
+
+
+class SubReductionAssignment(ReductionAssignment):
+    reduction_op = ReductionOp.Sub
+
+
+class MulReductionAssignment(ReductionAssignment):
+    reduction_op = ReductionOp.Mul
+
+
+class MinReductionAssignment(ReductionAssignment):
+    reduction_op = ReductionOp.Min
+
+
+class MaxReductionAssignment(ReductionAssignment):
+    reduction_op = ReductionOp.Max
+
+
+# Mapping from ReductionOp enum to ReductionAssigment classes
+_reduction_assignment_classes = {
+    cls.reduction_op: cls
+    for cls in [
+        AddReductionAssignment,
+        SubReductionAssignment,
+        MulReductionAssignment,
+        MinReductionAssignment,
+        MaxReductionAssignment,
+    ]
+}
+
+
+def reduction_assignment(lhs, op: ReductionOp, rhs):
+    if op not in _reduction_assignment_classes:
+        raise ValueError("Unrecognized operator %s" % op)
+    return _reduction_assignment_classes[op](lhs, rhs)
diff --git a/tests/frontend/test_sympyextensions.py b/tests/frontend/test_sympyextensions.py
index ad5d2513b4400db938b8372a83ef43cc9339b35d..152527441e38e1b0e30acccfa47263487620afbb 100644
--- a/tests/frontend/test_sympyextensions.py
+++ b/tests/frontend/test_sympyextensions.py
@@ -3,7 +3,9 @@ import numpy as np
 import sympy as sp
 import pystencils
 
-from pystencils import Assignment
+import pytest
+
+from pystencils import Assignment, TypedSymbol
 from pystencils.sympyextensions import replace_second_order_products
 from pystencils.sympyextensions import remove_higher_order_terms
 from pystencils.sympyextensions import complete_the_squares_in_exp
@@ -27,6 +29,16 @@ from pystencils.sympyextensions.integer_functions import (
     div_ceil,
 )
 
+from pystencils.sympyextensions.reduction import (
+    ReductionOp,
+    AddReductionAssignment,
+    SubReductionAssignment,
+    MulReductionAssignment,
+    MinReductionAssignment,
+    MaxReductionAssignment,
+    reduction_assignment,
+)
+
 
 def test_utility():
     a = [1, 2]
@@ -199,6 +211,30 @@ def test_count_operations():
     assert ops["muls"] == 99
 
 
+@pytest.mark.parametrize("reduction_assignment_for_op", [
+    (ReductionOp.Add, AddReductionAssignment),
+    (ReductionOp.Sub, SubReductionAssignment),
+    (ReductionOp.Mul, MulReductionAssignment),
+    (ReductionOp.Min, MinReductionAssignment),
+    (ReductionOp.Max, MaxReductionAssignment),
+])
+def test_reduction_assignments(
+        reduction_assignment_for_op
+):
+    reduction_op, reduction_assignment_type = reduction_assignment_for_op
+
+    w = TypedSymbol("w", "float64")
+    v = sympy.symbols("v")
+
+    assignment = reduction_assignment(w, reduction_op, 0)
+
+    assert isinstance(assignment, reduction_assignment_type)
+
+    # invalid assignment since v is not a typed symbol
+    with pytest.raises(TypeError):
+        _ = reduction_assignment(v, reduction_op, 0)
+
+
 def test_common_denominator():
     x = sympy.symbols("x")
     expr = sympy.Rational(1, 2) + x * sympy.Rational(2, 3)
diff --git a/tests/kernelcreation/test_reduction.py b/tests/kernelcreation/test_reduction.py
new file mode 100644
index 0000000000000000000000000000000000000000..1fb8efc818797d37717277c0a9a74e3734e3a556
--- /dev/null
+++ b/tests/kernelcreation/test_reduction.py
@@ -0,0 +1,93 @@
+import pytest
+import numpy as np
+
+import pystencils as ps
+from pystencils.sympyextensions import ReductionOp, reduction_assignment
+
+INIT_W = 5
+INIT_ARR = 2
+SIZE = 15
+SOLUTION = {
+    ReductionOp.Add: INIT_W + INIT_ARR * SIZE,
+    ReductionOp.Sub: INIT_W - INIT_ARR * SIZE,
+    ReductionOp.Mul: INIT_W * INIT_ARR**SIZE,
+    ReductionOp.Min: min(INIT_W, INIT_ARR),
+    ReductionOp.Max: max(INIT_W, INIT_ARR),
+}
+
+
+# get AST for kernel with reduction assignment
+def get_reduction_assign_ast(dtype, op, config):
+    x = ps.fields(f"x: {dtype}[1d]")
+    w = ps.TypedSymbol("w", dtype)
+
+    red_assign = reduction_assignment(w, op, x.center())
+
+    return ps.create_kernel([red_assign], config, default_dtype=dtype)
+
+
+@pytest.mark.parametrize("instruction_set", ["sse", "avx"])
+@pytest.mark.parametrize("dtype", ["float64", "float32"])
+@pytest.mark.parametrize("op", [ReductionOp.Add, ReductionOp.Sub, ReductionOp.Mul, ReductionOp.Min, ReductionOp.Max])
+def test_reduction_cpu(instruction_set, dtype, op):
+    vectorize_info = {
+        "instruction_set": instruction_set,
+        "assume_inner_stride_one": True,
+    }
+
+    config = ps.CreateKernelConfig(
+        target=ps.Target.CPU, cpu_openmp=True, cpu_vectorize_info=vectorize_info
+    )
+
+    ast_reduction = get_reduction_assign_ast(dtype, op, config)
+    ps.show_code(ast_reduction)
+    kernel_reduction = ast_reduction.compile()
+
+    array = np.full((SIZE,), INIT_ARR, dtype=dtype)
+    reduction_array = np.full((1,), INIT_W, dtype=dtype)
+
+    kernel_reduction(x=array, w=reduction_array)
+    assert np.allclose(reduction_array, SOLUTION[op])
+
+
+@pytest.mark.parametrize("dtype", ["float64", "float32"])
+@pytest.mark.parametrize("op", [ReductionOp.Add, ReductionOp.Sub, ReductionOp.Mul, ReductionOp.Min, ReductionOp.Max])
+@pytest.mark.parametrize("assume_warp_aligned_block_size", [True, False])
+@pytest.mark.parametrize("use_block_fitting", [True, False])
+def test_reduction_gpu(
+        dtype: str,
+        op: str,
+        assume_warp_aligned_block_size: bool,
+        use_block_fitting: bool,
+):
+    try:
+        import cupy as cp
+        from cupy_backends.cuda.api.runtime import CUDARuntimeError
+
+        device_count = range(cp.cuda.runtime.getDeviceCount())
+        print(f"Found {device_count} GPUs")
+    except ImportError:
+        pytest.skip(reason="CuPy is not available", allow_module_level=True)
+    except CUDARuntimeError:
+        pytest.skip(
+            reason="No CUDA capable device is detected", allow_module_level=True
+        )
+
+    cfg = ps.CreateKernelConfig(target=ps.Target.GPU)
+    cfg.gpu.assume_warp_aligned_block_size = assume_warp_aligned_block_size
+
+    ast_reduction = get_reduction_assign_ast(dtype, op, cfg)
+    ps.show_code(ast_reduction)
+    kernel_reduction = ast_reduction.compile()
+
+    if use_block_fitting:
+        kernel_reduction.launch_config.fit_block_size((32, 1, 1))
+
+    array = np.full((SIZE,), INIT_ARR, dtype=dtype)
+    reduction_array = np.full((1,), INIT_W, dtype=dtype)
+
+    array_gpu = cp.asarray(array)
+    reduction_array_gpu = cp.asarray(reduction_array)
+
+    kernel_reduction(x=array_gpu, w=reduction_array_gpu)
+    assert np.allclose(reduction_array_gpu.get(), SOLUTION[op])
diff --git a/tests/nbackend/kernelcreation/test_analysis.py b/tests/nbackend/kernelcreation/test_analysis.py
new file mode 100644
index 0000000000000000000000000000000000000000..d68c0a5f3c88fbc53c157ea8051403c9bb3c56b3
--- /dev/null
+++ b/tests/nbackend/kernelcreation/test_analysis.py
@@ -0,0 +1,38 @@
+import pytest
+
+from pystencils import fields, TypedSymbol, AddReductionAssignment, Assignment, KernelConstraintsError
+from pystencils.backend.kernelcreation import KernelCreationContext, KernelAnalysis
+from pystencils.sympyextensions import mem_acc
+from pystencils.types.quick import Ptr, Fp
+
+
+def test_invalid_reduction_symbol_reassign():
+    dtype = Fp(64)
+    ctx = KernelCreationContext(default_dtype=dtype)
+    analysis = KernelAnalysis(ctx)
+
+    x = fields(f"x: [1d]")
+    w = TypedSymbol("w", dtype)
+
+    # illegal reassign to already locally defined symbol (here: reduction symbol)
+    with pytest.raises(KernelConstraintsError):
+        analysis([
+            AddReductionAssignment(w, 3 * x.center()),
+            Assignment(w, 0)
+        ])
+
+def test_invalid_reduction_symbol_reference():
+    dtype = Fp(64)
+    ctx = KernelCreationContext(default_dtype=dtype)
+    analysis = KernelAnalysis(ctx)
+
+    x = fields(f"x: [1d]")
+    v = TypedSymbol("v", dtype)
+    w = TypedSymbol("w", dtype)
+
+    # do not allow reduction symbol to be referenced on rhs of other assignments
+    with pytest.raises(KernelConstraintsError):
+        analysis([
+            AddReductionAssignment(w, 3 * x.center()),
+            Assignment(v, w)
+        ])
\ No newline at end of file
diff --git a/tests/nbackend/kernelcreation/test_freeze.py b/tests/nbackend/kernelcreation/test_freeze.py
index 5384c90f90650bf6f19f2891a3ddec6e5ed34188..679a7cebcd74e794946e78efecd2e0bdedb97862 100644
--- a/tests/nbackend/kernelcreation/test_freeze.py
+++ b/tests/nbackend/kernelcreation/test_freeze.py
@@ -8,6 +8,7 @@ from pystencils import (
     create_numeric_type,
     TypedSymbol,
     DynamicType,
+    KernelConstraintsError,
 )
 from pystencils.sympyextensions import tcast, bit_conditional
 from pystencils.sympyextensions.pointers import mem_acc
@@ -44,6 +45,7 @@ from pystencils.backend.ast.expressions import (
     PsArrayInitList,
     PsSubscript,
     PsMemAcc,
+    PsSymbolExpr,
 )
 from pystencils.backend.constants import PsConstant
 from pystencils.backend.functions import PsMathFunction, MathFunctions, PsConstantFunction, ConstantFunctions
@@ -66,6 +68,14 @@ from pystencils.sympyextensions.integer_functions import (
     ceil_to_multiple,
     div_ceil,
 )
+from pystencils.sympyextensions.reduction import (
+    AddReductionAssignment,
+    SubReductionAssignment,
+    MulReductionAssignment,
+    MinReductionAssignment,
+    MaxReductionAssignment,
+)
+from pystencils.types import PsTypeError
 
 
 def test_freeze_simple():
@@ -519,6 +529,79 @@ def test_invalid_arrays():
         _ = freeze(symb_arr)
 
 
+@pytest.mark.parametrize("reduction_assignment_rhs_type",
+                         [
+                             (AddReductionAssignment, PsAdd),
+                             (SubReductionAssignment, PsSub),
+                             (MulReductionAssignment, PsMul),
+                             (MinReductionAssignment, PsCall),
+                             (MaxReductionAssignment, PsCall),
+                         ])
+def test_reduction_assignments(
+        reduction_assignment_rhs_type
+):
+    x = fields(f"x: float64[1d]")
+    w = TypedSymbol("w", "float64")
+
+    reduction_op, rhs_type = reduction_assignment_rhs_type
+
+    ctx = KernelCreationContext()
+    freeze = FreezeExpressions(ctx)
+
+    one = PsExpression.make(PsConstant(1, ctx.index_dtype))
+    counter = ctx.get_symbol("ctr", ctx.index_dtype)
+    ispace = FullIterationSpace(
+        ctx, [FullIterationSpace.Dimension(one, one, one, counter)]
+    )
+    ctx.set_iteration_space(ispace)
+
+    expr = freeze(reduction_op(w, 3 * x.center()))
+
+    info = ctx.find_reduction_info(w.name)
+
+    assert isinstance(expr, PsAssignment)
+    assert isinstance(expr.lhs, PsSymbolExpr)
+
+    assert expr.lhs.symbol == info.local_symbol
+    assert expr.lhs.dtype == w.dtype
+
+    assert isinstance(expr.rhs, rhs_type)
+
+
+def test_invalid_reduction_assignments():
+    x = fields(f"x: float64[1d]")
+    w = TypedSymbol("w", "float64")
+
+    assignment = Assignment(w, -1 * x.center())
+    reduction_assignment = AddReductionAssignment(w, 3 * x.center())
+
+    expected_errors_for_invalid_cases = [
+        # 1) Reduction symbol is used before ReductionAssignment.
+        #    May only be used for reductions -> KernelConstraintsError
+        ([assignment, reduction_assignment], KernelConstraintsError),
+        # 2) Reduction symbol is used after ReductionAssignment.
+        #    Reduction symbol is converted to pointer after freeze -> PsTypeError
+        ([reduction_assignment, assignment], PsTypeError),
+        # 3) Duplicate ReductionAssignment
+        #    May only be used once for now -> KernelConstraintsError
+        ([reduction_assignment, reduction_assignment], KernelConstraintsError)
+    ]
+
+    for invalid_assignment, error_class in expected_errors_for_invalid_cases:
+        ctx = KernelCreationContext()
+        freeze = FreezeExpressions(ctx)
+
+        one = PsExpression.make(PsConstant(1, ctx.index_dtype))
+        counter = ctx.get_symbol("ctr", ctx.index_dtype)
+        ispace = FullIterationSpace(
+            ctx, [FullIterationSpace.Dimension(one, one, one, counter)]
+        )
+        ctx.set_iteration_space(ispace)
+
+        with pytest.raises(error_class):
+            _ = [freeze(asm) for asm in invalid_assignment]
+
+
 def test_memory_access():
     ctx = KernelCreationContext()
     freeze = FreezeExpressions(ctx)
diff --git a/tests/nbackend/kernelcreation/test_typification.py b/tests/nbackend/kernelcreation/test_typification.py
index d5b204e9f6469d2c65c27f72cbf9f7a93ace4777..62b0106b9ab7fafc9febb25fef00ba404a3f1282 100644
--- a/tests/nbackend/kernelcreation/test_typification.py
+++ b/tests/nbackend/kernelcreation/test_typification.py
@@ -5,6 +5,7 @@ import numpy as np
 from typing import cast
 
 from pystencils import Assignment, TypedSymbol, Field, FieldType, AddAugmentedAssignment
+from pystencils.sympyextensions import ReductionOp
 from pystencils.sympyextensions.pointers import mem_acc
 
 from pystencils.backend.ast.structural import (
@@ -34,9 +35,9 @@ from pystencils.backend.ast.expressions import (
     PsTernary,
     PsMemAcc
 )
+from pystencils.backend.ast.vector import PsVecBroadcast, PsVecHorizontal
 from pystencils.backend.ast import dfs_preorder
 from pystencils.backend.ast.expressions import PsAdd
-from pystencils.backend.ast.vector import PsVecBroadcast
 from pystencils.backend.constants import PsConstant
 from pystencils.backend.functions import CFunction, PsConstantFunction, ConstantFunctions
 from pystencils.types import constify, create_type, create_numeric_type, PsVectorType, PsTypeError
@@ -744,6 +745,50 @@ def test_propagate_constant_type_in_broadcast():
         assert expr.dtype == PsVectorType(fp16, 4)
 
 
+def test_typify_horizontal_vector_reductions():
+    ctx = KernelCreationContext()
+    typify = Typifier(ctx)
+
+    reduction_op = ReductionOp.Add
+    stype = Fp(32)
+    vtype = PsVectorType(stype, 4)
+
+    def create_symb_expr(name, tpe):
+        return PsExpression.make(ctx.get_symbol(name, tpe))
+
+    # create valid horizontal and check if expression type is scalar
+    result = typify(
+        PsVecHorizontal(
+            create_symb_expr("s1", stype), create_symb_expr("v1", vtype), ReductionOp.Add
+        )
+    )
+    assert result.get_dtype() == stype
+
+    # create invalid horizontal by using scalar type for expected vector type
+    with pytest.raises(TypificationError):
+        _ = typify(
+            PsVecHorizontal(
+                create_symb_expr("s2", stype), create_symb_expr("v2", stype), reduction_op
+            )
+        )
+
+    # create invalid horizontal by using vector type for expected scalar type
+    with pytest.raises(TypificationError):
+        _ = typify(
+            PsVecHorizontal(
+                create_symb_expr("s3", vtype), create_symb_expr("v3", vtype), reduction_op
+            )
+        )
+
+    # create invalid horizontal where base type of vector does not match with scalar type
+    with pytest.raises(TypificationError):
+        _ = typify(
+            PsVecHorizontal(
+                create_symb_expr("s4", Int(32)), create_symb_expr("v4", vtype), reduction_op
+            )
+        )
+
+
 def test_inference_fails():
     ctx = KernelCreationContext()
     typify = Typifier(ctx)
diff --git a/util/generate_simd_horizontal_op.py b/util/generate_simd_horizontal_op.py
new file mode 100644
index 0000000000000000000000000000000000000000..1d652c6e1f53f5a7b060f2a725162ef07f1fdfae
--- /dev/null
+++ b/util/generate_simd_horizontal_op.py
@@ -0,0 +1,309 @@
+from enum import Enum
+
+FCT_QUALIFIERS = "inline"
+
+
+class InstructionSets(Enum):
+    SSE3 = "SSE3"
+    AVX = "AVX"
+    AVX512 = "AVX512"
+    NEON = "NEON"
+
+    def __str__(self):
+        return self.value
+
+
+class ReductionOps(Enum):
+    Add = ("add", "+")
+    Mul = ("mul", "*")
+    Min = ("min", "min")
+    Max = ("max", "max")
+
+    def __init__(self, op_name, op_str):
+        self.op_name = op_name
+        self.op_str = op_str
+
+
+class ScalarTypes(Enum):
+    Double = "double"
+    Float = "float"
+
+    def __str__(self):
+        return self.value
+
+
+class VectorTypes(Enum):
+    SSE3_128d = "__m128d"
+    SSE3_128 = "__m128"
+
+    AVX_256d = "__m256d"
+    AVX_256 = "__m256"
+    AVX_128 = "__m128"
+
+    AVX_512d = "__m512d"
+    AVX_512 = "__m512"
+
+    NEON_64x2 = "float64x2_t"
+    NEON_32x4 = "float32x4_t"
+
+    def __str__(self):
+        return self.value
+
+
+class Variable:
+    def __init__(self, name: str, dtype: ScalarTypes | VectorTypes):
+        self._name = name
+        self._dtype = dtype
+
+    def __str__(self):
+        return f"{self._dtype} {self._name}"
+
+    @property
+    def name(self) -> str:
+        return self._name
+
+    @property
+    def dtype(self) -> ScalarTypes | VectorTypes:
+        return self._dtype
+
+
+def get_intrin_from_vector_type(vtype: VectorTypes) -> InstructionSets:
+    match vtype:
+        case VectorTypes.SSE3_128 | VectorTypes.SSE3_128d:
+            return InstructionSets.SSE3
+        case VectorTypes.AVX_256 | VectorTypes.AVX_256d:
+            return InstructionSets.AVX
+        case VectorTypes.AVX_512 | VectorTypes.AVX_512d:
+            return InstructionSets.AVX512
+        case VectorTypes.NEON_32x4 | VectorTypes.NEON_64x2:
+            return InstructionSets.NEON
+
+
+def intrin_prefix(instruction_set: InstructionSets, double_prec: bool):
+    match instruction_set:
+        case InstructionSets.SSE3:
+            return "_mm"
+        case InstructionSets.AVX:
+            return "_mm256"
+        case InstructionSets.AVX512:
+            return "_mm512"
+        case InstructionSets.NEON:
+            return "vgetq" if double_prec else "vget"
+        case _:
+            raise ValueError(f"Unknown instruction set {instruction_set}")
+
+
+def intrin_suffix(instruction_set: InstructionSets, double_prec: bool):
+    if instruction_set in [InstructionSets.SSE3, InstructionSets.AVX, InstructionSets.AVX512]:
+        return "pd" if double_prec else "ps"
+    elif instruction_set in [InstructionSets.NEON]:
+        return "f64" if double_prec else "f32"
+    else:
+        raise ValueError(f"Unknown instruction set {instruction_set}")
+
+
+def generate_hadd_intrin(instruction_set: InstructionSets, double_prec: bool, v: str):
+    return f"{intrin_prefix(instruction_set, double_prec)}_hadd_{intrin_suffix(instruction_set, double_prec)}({v}, {v})"
+
+
+def generate_shuffle_intrin(instruction_set: InstructionSets, double_prec: bool, v: str, offset):
+    return f"_mm_shuffle_{intrin_suffix(instruction_set, double_prec)}({v}, {v}, {offset})"
+
+
+def generate_op_intrin(instruction_set: InstructionSets, double_prec: bool, reduction_op: ReductionOps, a: str, b: str):
+    return f"_mm_{reduction_op.op_name}_{intrin_suffix(instruction_set, double_prec)}({a}, {b})"
+
+
+def generate_cvts_intrin(double_prec: bool, v: str):
+    convert_suffix = "f64" if double_prec else "f32"
+    intrin_suffix = "d" if double_prec else "s"
+    return f"_mm_cvts{intrin_suffix}_{convert_suffix}({v})"
+
+
+def generate_fct_name(instruction_set: InstructionSets, double_prec: bool, op: ReductionOps):
+    prefix = intrin_prefix(instruction_set, double_prec)
+    suffix = intrin_suffix(instruction_set, double_prec)
+    return f"{prefix}_horizontal_{op.op_name}_{suffix}"
+
+
+def generate_fct_decl(instruction_set: InstructionSets, op: ReductionOps, svar: Variable, vvar: Variable):
+    double_prec = svar.dtype is ScalarTypes.Double
+    return f"{FCT_QUALIFIERS} {svar.dtype} {generate_fct_name(instruction_set, double_prec, op)}({svar}, {vvar}) {{ \n"
+
+
+# SSE & AVX provide horizontal add 'hadd' intrinsic that allows for specialized handling
+def generate_simd_horizontal_add(scalar_var: Variable, vector_var: Variable):
+    reduction_op = ReductionOps.Add
+    instruction_set = get_intrin_from_vector_type(vector_var.dtype)
+    double_prec = scalar_var.dtype is ScalarTypes.Double
+
+    sname = scalar_var.name
+    vtype = vector_var.dtype
+    vname = vector_var.name
+
+    simd_op = lambda a, b: generate_op_intrin(instruction_set, double_prec, reduction_op, a, b)
+    hadd = lambda var: generate_hadd_intrin(instruction_set, double_prec, var)
+    cvts = lambda var: generate_cvts_intrin(double_prec, var)
+
+    # function body
+    body = f"\t{vtype} _v = {vname};\n"
+    match instruction_set:
+        case InstructionSets.SSE3:
+            if double_prec:
+                body += f"\treturn {sname} + {cvts(hadd('_v'))};\n"
+            else:
+                body += f"\t{vtype} _h = {hadd('_v')};\n" \
+                        f"\treturn {sname} + {cvts(simd_op('_h', '_mm_movehdup_ps(_h)'))};\n"
+
+        case InstructionSets.AVX:
+            if double_prec:
+                body += f"\t{vtype} _h = {hadd('_v')};\n" \
+                        f"\treturn {sname} + {cvts(simd_op('_mm256_extractf128_pd(_h,1)', '_mm256_castpd256_pd128(_h)'))};\n"
+            else:
+                add_i = "_mm_hadd_ps(_i,_i)"
+                body += f"\t{vtype} _h = {hadd('_v')};\n" \
+                        f"\t__m128  _i = {simd_op('_mm256_extractf128_ps(_h,1)', '_mm256_castps256_ps128(_h)')};\n" \
+                        f"\treturn {sname} + {cvts(add_i)};\n"
+
+        case _:
+            raise ValueError(f"No specialized version of horizontal_add available for {instruction_set}")
+
+    # function decl
+    decl = generate_fct_decl(instruction_set, reduction_op, scalar_var, vector_var)
+
+    return decl + body + "}\n"
+
+
+def generate_simd_horizontal_op(reduction_op: ReductionOps, scalar_var: Variable, vector_var: Variable):
+    instruction_set = get_intrin_from_vector_type(vector_var.dtype)
+    double_prec = scalar_var.dtype is ScalarTypes.Double
+
+    # generate specialized version for add operation
+    if reduction_op == ReductionOps.Add and instruction_set in [InstructionSets.SSE3, InstructionSets.AVX]:
+        return generate_simd_horizontal_add(scalar_var, vector_var)
+
+    sname = scalar_var.name
+    stype = scalar_var.dtype
+    vtype = vector_var.dtype
+    vname = vector_var.name
+
+    opname = reduction_op.op_name
+    opstr = reduction_op.op_str
+
+    reduction_function = f"f{opname}" \
+        if reduction_op in [ReductionOps.Max, ReductionOps.Min] else None
+
+    simd_op = lambda a, b: generate_op_intrin(instruction_set, double_prec, reduction_op, a, b)
+    cvts = lambda var: generate_cvts_intrin(double_prec, var)
+    shuffle = lambda var, offset: generate_shuffle_intrin(instruction_set, double_prec, var, offset)
+
+    # function body
+    body = f"\t{vtype} _v = {vname};\n" if instruction_set != InstructionSets.AVX512 else ""
+    match instruction_set:
+        case InstructionSets.SSE3:
+            if double_prec:
+                body += f"\t{stype} _r = {cvts(simd_op('_v', shuffle('_v', 1)))};\n"
+            else:
+                body += f"\t{vtype} _h = {simd_op('_v', shuffle('_v', 177))};\n" \
+                        f"\t{stype} _r = {cvts(simd_op('_h', shuffle('_h', 10)))};\n"
+
+        case InstructionSets.AVX:
+            if double_prec:
+                body += f"\t__m128d _w = {simd_op('_mm256_extractf128_pd(_v,1)', '_mm256_castpd256_pd128(_v)')};\n" \
+                        f"\t{stype} _r = {cvts(simd_op('_w', '_mm_permute_pd(_w,1)'))}; \n"
+            else:
+                body += f"\t__m128 _w = {simd_op('_mm256_extractf128_ps(_v,1)', '_mm256_castps256_ps128(_v)')};\n" \
+                        f"\t__m128 _h = {simd_op('_w', shuffle('_w', 177))};\n" \
+                        f"\t{stype} _r = {cvts(simd_op('_h', shuffle('_h', 10)))};\n"
+
+        case InstructionSets.AVX512:
+            suffix = intrin_suffix(instruction_set, double_prec)
+            body += f"\t{stype} _r = _mm512_reduce_{opname}_{suffix}({vname});\n"
+
+        case InstructionSets.NEON:
+            if double_prec:
+                body += f"\t{stype} _r = vgetq_lane_f64(_v,0);\n"
+                if reduction_function:
+                    body += f"\t_r = {reduction_function}(_r, vgetq_lane_f64(_v,1));\n"
+                else:
+                    body += f"\t_r {opstr}= vgetq_lane_f64(_v,1);\n"
+            else:
+                body += f"\tfloat32x2_t _w = v{opname}_f32(vget_high_f32(_v), vget_low_f32(_v));\n" \
+                        f"\t{stype} _r = vgetq_lane_f32(_w,0);\n"
+                if reduction_function:
+                    body += f"\t_r = {reduction_function}(_r, vget_lane_f32(_w,1));\n"
+                else:
+                    body += f"\t_r {opstr}= vget_lane_f32(_w,1);\n"
+
+        case _:
+            raise ValueError(f"Unsupported instruction set {instruction_set}")
+
+    # finalize reduction
+    if reduction_function:
+        body += f"\treturn {reduction_function}(_r, {sname});\n"
+    else:
+        body += f"\treturn {sname} {opstr} _r;\n"
+
+    # function decl
+    decl = generate_fct_decl(instruction_set, reduction_op, scalar_var, vector_var)
+
+    return decl + body + "}\n"
+
+
+stypes = {
+    True: ScalarTypes.Double,
+    False: ScalarTypes.Float
+}
+
+vtypes_for_instruction_set = {
+    InstructionSets.SSE3: {
+        True: VectorTypes.SSE3_128d,
+        False: VectorTypes.SSE3_128
+    },
+    InstructionSets.AVX: {
+        True: VectorTypes.AVX_256d,
+        False: VectorTypes.AVX_256
+    },
+    InstructionSets.AVX512: {
+        True: VectorTypes.AVX_512d,
+        False: VectorTypes.AVX_512
+    },
+    InstructionSets.NEON: {
+        True: VectorTypes.NEON_64x2,
+        False: VectorTypes.NEON_32x4
+    },
+}
+
+guards_for_instruction_sets = {
+    InstructionSets.SSE3: "__SSE3__",
+    InstructionSets.AVX: "__AVX__",
+    InstructionSets.AVX512: '__AVX512F__',
+    InstructionSets.NEON: '_M_ARM64',
+}
+
+code = """#pragma once
+
+#include <cmath>
+
+"""
+
+for instruction_set in InstructionSets:
+    code += f"#if defined({guards_for_instruction_sets[instruction_set]})\n"
+
+    if instruction_set in [InstructionSets.SSE3, InstructionSets.AVX, InstructionSets.AVX512]:
+        code += "#include <immintrin.h>\n\n"
+    elif instruction_set == InstructionSets.NEON:
+        code += "#include <arm_neon.h>\n\n"
+    else:
+        ValueError(f"Missing header include for instruction set {instruction_set}")
+
+    for reduction_op in ReductionOps:
+        for double_prec in [True, False]:
+            scalar_var = Variable("dst", stypes[double_prec])
+            vector_var = Variable("src", vtypes_for_instruction_set[instruction_set][double_prec])
+
+            code += generate_simd_horizontal_op(reduction_op, scalar_var, vector_var) + "\n"
+
+    code += "#endif\n\n"
+
+print(code)