Add CUDA support

Merged Markus Holzer requested to merge CUDA into master

This MR adds CUDA support.

Remaining to-dos:

  • Fix compilation problems and add required NVCC flags
  • Add launch bound option
Edited by Markus Holzer

Merge request reports

Merge request pipeline #55674 skipped

Merge request pipeline skipped for 879ee872

Merged by Markus Holzer 9 months ago (Sep 18, 2024 11:53am UTC)

Pipeline #69083 passed

Pipeline passed for fcfbef80 on master

Activity

  • Jan Hönig approved this merge request

  • Christoph Alt added 2 commits

    • 3300460d - made the gpu test more streamlined with the cpu tests
    • 3e930f35 - removed some code duplication between benchmark and benchmark_gpu

  • Christoph Alt added 2 commits

    • 24f81cf6 - added submodules from cpu and gpu benchmark generation
    • 857f1848 - removed the mutable default argument from the _kernel_header and

  • Jan Hönig resolved all threads

  • The code looks really nice now and is easily extensible to other platforms/compilers. Is the second TODO in the PR's description also done?

  • Jan Hönig marked this merge request as ready

  • Thank you, and thank you for your review :smile: I am not really sure what is meant there. In principle, it is possible to configure the cuda_block_size within the generate_benchmark call. But as I see now, this parameter is not used at all.

  • It means this:

    __global__ void
    __launch_bounds__(MAX_THREADS_PER_BLOCK, MIN_BLOCKS_PER_MP)
    fooKernel(int *inArr, int *outArr)
    {
        // ... Computation of kernel
    }

    The idea is that __launch_bounds__ can be added as an optional argument. On AMD GPUs, for example, it was quite important for limiting register usage in LBM kernels.

    In most cases, you don't need the second argument. When pystencils is used standalone, it can add the launch bounds option via cupy; since that is not what we are doing here, we need to add it manually as a tuning parameter.

    Edited by Markus Holzer
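
    As a minimal sketch of the idea above, a Python code generator could emit the qualifier only when a bound is requested. The helper names launch_bounds_attr and kernel_signature are hypothetical and are not taken from pystencils or from this MR; the sketch only illustrates treating the launch bounds as optional tuning parameters.

```python
from typing import Optional, Sequence


def launch_bounds_attr(max_threads: Optional[int] = None,
                       min_blocks: Optional[int] = None) -> str:
    """Return a __launch_bounds__ qualifier, or '' if no bound is requested.

    Hypothetical helper for illustration only.
    """
    if max_threads is None:
        if min_blocks is not None:
            # The second argument is only meaningful together with the first.
            raise ValueError("min_blocks requires max_threads")
        return ""
    if max_threads <= 0 or (min_blocks is not None and min_blocks <= 0):
        raise ValueError("launch bounds must be positive")
    if min_blocks is None:
        return f"__launch_bounds__({max_threads})"
    return f"__launch_bounds__({max_threads}, {min_blocks})"


def kernel_signature(name: str,
                     params: Sequence[str],
                     max_threads: Optional[int] = None,
                     min_blocks: Optional[int] = None) -> str:
    """Build the first line of a CUDA kernel definition as a string."""
    attr = launch_bounds_attr(max_threads, min_blocks)
    parts = ["__global__ void"]
    if attr:
        parts.append(attr)
    parts.append(f"{name}({', '.join(params)})")
    return " ".join(parts)


if __name__ == "__main__":
    print(kernel_signature("fooKernel", ["int *inArr", "int *outArr"],
                           max_threads=256))
```

    Keeping both values optional means existing callers see an unchanged signature, while a benchmark sweep can vary max_threads to study its effect on register usage.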
  • Markus Holzer approved this merge request

  • Would it make sense to also add ROCm support? Mostly this would be just a renaming, for example: #include <cuda_runtime.h> --> #include <hip/hip_runtime.h>.

    I'm not sure if it is better to add this in a second MR or directly here?

    Edited by Markus Holzer
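
    The renaming described above could be sketched as a small token-level rewrite of the generated source. This is an illustration under stated assumptions, not what the MR implements: the function cuda_to_hip and the CUDA_TO_HIP table are hypothetical, the table covers only a handful of common runtime calls, and the HIP runtime header is taken to be hip/hip_runtime.h. ROCm also ships dedicated tools (hipify-perl, hipify-clang) for full translations.

```python
import re

# Small, deliberately incomplete mapping of CUDA runtime names to HIP names.
CUDA_TO_HIP = {
    "cuda_runtime.h": "hip/hip_runtime.h",
    "cudaMalloc": "hipMalloc",
    "cudaFree": "hipFree",
    "cudaMemcpy": "hipMemcpy",
    "cudaMemcpyHostToDevice": "hipMemcpyHostToDevice",
    "cudaMemcpyDeviceToHost": "hipMemcpyDeviceToHost",
    "cudaDeviceSynchronize": "hipDeviceSynchronize",
}

# Longest names first, with word boundaries, so that only whole
# identifiers listed in the table are rewritten.
_PATTERN = re.compile(
    r"\b(?:"
    + "|".join(re.escape(k) for k in sorted(CUDA_TO_HIP, key=len, reverse=True))
    + r")\b"
)


def cuda_to_hip(source: str) -> str:
    """Replace the known CUDA names in `source` with their HIP equivalents."""
    return _PATTERN.sub(lambda m: CUDA_TO_HIP[m.group(0)], source)


if __name__ == "__main__":
    print(cuda_to_hip("#include <cuda_runtime.h>"))
```

    Identifiers that are not in the table are left untouched, so an incomplete mapping shows up as a compile error on the HIP side instead of a silent mistranslation.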