Implementing a 'for' template kernel

Hello,
I’ve been tinkering a lot with CUDA lately, and since the program I’m trying to migrate from serial CPU to parallel GPU is very complex and has many for loops that could be parallelized, I wanted to create a ‘drop-in’ template to parallelize all of them.

After trying out basically everything, the only approach I found that works is the following:

#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
using namespace std;

const int MAX_THREAD = 1024; 

template<typename Lambda>
__global__ void cudaForInstance(int size, Lambda lambda) {
    int index = blockIdx.x * blockDim.x + threadIdx.x; // global thread index
    if(index < size) lambda(index);                    // mask off the excess threads
}

template<typename Lambda>
__host__ __device__ void cudaFor(int size, Lambda lambda, bool synchronize = false) {
    // Enough blocks to cover 'size'; the bounds check in the kernel masks the excess threads.
    cudaForInstance<<<size / MAX_THREAD + 1, MAX_THREAD>>>(size, lambda);
    // When compiled for the device, this synchronization needs legacy CDP1 (see the flags below).
    if(synchronize) cudaDeviceSynchronize();
}

template<typename Lambda>
__device__ void nestedCudaFor(int size, Lambda lambda, bool synchronize = false) {
    // Re-wrap the captured lambda in a plain device-side lambda before relaunching.
    cudaFor(size, [lambda](int i) -> void { lambda(i); }, synchronize);
}

int main() {
    int size = 5;
    // The outer loop runs on the GPU; each outer iteration launches its own nested grid.
    cudaFor(size, [=]__device__(int i) -> void {
        nestedCudaFor(size, [=]__device__(int j) -> void {
            printf("(%d, %d)\t", i, j);
            if (j == size - 1)
                printf("\n");
        });
    });
    cudaDeviceSynchronize(); // wait for the device work (and its printf output) before exiting

    return 0;
}

It requires:

  • Generate Relocatable Device Code: YES (-rdc=true)
  • --extended-lambda
  • -D CUDA_FORCE_CDP1_IF_SUPPORTED (this is the lazy way; if you actually need the synchronization part, it could be implemented with a second ‘callback’ lambda executed as a <<<1,1>>> kernel after the for-kernel, as stated here)
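
If it helps, a single nvcc invocation with those flags could look roughly like this (cudafor.cu is just a placeholder file name):

nvcc -rdc=true --extended-lambda -DCUDA_FORCE_CDP1_IF_SUPPORTED cudafor.cu -o cudafor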

And there is a distinction to be made between cudaFor and nestedCudaFor:

  • cudaFor: can be called from either host or device functions, but cannot be called from inside either a cudaFor or a nestedCudaFor.
  • nestedCudaFor: can only be called from device functions, and can be called from inside either a cudaFor or a nestedCudaFor.
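
To make that distinction concrete, here is a small sketch (processRow, row and width are made-up names): an ordinary __device__ function may call cudaFor directly, while code that is already inside a cudaFor body has to go through nestedCudaFor, as main() above does.

// Hypothetical __device__ helper: it is a regular device function (not the body
// of a cudaFor), so it is allowed to call cudaFor directly.
__device__ void processRow(int width, float* row) {
    cudaFor(width, [=](int j) -> void {  // no __device__ specifier needed in device code
        row[j] *= 2.0f;
    });
}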

Usage is quite simple since it relies on C++ lambda syntax: you pass a lambda that receives the integer position between zero and the size you specify (0 <= index < size).
There are a few extra constraints, though:

  1. Of course, dynamically allocated data must have been allocated as usual with cudaMalloc.
  2. When calling cudaFor, the __device__ specifier on the lambda is only required if the enclosing function is a host function; it is optional inside __device__ or __global__ functions. Consequently, when calling nestedCudaFor the __device__ specifier is always optional.
  3. Never pass any context variable to the lambda by reference; always capture by value. You can of course capture pointers, but never let a “&” appear inside those square brackets, as it can lead to all kinds of unexpected behaviour.
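
Putting constraints 1 and 3 together, a typical host-side call could look like the sketch below (fillScaled, d_data and factor are made-up names, and the cudaFor template from above is assumed to be in scope):

// Hypothetical host-side usage: the buffer comes from cudaMalloc (constraint 1)
// and the lambda captures the pointer and the scalar by value with [=] (constraint 3).
void fillScaled(int n, float factor) {
    float* d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));
    cudaFor(n, [=]__device__(int i) -> void {
        d_data[i] = factor * i;  // d_data and factor were captured by value
    }, true);                    // synchronize before touching the results
    // ... use or copy back d_data here ...
    cudaFree(d_data);
}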

In theory this implementation should have almost the same performance as writing the kernel functions by hand, since the CUDA compiler expands the template functions into correctly typed ones at compile time (which is also the only reason this works at all, compared to the other shenanigans I tried, like variadic functions). You’re basically making the compiler write all the boring stuff in your place!
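
To make the comparison concrete, this is roughly the hand-written boilerplate that a single cudaFor call replaces for a simple “scale every element” loop (scaleKernel, data and factor are made-up names):

// Hand-written equivalent of: cudaFor(size, [=]__device__(int i){ data[i] *= factor; });
__global__ void scaleKernel(int size, float* data, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < size) data[i] *= factor;
}
// ...and the matching launch:
// scaleKernel<<<size / MAX_THREAD + 1, MAX_THREAD>>>(size, data, factor);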

I wanted to post this because I couldn’t find any implementation like this online and it might be useful to someone else, but also because I’m curious whether you guys know better ways to achieve this.