The motivation for my question: I am translating C++ functions with single and doubly nested for loops to CUDA. How should I allocate the threads in the x and y dimensions?
In the CUDA kernel, I assume I am computing the thread indices correctly as -
int64_t threadx_dim = blockIdx.x * blockDim.x + threadIdx.x;
int64_t thready_dim = blockIdx.y * blockDim.y + threadIdx.y;
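For context, here is how those indices would sit in a minimal 2D kernel. This is a sketch only, with hypothetical names (out, nx, ny); the bounds guard matters because the grid gets rounded up, so some threads land past the loop limits:

```cuda
// Minimal sketch of a doubly nested loop translated to a 2D kernel.
// out, nx, ny are hypothetical names; the cast widens to 64 bits before
// the multiply so very large grids do not overflow 32-bit arithmetic.
__global__ void nested_loop_kernel(float* out, int64_t nx, int64_t ny) {
    int64_t threadx_dim = (int64_t)blockIdx.x * blockDim.x + threadIdx.x;
    int64_t thready_dim = (int64_t)blockIdx.y * blockDim.y + threadIdx.y;
    if (threadx_dim < nx && thready_dim < ny) {
        out[threadx_dim * ny + thready_dim] = 0.0f;  // loop body goes here
    }
}
```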
But I am not sure if I am allocating them correctly on the host function.
When I was translating C++ functions with only single for loops, or multiple for loops none of which were nested, I did -
if (std::max(arg1, arg2) > 1024) {
    blocks_per_grid = dim3(ceil(std::max(arg1, arg2) / 1024.0), 1, 1);
    threads_per_block = dim3(1024, 1, 1);
} else {
    blocks_per_grid = dim3(1, 1, 1);
    threads_per_block = dim3(std::max(arg1, arg2), 1, 1);
}
where arg1 and arg2 are the limits of the two non-nested for loops.
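As a host-only sanity check, this 1D sizing can be written as a pair of small helpers. Dim3 here is a plain stand-in struct, not CUDA's dim3, and I use integer ceiling division instead of ceil on a double, which gives the same results without floating-point rounding:

```cpp
#include <cstdint>

// Stand-in for CUDA's dim3 so the sizing logic can be tested on the host.
struct Dim3 { unsigned x, y, z; };

// 1D sizing as above: one thread per loop iteration, capped at
// 1024 threads per block, grid rounded up via integer ceiling division.
Dim3 grid_1d(int64_t n) {
    if (n > 1024) return { static_cast<unsigned>((n + 1023) / 1024), 1, 1 };
    return { 1, 1, 1 };
}

Dim3 block_1d(int64_t n) {
    if (n > 1024) return { 1024, 1, 1 };
    return { static_cast<unsigned>(n), 1, 1 };
}
```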
When it comes to assigning threads in the y dimension as well, I imagine the allocation might look something like this -
if (arg1 > 1024 && arg2 > 1024) {
    blocks_per_grid = dim3(ceil(arg1 / 1024.0), ceil(arg2 / 1024.0), 1);
    threads_per_block = dim3(1024, 1024, 1);
} else if (arg1 > 1024) {
    blocks_per_grid = dim3(ceil(arg1 / 1024.0), 1, 1);
    threads_per_block = dim3(1024, arg2, 1);
} else if (arg2 > 1024) {
    blocks_per_grid = dim3(1, ceil(arg2 / 1024.0), 1);
    threads_per_block = dim3(arg1, 1024, 1);
} else {
    blocks_per_grid = dim3(1, 1, 1);
    threads_per_block = dim3(arg1, arg2, 1);
}
where arg1 is the limit of the outer loop and arg2 is the limit of the inner loop.
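Similarly, the four-branch 2D sizing above can be mirrored host-side for testing (again with a stand-in Dim3 struct rather than CUDA's dim3; whether every block shape this produces is a valid launch configuration is exactly the part I am unsure about):

```cpp
#include <cstdint>

// Host-side stand-ins so the branch logic can be unit-tested.
struct Dim3 { unsigned x, y, z; };
struct Launch { Dim3 grid, block; };

// Mirrors the branching above: for each dimension whose loop limit
// exceeds 1024, round the grid up; otherwise put the whole loop
// extent into a single block along that dimension.
Launch size_2d(int64_t arg1, int64_t arg2) {
    auto cdiv = [](int64_t n) { return static_cast<unsigned>((n + 1023) / 1024); };
    if (arg1 > 1024 && arg2 > 1024)
        return { { cdiv(arg1), cdiv(arg2), 1 }, { 1024, 1024, 1 } };
    if (arg1 > 1024)
        return { { cdiv(arg1), 1, 1 }, { 1024, static_cast<unsigned>(arg2), 1 } };
    if (arg2 > 1024)
        return { { 1, cdiv(arg2), 1 }, { static_cast<unsigned>(arg1), 1024, 1 } };
    return { { 1, 1, 1 }, { static_cast<unsigned>(arg1), static_cast<unsigned>(arg2), 1 } };
}
```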
Is there a cleaner way to do this?