I’m scratching my head at this. I have a kernel which does some intensive computation on each element of my input. Each element has n channels where n is variable (from 1-16 ish) but consistant per input array.
I thought that the best way to do this is to load the element from global memory into a local array, perform your computation with the low latency reads and writes, then dump it back to global memory.
However running this for 50,000 threads with a channel count of 3 throws an illegal memory access error. I assume I’m hitting some memory allocation limit.
As its a dynamic array it can’t be allocated in the registers. I would assume it would fall into local memory, but I would of thought each thread would have enough local memory for 3 floats?
I have solved this by simply operating directly in global memory but my belief is this is far from ideal due to the added latency. (I’m coming from DirectCompute where I know this is a problem and I assume it is in cuda too)
A side note is that im running in WSL2. If this is a odd occurance I’ll move this over to there as a potential bug, but I’ll assume I’m at fault not Nvidia for now.
All thoughts and explanation welcome
Thanks
#include <cuda.h>
#include <cuda_runtime.h>
__constant__ int cChannelCount;
__global__ void DynamicArrayKernel(float* input, float* output)
{
int dispatchIdx = blockIdx.x * blockDim.x + threadIdx.x;
int memOffset = dispatchIdx * cChannelCount;
float* localArray = new float[cChannelCount];
// Read from global
for(int i = 0; i < cChannelCount; i++)
{
localArray[i] = input[memOffset + i];
}
// ...
// Code operating in local memory
// ...
// Write to global
for(int i = 0; i < cChannelCount; i++)
{
output[memOffset + i] = localArray[i];
}
}
void TestCase(int elementCount, int channelCount)
{
float* input;
float* output;
cudaMalloc(&input, sizeof(elementCount * channelCount));
cudaMalloc(&output, sizeof(elementCount * channelCount));
cudaMemcpyToSymbol(cChannelCount, &channelCount, sizeof(int));
DynamicArrayKernel<<<elementCount, 1>>>(input, output);
cudaFree(input);
cudaFree(output);
}
// Fine
TestCase(10000, 3);
// Fails
TestCase(50000, 3);