Local memory array giving illegal access error

I’m scratching my head at this. I have a kernel which does some intensive computation on each element of my input. Each element has n channels where n is variable (from 1-16 ish) but consistant per input array.

I thought that the best way to do this is to load the element from global memory into a local array, perform your computation with the low latency reads and writes, then dump it back to global memory.

However running this for 50,000 threads with a channel count of 3 throws an illegal memory access error. I assume I’m hitting some memory allocation limit.

As its a dynamic array it can’t be allocated in the registers. I would assume it would fall into local memory, but I would of thought each thread would have enough local memory for 3 floats?

I have solved this by simply operating directly in global memory but my belief is this is far from ideal due to the added latency. (I’m coming from DirectCompute where I know this is a problem and I assume it is in cuda too)

A side note is that im running in WSL2. If this is a odd occurance I’ll move this over to there as a potential bug, but I’ll assume I’m at fault not Nvidia for now.

All thoughts and explanation welcome


#include <cuda.h>
#include <cuda_runtime.h>

__constant__ int cChannelCount;

__global__ void DynamicArrayKernel(float* input, float* output)
    int dispatchIdx = blockIdx.x * blockDim.x + threadIdx.x;
    int memOffset = dispatchIdx * cChannelCount;

    float* localArray = new float[cChannelCount];

    // Read from global
    for(int i = 0; i < cChannelCount; i++)
        localArray[i] = input[memOffset  + i];

    // ...
    // Code operating in local memory
    // ...

    // Write to global
    for(int i = 0; i < cChannelCount; i++)
        output[memOffset  + i] = localArray[i];

void TestCase(int elementCount, int channelCount)
    float* input;
    float* output;

    cudaMalloc(&input, sizeof(elementCount * channelCount));
    cudaMalloc(&output, sizeof(elementCount * channelCount));

    cudaMemcpyToSymbol(cChannelCount, &channelCount, sizeof(int));

    DynamicArrayKernel<<<elementCount, 1>>>(input, output);


// Fine
TestCase(10000, 3);

// Fails
TestCase(50000, 3);

use cudaSetDeviceLimit [https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__DEVICE.html] to set limits for the memory heap.

what you call “local” memory is actually residing in the same physical memory as the global memory space. It goes through the same cache hierarchy and has no performance benefits. In fact, the copying operations you do may actually create an overall slowdown.

You should read up on the use of “shared memory” which is local to each multiprocessor and has a much higher throughput and lower latency compared to global memory accesses. The main drawback: there’s limited space available. It’s typically 64-128kb per multiprocessor, with details varying among GPU models.

If there is no more memory to allocate, new returns nullptr.
The heap used for device-side memory allocation has limited, preconfigured size. It can be increased via API call cudaDeviceSetLimit.

From a performance perspective, memory allocated from within a kernel via new is essentially global memory.

Edit: cbuchner1 was quicker to answer.

Thanks for your replies.

I hadn’t considered using shared memory as each thread only needs the data from its own element to run. I guess it would allow me to be confident in where the array is allocated.

As for the arrays being alocated in global memory, I can see how this would remove my expected performance gains but I have plenty of free memory available to me in global so I dont see why it would throw an error in this case.

Edit: After checking with asserts it seams the potentially missnamed “localArray” is not nullptr

three floats * 50000 is not enough to exceed the default limit of 8MB for the device heap.

You might wish to try the method here to localize the error. You also may be hitting a kernel timeout. I would recommend proper CUDA error checking also, you don’t seem to be using that.

As an aside, your use of the device heap is not likely to give you better performance than correct use of global memory.

There’s no way for me to tell if it may be WSL2 specific since you’ve not provided a complete code that I could run and test in a non-WSL2 environment.