Scope of declared variables inside __global__ kernels.

Say I launch <<<grid, block>>> instances of cudaKernel(), and inside the kernel I have the following declaration:

__global__ void cudaKernel(..., int vectorSize)
{
    float *myVector = new float[vectorSize];  // executed by every thread
    for (int i = 0; i < vectorSize; ++i)
        myVector[i] = 0.0f;  // some data
    delete[] myVector;
}

Will each thread have its own instance of “myVector”, will all threads within a block share one instance, or will all threads across all blocks share the very same instance? I added this line:

printf("Thread: %d; address of myVector: %p", threadId, &myVector);

But all threads returned “…3FFFC48” as the address; perhaps that address is local to each thread. My intention is to let each thread access its own set of 8 vectors, each with 10-20 elements (4 of them holding floats and the other 4 holding ints), plus 15-17 variables of different types (int, __int64, float, curandState, …), including those used in for loops.

Am I missing something? I want to ensure that all vectors and variables stay within the cache and registers so as to avoid global memory accesses. There must be enough room in local memory to house data for a few thousand of these threads…

Yes, each thread will have its own instance of myVector.
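If you want to check this yourself, here is a minimal sketch (not taken from the original code). Note that the printf in the question prints &myVector, which is the address of the pointer variable itself, not of the buffer it points to. That variable sits in thread-private storage, so it is plausible for every thread to report the same number while still having its own copy. The sketch below records both values; the kernel names, the helper countDistinctAllocations() and the launch sizes are my own assumptions for illustration. Freeing is done in a second kernel so that no address can be reused by another thread before it has been recorded.

```cuda
#include <cstdio>
#include <set>
#include <vector>
#include <cuda_runtime.h>

// Every thread allocates its own buffer on the device heap and records
// (a) the address of its pointer variable and (b) the pointer value.
__global__ void allocKernel(float **ptrs, unsigned long long *varAddr, int vectorSize)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float *myVector = new float[vectorSize];        // nullptr if the heap is exhausted
    varAddr[tid] = (unsigned long long)&myVector;   // what the printf in the question showed
    ptrs[tid] = myVector;                           // where the data actually lives
}

// Device-heap allocations outlive the kernel, so release them explicitly.
__global__ void freeKernel(float **ptrs)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    delete[] ptrs[tid];
}

// Hypothetical helper: returns how many distinct non-null buffers were handed out.
int countDistinctAllocations(int blocks, int threads, int vectorSize)
{
    int n = blocks * threads;
    float **dPtrs = nullptr;
    unsigned long long *dVar = nullptr;
    cudaMalloc(&dPtrs, n * sizeof(float *));
    cudaMalloc(&dVar, n * sizeof(unsigned long long));

    allocKernel<<<blocks, threads>>>(dPtrs, dVar, vectorSize);

    std::vector<float *> hPtrs(n);
    std::vector<unsigned long long> hVar(n);
    cudaMemcpy(hPtrs.data(), dPtrs, n * sizeof(float *), cudaMemcpyDeviceToHost);
    cudaMemcpy(hVar.data(), dVar, n * sizeof(unsigned long long), cudaMemcpyDeviceToHost);

    freeKernel<<<blocks, threads>>>(dPtrs);
    cudaDeviceSynchronize();
    cudaFree(dPtrs);
    cudaFree(dVar);

    std::set<float *> buffers(hPtrs.begin(), hPtrs.end());
    buffers.erase(static_cast<float *>(nullptr));   // ignore failed allocations
    std::set<unsigned long long> vars(hVar.begin(), hVar.end());
    printf("distinct &myVector values: %zu, distinct buffers: %zu\n",
           vars.size(), buffers.size());
    return (int)buffers.size();
}

int main()
{
    // 2 blocks x 32 threads, 16 floats each: far below the default heap size.
    int distinct = countDistinctAllocations(2, 32, 16);
    return distinct == 64 ? 0 : 1;
}
```

The count of distinct buffers should equal the number of threads, whatever the first number turns out to be on your GPU.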

There are various limits on the total amount of local memory that is available in this fashion. These limits are published in the programming guide.

If the compiler can detect at compile time that the limit has been exceeded, it will also issue a warning.

Thank you, that information was very useful in my troubleshooting process.

Actually, I need to make some corrections. Each thread will have its own instance of myVector in either of the following two cases.

  1. This is an “ordinary” local memory definition (vectorSize has to be a compile-time constant for this form to compile):
float myVector[vectorSize];

such an allocation will be limited by the “local memory” limits given in the programming guide:

(“amount of local memory per thread”)

  2. However, a dynamic allocation like this:
float *myVector = new float[vectorSize];

comes out of the “device heap”, and it has a different set of requirements associated with it:

(it has the same behavioral rules as in-kernel malloc)

This second case is what you are showing. The key caveat I would point out is that the device heap is initially limited to a total of 8MB across all such allocations: every thread that allocates in this fashion draws against the same 8MB space, although each thread gets a separate, independent allocation. So you should be aware of the limit. It can be raised with a CUDA runtime API call (cudaDeviceSetLimit with cudaLimitMallocHeapSize).
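As a rough sketch of how the limit could be inspected and raised: cudaDeviceGetLimit and cudaDeviceSetLimit with cudaLimitMallocHeapSize are the runtime calls involved, while the helper name ensureDeviceHeap() and the 4096-thread figure are assumptions of mine, chosen only to make the arithmetic concrete. The limit has to be set before the first kernel that uses new or malloc is launched.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: make sure the device heap is at least `bytes` large
// and return the size the runtime reports afterwards.
// Call it before launching any kernel that uses in-kernel new/malloc.
size_t ensureDeviceHeap(size_t bytes)
{
    size_t current = 0;
    cudaDeviceGetLimit(&current, cudaLimitMallocHeapSize);
    if (current < bytes) {
        cudaError_t err = cudaDeviceSetLimit(cudaLimitMallocHeapSize, bytes);
        if (err != cudaSuccess)
            printf("cudaDeviceSetLimit failed: %s\n", cudaGetErrorString(err));
    }
    cudaDeviceGetLimit(&current, cudaLimitMallocHeapSize);
    return current;
}

int main()
{
    // Payload estimate for the case in the question: 8 vectors of up to
    // 20 four-byte elements per thread. The thread count is illustrative.
    size_t perThread = 8 * 20 * sizeof(float);
    size_t payload = 4096 * perThread;
    printf("estimated payload: %zu bytes\n", payload);

    // Ask for a generous 64 MB, since the allocator adds its own
    // bookkeeping on top of the payload.
    size_t heap = ensureDeviceHeap(64u << 20);
    printf("device heap size now: %zu bytes\n", heap);
    return 0;
}
```

For the sizes mentioned in the question the payload alone stays under the default 8MB, but many small allocations carry per-allocation overhead, so leaving some headroom seems prudent.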

With either method, as I already stated, the allocation is separate and independent for each thread.
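For completeness, here is a sketch of the first form (per-thread arrays with a compile-time size), which avoids the device heap entirely. The bound kMaxLen = 20 is an assumption based on the "10-20 elements" in the question, and the kernel and helper names are made up for the example. Keep in mind that the compiler decides where such arrays live: arrays indexed with run-time values generally end up in local memory, which is thread-private and cached but physically backed by device memory, so this does not by itself guarantee that everything stays in registers.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int kMaxLen = 20;   // assumed upper bound on the vector length

// Each thread owns private arrays; no other thread can see them.
__global__ void localArrays(float *out, int len)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (len > kMaxLen) len = kMaxLen;   // stay inside the fixed-size arrays

    float fVec[kMaxLen];
    int   iVec[kMaxLen];
    for (int i = 0; i < len; ++i) {
        fVec[i] = tid + 0.5f * i;
        iVec[i] = tid * i;
    }

    float sum = 0.0f;
    for (int i = 0; i < len; ++i)
        sum += fVec[i] + iVec[i];
    out[tid] = sum;   // one result per thread
}

// Hypothetical helper: runs the kernel and counts threads whose result
// differs from the closed-form value computed on the host.
int runLocalArrays(int blocks, int threads, int len)
{
    int n = blocks * threads;
    float *dOut = nullptr;
    cudaMalloc(&dOut, n * sizeof(float));
    localArrays<<<blocks, threads>>>(dOut, len);

    std::vector<float> hOut(n);
    cudaMemcpy(hOut.data(), dOut, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dOut);

    if (len > kMaxLen) len = kMaxLen;
    float pairs = 0.5f * len * (len - 1);   // 0 + 1 + ... + (len - 1)
    int mismatches = 0;
    for (int tid = 0; tid < n; ++tid) {
        float expected = len * tid + 0.5f * pairs + tid * pairs;
        if (hOut[tid] != expected) ++mismatches;
    }
    return mismatches;
}

int main()
{
    int mismatches = runLocalArrays(2, 32, 10);
    printf("threads with unexpected results: %d\n", mismatches);
    return mismatches;
}
```

If every thread reports the value derived from its own tid, the arrays were indeed private to each thread.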