Suppose I launch cudaKernel() with an execution configuration of <<<grid, block>>>, and inside the kernel I have the following declaration:
__global__ void cudaKernel(..., int vectorSize)
{
    ...
    // one allocation per thread, taken from the device heap
    float *myVector = new float[vectorSize];
    ...
    for (int i = 0; i < vectorSize; ++i)
        myVector[i] = /* some data */;
    ...
}
Will each thread have its own instance of myVector, will all threads within a block share the same instance, or will all threads across all blocks share the very same instance of “myVector”? I added the line:
printf("Thread: %d; address of myVector: %p", threadId, &myVector);
But every thread printed an address ending in “…3FFFC48”; perhaps that address is local to each thread. My intention is for each thread to work on its own set of 8 vectors, each with 10-20 elements (4 holding floats, the other 4 holding ints), plus 15-17 variables of assorted types (int, __int64, float, curandState, …), including those used in for loops.
Am I missing something? I want all of these vectors and variables to stay in registers and cache so that global memory accesses are avoided. There should be enough room in local memory to house the data for a few thousand of these threads…
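For concreteness, below is a minimal sketch of the per-thread layout I am aiming for; the fixed sizes, the names, and the curand usage are illustrative assumptions, not my actual code:

#include <curand_kernel.h>

__global__ void cudaKernelSketch(unsigned long long seed)
{
    // With compile-time array sizes, the compiler can keep these in
    // registers or spill them to per-thread local memory as needed.
    float fVec0[16], fVec1[16], fVec2[16], fVec3[16];  // 4 float vectors
    int   iVec0[16], iVec1[16], iVec2[16], iVec3[16];  // 4 int vectors
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    curandState rng;                  // per-thread RNG state
    curand_init(seed, tid, 0, &rng);

    for (int i = 0; i < 16; ++i) {
        fVec0[i] = curand_uniform(&rng);
        iVec0[i] = tid + i;
        // ... fill the remaining vectors similarly
    }
}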
(In-kernel new follows the same behavioral rules as in-kernel malloc.)
This second case is what you are showing, and the key caveat I would point out is that the device heap is limited by default to a total of 8MB across all such allocations: every thread that allocates in this fashion draws from the same 8MB pool, although each thread gets its own separate, independent allocation. Be aware of that limit; it can be raised with a CUDA runtime API call.
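A minimal sketch of raising that limit, assuming an illustrative 128MB heap; the cudaDeviceSetLimit call must be made before launching any kernel that allocates from the device heap:

#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    // Grow the device heap from the 8MB default to 128MB (illustrative size).
    size_t heapBytes = 128 * 1024 * 1024;
    cudaError_t err = cudaDeviceSetLimit(cudaLimitMallocHeapSize, heapBytes);
    if (err != cudaSuccess) {
        printf("cudaDeviceSetLimit failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    size_t actual = 0;
    cudaDeviceGetLimit(&actual, cudaLimitMallocHeapSize);
    printf("Device heap size: %zu bytes\n", actual);
    return 0;
}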
Either method, as I already stated, yields a separate and independent allocation for each thread.
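To observe that independence directly, here is a minimal sketch (my illustrative code, not from the original post): printing the pointer value myVector, rather than &myVector, shows a distinct device-heap address for each thread:

#include <cuda_runtime.h>
#include <cstdio>

__global__ void perThreadAlloc(int vectorSize)
{
    float *myVector = new float[vectorSize];  // one heap allocation per thread
    if (myVector != nullptr) {
        // The pointer VALUE is the device-heap address and differs per thread.
        // By contrast, &myVector (the address of the local pointer variable)
        // typically prints the same value for every thread, since each thread
        // addresses its own local-memory window identically.
        printf("Thread %d: myVector = %p\n",
               blockIdx.x * blockDim.x + threadIdx.x, (void *)myVector);
        delete[] myVector;  // free the per-thread allocation
    }
}

int main()
{
    perThreadAlloc<<<2, 4>>>(16);
    cudaDeviceSynchronize();
    return 0;
}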