Is there any way of exposing local memory to other threads inside a warp?

Hello,

I am currently designing a tree traversal on the GPU that takes inspiration from http://devblogs.nvidia.com/parallelforall/thinking-parallel-part-ii-tree-traversal-gpu/

This uses a “stack” in local memory for each node explored.

So I want to take advantage of the fact that each thread in the warp is running concurrently to do parallel work on each thread's node.

So is there a way to expose the local memory to the other threads without copying it to shared/global memory, e.g. by passing a pointer to the local memory?

I am not sure this makes sense. The “local” in “local memory” means “thread local”. This is the equivalent of TLS in host code and what is stored there is intended to be private to each thread.
Shared memory on the other hand is memory that is intended to be shared between the threads in a thread block, so why not use that?
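To make the suggestion concrete, here is a minimal sketch of what per-thread traversal stacks in shared memory could look like. STACK_DEPTH, BLOCK_SIZE, and the kernel name are illustrative choices, not from the original post; the [depth][thread] layout is chosen so that adjacent lanes touch different shared-memory banks.

```cuda
#define STACK_DEPTH 64    // illustrative maximum stack depth per thread
#define BLOCK_SIZE  128   // illustrative block size

__global__ void traverse(const int *nodes)
{
    // One small stack per thread. Indexing as stack[top][threadIdx.x]
    // means consecutive threads access consecutive addresses, which is
    // bank-conflict free. Total footprint: 64*128*4 bytes = 32 KB.
    __shared__ int stack[STACK_DEPTH][BLOCK_SIZE];
    int top = -1;                      // stack pointer stays in a register

    stack[++top][threadIdx.x] = 0;     // push the root node
    while (top >= 0) {
        int node = stack[top--][threadIdx.x];   // pop
        // ... visit 'node'; because the stacks live in shared memory,
        // any thread in the block could also read another thread's
        // entries here, which is exactly what local memory cannot offer ...
    }
}
```

The 32 KB footprint also illustrates the size pressure discussed below: the stack depth and block size have to be traded off against the shared-memory capacity of the device.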

The reason I can't use shared memory is that it is too small, and moving the memory will be too slow, as only one thread can move data from local to shared memory.

Only the owning thread can move data from its local memory to any other destination. It is thread-local memory.

If shared memory is too small, the only other multi-thread accessible resource (of that scale) is global memory. Large data structures that are resident in local memory will have approximately the same access patterns as global memory anyway. For example, if I have:

__device__ int g_data[NUM_THREADS*4096];

__global__ void mykernel(...){

int data[4096];

int my_global_thread_idx = threadIdx.x + blockDim.x*blockIdx.x;
...

for (int i = 0; i < 4096; i++)
  g_data[(i*NUM_THREADS)+(my_global_thread_idx)] = data[i]; // stride by the total thread count so consecutive threads write consecutive addresses

...
}

As long as all threads in a warp are running the above code, the data copy will be occurring from off-chip memory to off-chip memory (local to global) but all accesses (both read and write) will be fully coalesced. Whether or not this particular global storage pattern is convenient would depend on your code.
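Once the data has been staged in g_data with that layout, any thread can read another thread's elements, which is the original goal of exposing the per-thread data. A hypothetical consumer kernel, assuming the same NUM_THREADS constant and layout as the store above (the neighbor choice is just for illustration):

```cuda
#define NUM_THREADS 4096  // illustrative; must match the producer kernel's layout

__global__ void consume(const int *g_data)
{
    int tid  = threadIdx.x + blockDim.x * blockIdx.x;
    int left = (tid == 0) ? tid : tid - 1;   // a neighboring thread's global index

    for (int i = 0; i < 4096; i++) {
        // Same layout as the store: element i of thread 'left'.
        int v = g_data[i * NUM_THREADS + left];
        // ... combine v with this thread's own work ...
    }
}
```

Note that reads of a neighboring lane's elements remain coalesced, since adjacent threads still touch adjacent addresses.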

The reason I can't use shared memory is that it is too small
Hmm, but if the data will not fit into shared memory, it probably will not fit into the caches used for local memory either. Hence access to local memory may be very slow.

and moving the memory will be too slow as only one thread can move from local to shared memory.
There is no need to stage the data in local memory at all. Each thread has full read/write access to shared memory, so the copy is not limited to one thread: all the threads can be used.
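A sketch of what "all the threads can be used" means in practice: the whole block cooperates to copy a tile from global into shared memory, one strided chunk per thread. TILE_SIZE and the kernel name are illustrative, not from the original post.

```cuda
#define TILE_SIZE 4096   // illustrative tile size; 4096 ints = 16 KB of shared memory

__global__ void cooperative_copy(const int *src)
{
    __shared__ int tile[TILE_SIZE];

    // Each thread copies elements threadIdx.x, threadIdx.x + blockDim.x, ...
    // so consecutive threads read consecutive global addresses (coalesced),
    // and every thread participates in the copy.
    for (int i = threadIdx.x; i < TILE_SIZE; i += blockDim.x)
        tile[i] = src[blockIdx.x * TILE_SIZE + i];

    __syncthreads();   // make the whole tile visible to every thread in the block

    // ... all threads can now read and write any element of 'tile' ...
}
```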
Bill