stupid memory question

Ok, this is probably in the documentation somewhere, but I either am not understanding what I’m reading, or I just haven’t found it yet:

Is it possible to get a chunk of memory that is shared inside the thread only, and that I won’t have to pass around from device function to function as an argument? i.e., I have my kernel, which calls other device functions, and I want to have the equivalent of global variables accessible to them, but I want a different instantiation for each thread. My impression is that global, constant and shared declarations get executed once per kernel call, and I want my memory chunked out once per thread. The way I’m fearing I’ll have to do this is by just doing a cudaMalloc from the host with enough space for all my threads, then pass the pointer to the kernel, and force each thread to figure out which chunk of the memory belongs to it. That seems like a lot of hassle.



There is always per-thread local memory. It's like global memory, but the accesses are always coalesced. I am not exactly sure whether that will help you.


Heard of global memory, shared memory, local memory etc… What is stupid memory? :-)

The memory which produces segfaults :)

Since I took the liberty to make some fun, I will offset it by answering this question.

Local memory is good enough. There is NO explicit qualifier that puts a variable into local memory (unlike shared memory or global memory).

  1. All local variables (like int i, j;) in your kernel are allocated to registers. However, when the kernel’s register usage grows beyond a certain point, the compiler automatically moves local variables to local memory. You do NOT have control over what goes where.

  2. Locally declared arrays (like int my_thread_data[1000]) that are indexed by values the compiler cannot resolve at compile time go into LOCAL MEMORY.

  3. There are some methods of reducing the register count. For example, declaring local variables as volatile seems to reduce a kernel’s register usage, according to a few people. But do NOT count on this. This is NOT an official stand. The next CUDA release might behave differently.

  4. NVCC provides the option “-maxrregcount” (yes, double ‘r’) to cap the number of registers per thread for ALL kernels in a CU file.

    – Variables that no longer fit in registers are spilled to local memory.

    – Note the caveat: all kernels in the CU file are affected.

  5. LOCAL MEMORY is considered slow. It resides in device memory, so an access costs about as much as a global memory access.

  6. Per-thread global memory is a good idea, and that is how local memory works too.

    – Somewhere in the middle, I think, CUDA ensures that local memory accesses are COALESCED.

    – So don’t even bother to allocate per-thread global memory. If you do allocate it, make sure that you access it in a coalesced way,

     i.e. successive elements belonging to one thread must be physically offset by blockDim.x*gridDim.x*element_size bytes, so that neighbouring threads touch neighbouring addresses.

    – By the way, that idea is not a hassle. I implemented it for a project myself, and I made sure the accesses were all coalesced.
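The interleaved layout described in point 6 could look roughly like the sketch below. It is only an illustration under assumptions, not code from anyone's project: the names scratchAt, fillScratch, runFillScratch and kElemsPerThread are made up for this example, and error checking on the CUDA calls is left out for brevity. Element i of thread tid is stored at index i*totalThreads + tid, so threads in a warp that read the same element index hit consecutive addresses.

```cuda
#include <cuda_runtime.h>
#include <cstddef>
#include <vector>

// Hypothetical per-thread array length, chosen only for this example.
constexpr int kElemsPerThread = 4;

// Element i of thread tid lives at i * totalThreads + tid, so successive
// elements of one thread are totalThreads elements apart.
__device__ inline float& scratchAt(float* base, int i, int tid, int totalThreads) {
    return base[i * totalThreads + tid];
}

__global__ void fillScratch(float* scratch) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int totalThreads = blockDim.x * gridDim.x;
    for (int i = 0; i < kElemsPerThread; ++i) {
        // Each thread writes only its own slots.
        scratchAt(scratch, i, tid, totalThreads) = tid * 100.0f + i;
    }
}

// Host helper: allocates one buffer for all threads, runs the kernel,
// and copies the buffer back. Error checking omitted.
std::vector<float> runFillScratch(int blocks, int threadsPerBlock) {
    int totalThreads = blocks * threadsPerBlock;
    std::size_t n = static_cast<std::size_t>(totalThreads) * kElemsPerThread;
    float* dScratch = nullptr;
    cudaMalloc(&dScratch, n * sizeof(float));
    fillScratch<<<blocks, threadsPerBlock>>>(dScratch);
    std::vector<float> host(n);
    cudaMemcpy(host.data(), dScratch, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dScratch);
    return host;
}
```

The only per-thread bookkeeping is the index computation inside scratchAt, so device functions need just the base pointer plus tid and totalThreads (or can recompute them from the built-in index variables).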

Hope this helped, Paul!

Best regards,


Haha, thanks. I’m still confused though. What I want to do is use static variables, but for some reason I got the impression from the programming guide that plain old static declarations don’t work. However, last night I whipped up a program, used a static, altered it both from a device function and the kernel, and it ran in emulation mode. However, I’m not sure if this is doing what I want it to (creating an instance of the static variable for each thread, and keeping it private to that thread). What do you guys think?



definitely. XD

Glad you took it easy… Thanks!

All local variables are private to the thread… Just pass them as arguments to the device function… That will work, I guess…

If in doubt, look at the PTX. Device functions are a bit tricky.
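Here is a minimal sketch of the "pass it as an argument" suggestion, assuming a reasonably recent CUDA toolkit. Only the name jstart comes from this thread; the ThreadState struct, its other members, and the functions initState, step, solve and runSolve are hypothetical and chosen just for illustration. Each thread declares its own struct instance in the kernel and hands a pointer to the device functions, which gives every thread a private copy of what would otherwise be file-scope statics. Error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <cstddef>
#include <vector>

// Stand-in for the former static variables (members are illustrative).
struct ThreadState {
    int jstart;       // name taken from the error message in this thread
    double h;         // hypothetical step size
    double work[4];   // hypothetical work array
};

__device__ void initState(ThreadState* s, int tid) {
    s->jstart = 0;
    s->h = 0.5 * tid;
    for (int i = 0; i < 4; ++i) s->work[i] = 0.0;
}

__device__ void step(ThreadState* s) {
    s->jstart += 1;
    s->work[0] += s->h;
}

__global__ void solve(double* out, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;
    ThreadState s;        // one private instance per thread
    initState(&s, tid);
    step(&s);
    step(&s);
    out[tid] = s.work[0] + s.jstart;   // 2 * 0.5 * tid + 2
}

// Host helper: runs the kernel for n threads and returns the results.
std::vector<double> runSolve(int n) {
    double* dOut = nullptr;
    cudaMalloc(&dOut, n * sizeof(double));
    int threads = 64;
    int blocks = (n + threads - 1) / threads;
    solve<<<blocks, threads>>>(dOut, n);
    std::vector<double> host(n);
    cudaMemcpy(host.data(), dOut, n * sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(dOut);
    return host;
}
```

Whether the struct ends up in registers or in local memory is up to the compiler, so checking the PTX or the reported register usage is still worthwhile.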

Ok, it looks like static variables work fine so long as I am in emulation mode, but they don’t seem to work once I turn device emulation off. Does this mean I have to pass them around in a struct or something?

I take that back. In my dummy program, statics work just fine. In my real program though, I’m getting errors like:

/Users/TraxusIV/Documents/Programming/Projects/cuLsoda/ error: identifier “jstart” is undefined

for every one of my static variables. They are defined right at the top of the file, before they are ever used, so I don’t get what’s going on here. Anyone have any ideas?

GAH!!! Ok, I just tried to put all my common variables in a struct, and pass that around by using a struct pointer, and doncha know, nvcc hates it. I have seriously like 50 or more variables (some of which are arrays) which need to be shared between the different functions that each thread calls. How do I get these guys ported around without having to do a massive rewrite?


