stupid memory question

PTThompson · July 20, 2009, 7:00pm

Ok, this is probably in the documentation somewhere, but I either am not understanding what I’m reading, or I just haven’t found it yet:

Is it possible to get a chunk of memory that is shared inside the thread only, and that I won’t have to pass around from device function to function as an argument? i.e., I have my kernel, which calls other device functions, and I want to have the equivalent of global variables accessible to them, but I want a different instantiation for each thread. My impression is that global, constant and shared declarations get executed once per kernel call, and I want my memory chunked out once per thread. The way I’m fearing I’ll have to do this is by just doing a cudaMalloc from the host with enough space for all my threads, then pass the pointer to the kernel, and force each thread to figure out which chunk of the memory belongs to it. That seems like a lot of hassle.

Thanks,

paul

nitin.life · July 20, 2009, 7:47pm

There is always the per-thread local memory; it's like global memory, but access is always coalesced. I am not exactly sure if that will help you.

Thanks

Sarnath · July 21, 2009, 4:41am

Heard of global memory, shared memory, local memory etc… What is stupid memory? :-)

nitin.life · July 21, 2009, 4:46am

The memory which produces segfaults :)

Sarnath · July 21, 2009, 5:11am

Since I took the liberty to make some fun, I will offset it by answering this question.

Local memory is good enough. You do NOT have an explicit qualifier that would put variables into local memory (unlike shared mem or global mem).

All local variables (like int i, j;) in your kernel are allocated to registers. However, when the kernel’s register usage grows beyond a point, the compiler automatically moves local variables (like int i, j;) to local memory. You do NOT have control over what goes where.

Arrays declared locally which are indexed by variables ALWAYS go into LOCAL MEMORY (like int my_thread_data[1000]).

There are some methods of reducing the “register” count. For example, declaring “local variables” as volatile seems to reduce the kernel’s register usage, according to a few. But do NOT count on this. This is NOT an official stand. The next CUDA release might behave differently.

NVCC provides an option “-maxrregcount” (yes, double ‘r’) to cap the number of registers for ALL kernels in a CU file.

– Extra variables will be stored in local memory.

– Note the caveat: all kernels in the CU file will be affected.

LOCAL MEMORY is considered slow.

The per-thread global memory is a good idea, and that is how local memory works too.

– Somewhere in the middle, I think, CUDA ensures that the local memory accesses are COALESCED.

– So, don’t even bother to allocate per-thread global memory. And even if you do allocate it, make sure that you access it in a coalesced way,
```
 i.e. successive elements of each thread must be physically offset by blockDim.x*gridDim.x*element_size_of_the_array
```
– btw, that idea is not a hassle. I implemented this for a project myself. I made sure the accesses were all coalesced.
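
For what it's worth, here is a minimal sketch of the interleaved per-thread layout described above, assuming a 1D grid of 1D blocks. The names `kElemsPerThread`, `scratchAt`, `fillScratch` and `runFillScratch` are made up for illustration; they are not from any CUDA API or from the project mentioned in this thread.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical size, chosen only for this sketch.
constexpr int kElemsPerThread = 4;

// Returns element k of the calling thread's private array.
// Element k of every thread is stored contiguously, so consecutive threads
// touch consecutive addresses for the same k (the coalescing-friendly layout).
// The stride between a thread's own successive elements is
// blockDim.x * gridDim.x elements.
__device__ inline float &scratchAt(float *scratch, int k)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int totalThreads = blockDim.x * gridDim.x;
    return scratch[k * totalThreads + tid];
}

// Each thread writes a recognizable value into each of its own elements.
__global__ void fillScratch(float *scratch)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    for (int k = 0; k < kElemsPerThread; ++k)
        scratchAt(scratch, k) = tid * 100.0f + k;
}

// Host helper: allocates one buffer for all threads, launches the kernel,
// and copies the buffer back. Returns an empty vector if a CUDA call fails.
std::vector<float> runFillScratch(int blocks, int threadsPerBlock)
{
    size_t n = size_t(blocks) * threadsPerBlock * kElemsPerThread;
    float *d = nullptr;
    if (cudaMalloc(&d, n * sizeof(float)) != cudaSuccess) return {};
    fillScratch<<<blocks, threadsPerBlock>>>(d);
    std::vector<float> h(n);
    cudaError_t err = cudaMemcpy(h.data(), d, n * sizeof(float),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d);
    if (err != cudaSuccess) return {};
    return h;
}
```

Calling `runFillScratch(2, 32)` from `main` gives back the whole buffer, where element k of thread t sits at index `k * 64 + t`. Only the single pointer has to be passed down to device functions; each thread recomputes its own slot from the built-in index variables.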

Hope this helped, Paul!

Best Regards,

Sarnath

PTThompson · July 21, 2009, 2:57pm

Haha, thanks. I’m still confused though. What I want to do is use static variables, but for some reason I got the impression from the programming guide that plain old static declarations don’t work. However, last night I whipped up a program, used a static, altered it both from a device function and the kernel, and it ran in emulation mode. However, I’m not sure if this is doing what I want it to (creating an instance of the static variable for each thread, and keeping it private to that thread). What do you guys think?

Thanks,

paul

PTThompson · July 21, 2009, 2:58pm

definitely. XD

Sarnath · July 22, 2009, 7:19am

Glad u took it easy… Thanks!

All local variables are private to the thread… Just pass them as arguments to the device function… That will work, I guess…

If in doubt, look @ the PTX – device functions are a bit tricky.
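
A rough sketch of that suggestion, assuming the shared variables are bundled into one struct that lives as a local variable in the kernel and is passed by pointer to each device function. `ThreadState`, `initState`, `step`, `stateKernel` and `runStateDemo` are hypothetical names for illustration; the `jstart` field just borrows a variable name mentioned later in this thread.

```cuda
#include <cuda_runtime.h>

// Hypothetical bundle of per-thread "global" variables.
struct ThreadState {
    int   jstart;
    float work[8];
};

// Device functions take a pointer to the calling thread's own state.
__device__ void initState(ThreadState *s, int tid)
{
    s->jstart = tid;
    for (int i = 0; i < 8; ++i) s->work[i] = 0.0f;
}

__device__ void step(ThreadState *s)
{
    s->work[s->jstart % 8] += 1.0f;
    s->jstart += 1;
}

__global__ void stateKernel(int *out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    ThreadState state;        // a local variable, so each thread gets its own copy
    initState(&state, tid);
    step(&state);
    step(&state);
    out[tid] = state.jstart;  // tid + 2 after two steps
}

// Host helper: runs one block of `threads` threads and returns out[index],
// or -1 if a CUDA call fails.
int runStateDemo(int threads, int index)
{
    int *d = nullptr;
    if (cudaMalloc(&d, threads * sizeof(int)) != cudaSuccess) return -1;
    stateKernel<<<1, threads>>>(d);
    int result = -1;
    cudaError_t err = cudaMemcpy(&result, d + index, sizeof(int),
                                 cudaMemcpyDeviceToHost);
    cudaFree(d);
    return err == cudaSuccess ? result : -1;
}
```

The trade-off is one extra pointer argument on every device function, in exchange for not needing any file-scope variables at all. Whether the struct ends up in registers or local memory is up to the compiler.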

PTThompson · July 27, 2009, 6:45pm

Ok, it looks like static variables work fine so long as I am in emulation mode, but they don’t seem to work once I turn device emulation off. Does this mean I have to pass them around in a struct or something?

PTThompson · July 27, 2009, 7:04pm

I take that back. In my dummy program, statics work just fine. In my real program though, I’m getting errors like:

/Users/TraxusIV/Documents/Programming/Projects/cuLsoda/cuLsoda.cu(2455): error: identifier “jstart” is undefined

for every one of my static variables. They are defined right at the top of the file, before they are ever used, so I don’t get what’s going on here. Anyone have any ideas?

PTThompson · July 27, 2009, 8:27pm

GAH!!! Ok, I just tried to put all my common variables in a struct, and pass that around by using a struct pointer, and doncha know, nvcc hates it. I have seriously like 50 or more variables (some of which are arrays) which need to be shared between the different functions that each thread calls. How do I get these guys ported around without having to do a massive rewrite?

thanks

Paul

PTThompson · July 28, 2009, 4:10pm

… disregard (deleting this comment)
