Occupancy and memory

jeetchauhan · March 25, 2010, 1:36am

Hi,

I am new to CUDA, so bear with me. I am doing some number theory work with CUDA. My problem is that I will be (or could be) creating lots of threads (potentially on the order of 10^9). Each thread computes something and, if it finds a result, needs about 6*4 bytes of storage. On my hardware, I would run out of memory if I considered the worst case and allocated memory for every thread to store a result. (Usually that's what I would do, since I could calculate an ID for each thread and use it as an offset into my allocated memory.)
Therefore, I am unable to launch a kernel with a huge number of threads. I do have an estimate though, as to how many results I may get. Is there a way for me to allocate memory according to this estimate and have all threads “append” their result? The threads will obviously span across thread blocks and I cannot synchronize across thread blocks. What are my options? Is there a type of memory that will help me do something like this?
Thanks to everyone helping out at the forum.

Cliff_Woolley · March 25, 2010, 1:41am

That’s a lot of threads. :)

Presumably you know how many threads are in each kernel launch (threads per block * blocks per grid). So to split up the work across multiple launches, you just need to pass some starting offset to each launch to tell it where in the big array to start. Or even better you could pass the kernel a pointer to its starting location within the big array. (Just take care that the starting locations are properly word-aligned.)
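
A minimal sketch of the multi-launch idea described above, assuming the full output array does fit in device memory. The kernel name `work`, the sizes `total` and `perLaunch`, and the placeholder computation are all hypothetical and not taken from the original problem:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each launch handles `count` items; `chunk` already points at this launch's
// starting location inside the big array, and `base` is the global index of
// the first item so a thread can still recover its global ID.
__global__ void work(unsigned int *chunk, unsigned long long base, unsigned int count)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= count) return;
    unsigned long long gid = base + i;        // global work-item ID
    chunk[i] = (unsigned int)(gid % 7ull);    // placeholder computation
}

int main()
{
    const unsigned long long total = 1ull << 22;  // hypothetical total work items
    const unsigned int perLaunch = 1u << 20;      // hypothetical items per launch

    unsigned int *d_out = NULL;
    if (cudaMalloc(&d_out, total * sizeof(unsigned int)) != cudaSuccess) {
        printf("cudaMalloc failed\n");
        return 1;
    }

    // Walk through the big array one launch at a time.
    for (unsigned long long base = 0; base < total; base += perLaunch) {
        unsigned long long left = total - base;
        unsigned int count = left < perLaunch ? (unsigned int)left : perLaunch;
        work<<<(count + 255) / 256, 256>>>(d_out + base, base, count);
    }

    cudaError_t err = cudaDeviceSynchronize();
    printf("status: %s\n", cudaGetErrorString(err));

    cudaFree(d_out);
    return err == cudaSuccess ? 0 : 1;
}
```

Because `cudaMalloc` returns generously aligned memory and the offsets here are whole elements, each launch's starting pointer stays aligned for the element type.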

Does that help? Let me know if you want me to elaborate.

jeetchauhan · March 25, 2010, 4:43pm

I think you misunderstood my situation. I will have just 1 launch. I launch a kernel with <<<dim3(1000,1000), dim3(8,8,8)>>> (or maybe even more blocks), and then each thread is on its own to find out where to write to. I cannot allocate 1000*1000*10*10*6*4 bytes of memory before this launch, so I want to be able to allocate only about 1500*6*4 bytes (my estimate is 1500 results per 1000*1000 thread blocks, which is a lot less than the total number of threads).

So with 1 kernel launch, how do I tell each thread where to write to? The point is, 1 kernel does all of the work (or at least I could have 1 kernel do it all if I didn't have this memory limitation).

tmurray · March 25, 2010, 4:46pm

Allocate a counter initialized to zero as well as an output array of size MAX_COUNT. Pass pointers to the output array and the counter in your kernel launch. When a thread wants to write, it calls atomicAdd(counter, 1) and uses the return value (the counter's value before the increment) as an index that it knows is safe to write to.
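
A minimal sketch of this counter-based append, under stated assumptions: the `Result` struct (6 x 4 bytes, matching the size mentioned in the question), the value of `MAX_COUNT`, the kernel name `search`, and the placeholder predicate are all hypothetical stand-ins for the real computation. The bounds check on the slot is an addition that keeps an underestimated `MAX_COUNT` from causing out-of-bounds writes:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical result record: 6 x 4 bytes.
struct Result { unsigned int v[6]; };

// Hypothetical capacity, chosen from the estimated number of results.
#define MAX_COUNT 4096u

__global__ void search(Result *out, unsigned int *counter, unsigned int n)
{
    unsigned int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id >= n) return;

    // Placeholder predicate standing in for the real number-theory test.
    bool found = (id % 1000u == 0u);
    if (!found) return;

    // atomicAdd returns the value held before the increment, so every
    // thread that reaches this line receives a distinct slot index.
    unsigned int slot = atomicAdd(counter, 1u);

    // The counter may grow past MAX_COUNT; only write while in bounds.
    if (slot < MAX_COUNT) {
        Result r;
        r.v[0] = id;
        for (int i = 1; i < 6; ++i) r.v[i] = 0u;
        out[slot] = r;
    }
}

int main()
{
    const unsigned int n = 1u << 20;  // hypothetical number of threads

    Result *d_out = NULL;
    unsigned int *d_counter = NULL;
    cudaMalloc(&d_out, MAX_COUNT * sizeof(Result));
    cudaMalloc(&d_counter, sizeof(unsigned int));
    cudaMemset(d_counter, 0, sizeof(unsigned int));  // counter starts at zero

    search<<<(n + 255) / 256, 256>>>(d_out, d_counter, n);

    // cudaMemcpy waits for the kernel on the default stream to finish.
    unsigned int count = 0;
    cudaError_t err = cudaMemcpy(&count, d_counter, sizeof(count),
                                 cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    if (count > MAX_COUNT)
        printf("found %u results, only the first %u were stored\n", count, MAX_COUNT);
    else
        printf("found %u results\n", count);

    cudaFree(d_out);
    cudaFree(d_counter);
    return 0;
}
```

The order in which results land in the array is not deterministic, since it depends on which thread reaches the atomicAdd first. If the final counter exceeds `MAX_COUNT`, the host knows the estimate was too low and can rerun with a larger buffer.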
