Occupancy and memory


I am new to CUDA, so bear with me. I am doing some number theory work with CUDA. My problem is that I will be (or can be) creating lots of threads (potentially on the order of 10^9). Each thread computes something, and if it finds a result, it needs about 6*4 bytes of storage. On my hardware, I would run out of memory if I considered the worst case and allocated memory for each thread to store a result. (Usually that's what I would do, since I could calculate an ID for each thread and use it as an offset into my allocated memory.)
Therefore, I am unable to launch a kernel with a huge number of threads. I do have an estimate, though, of how many results I may get. Is there a way for me to allocate memory according to this estimate and have all threads “append” their results? The threads will obviously span multiple thread blocks, and I cannot synchronize across thread blocks. What are my options? Is there a type of memory that would help me do something like this?
Thanks to everyone helping out at the forum.

That’s a lot of threads. :)

Presumably you know how many threads are in each kernel launch (threads per block * blocks per grid). So to split up the work across multiple launches, you just need to pass some starting offset to each launch to tell it where in the big array to start. Or even better you could pass the kernel a pointer to its starting location within the big array. (Just take care that the starting locations are properly word-aligned.)
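A minimal host-side sketch of that offset scheme (the kernel name `chunkKernel`, the sizes, and the one-int-per-thread result layout are all made up for illustration):

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: each launch handles one chunk of the work.
// 'chunk_base' is the global index of this launch's first thread,
// and 'results' already points at this launch's slice of the big array.
__global__ void chunkKernel(int *results, long long chunk_base)
{
    long long gid = chunk_base + blockIdx.x * (long long)blockDim.x + threadIdx.x;
    // ... compute something from gid and write into results[...] ...
    (void)gid; (void)results;
}

int main()
{
    const long long TOTAL = 1LL << 24;  // total work items (kept small here)
    const int THREADS = 256, BLOCKS = 4096;
    const long long PER_LAUNCH = (long long)THREADS * BLOCKS;

    int *d_results;
    cudaMalloc(&d_results, TOTAL * sizeof(int));

    for (long long start = 0; start < TOTAL; start += PER_LAUNCH) {
        // Pass this launch a pointer to its starting location within the
        // big array; int-sized elements keep the pointer word-aligned.
        chunkKernel<<<BLOCKS, THREADS>>>(d_results + start, start);
    }
    cudaDeviceSynchronize();
    cudaFree(d_results);
    return 0;
}
```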

Does that help? Let me know if you want me to elaborate.

I think you misunderstood my situation. I will be doing just one launch. I launch a kernel with <<<dim3(1000,1000), dim3(8,8,8)>>> (or maybe even more blocks), and then each thread is on its own to find out where to write. I cannot allocate 1000*1000*512*6*4 bytes of memory before this launch, so I want to allocate only about 1500*6*4 bytes (my estimate is roughly 1500 results per 1000*1000 thread blocks, which is far fewer than the total number of threads).

So with one kernel launch, how do I tell each thread where to write? The point is that one kernel does all the work (or at least it could, if I didn't have this memory limitation).

Allocate a counter initialized to zero, as well as an output array of size MAX_COUNT. Pass pointers to the output array and the counter in your kernel launch. When a thread wants to write, call atomicAdd(counter, 1) and use the return value as the index it is safe to write to.
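A sketch of that atomic-append pattern, using the 6*4-byte result size from the original post (the `found` predicate is a stand-in for the real number-theory test):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define MAX_COUNT 1500          // estimated upper bound on results

// Each result is 6 ints (6*4 = 24 bytes), per the original post.
struct Result { int v[6]; };

__global__ void search(Result *out, unsigned int *counter, long long n)
{
    long long tid = blockIdx.x * (long long)blockDim.x + threadIdx.x;
    if (tid >= n) return;

    // Stand-in predicate: in the real code this would be whatever
    // number-theory test each thread performs.
    bool found = (tid % 100000 == 0);

    if (found) {
        // Atomically reserve a unique slot in the output array.
        unsigned int slot = atomicAdd(counter, 1u);
        if (slot < MAX_COUNT) {       // guard against overflowing the estimate
            for (int i = 0; i < 6; ++i)
                out[slot].v[i] = (int)tid + i;
        }
    }
}

int main()
{
    Result *d_out;
    unsigned int *d_counter;
    cudaMalloc(&d_out, MAX_COUNT * sizeof(Result));
    cudaMalloc(&d_counter, sizeof(unsigned int));
    cudaMemset(d_counter, 0, sizeof(unsigned int));  // counter starts at zero

    long long n = 1LL << 20;
    search<<<(int)((n + 255) / 256), 256>>>(d_out, d_counter, n);

    // Read the counter back to learn how many results were appended.
    unsigned int count;
    cudaMemcpy(&count, d_counter, sizeof(count), cudaMemcpyDeviceToHost);
    printf("results found: %u\n", count);

    cudaFree(d_out);
    cudaFree(d_counter);
    return 0;
}
```

Note that if the counter overshoots MAX_COUNT, the extra results are dropped; checking the final counter value on the host tells you whether your estimate was too low.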