I’m quite new to CUDA programming and I have a question: is it possible to get an identifier for the current thread running on the device? What would be best for me is an identifier that is unique across blocks (but not necessarily across devices) and that ranges from 0 to the maximum number of threads that can run concurrently on the device.
The reason for this is the following:
I’m trying to port a sequential algorithm to CUDA. In this algorithm I need to update counters, and what I would like is for each thread to have its own copy of the counters; at the end, I would just need to sum the copies and everything would be fine. It means that if I have k counters, I would create a matrix containing k*nbthreads counters. The only problem is that I need a lot of counters (~60000) and the number of threads can also be quite large (far more than the maximum number of threads that can run concurrently on the device). This is why I was thinking that if I could map each thread to an identifier between 0 and that maximum, my matrix would be much smaller and could fit in the memory of the device.
Do you know how I can have such identifier? Or do you have any hint that would avoid using such identifier?
Thanks for the answer. Unfortunately, this solution doesn’t match my needs: idx ranges from 0 to nbblock*blocksize, which is a much larger range than 0 to the maximum number of threads that can run concurrently on the device.
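For reference, the standard global index that the earlier reply presumably suggested looks like the sketch below (the kernel and array names are made up for illustration). It shows why the range is tied to the launch size rather than to the hardware:

```cuda
__global__ void update_counters(int *counters)
{
    // idx is unique per *launched* thread and ranges over
    // 0 .. gridDim.x * blockDim.x - 1, i.e. nbblock * blocksize slots.
    // That can be far more than the number of threads actually
    // resident on the device at any moment.
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    counters[idx] += 1;  // each thread touches only its own slot
}
```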
What I would like is more something like the identifier of the ALU running the thread. That way I wouldn’t waste too much memory, keeping only one copy of my counters per ALU. Do you know of something like that?
That’s an approach I could look into, thanks. But will 1.2 MB fit into the memory available to each thread?
If there were only 60k counters it wouldn’t be a problem, I agree. The problem is that without any optimization I would need 60k counters for each of my 10 million threads. This means a lot of memory…
Ah, I see now. Your original post was not clear that there were to be 60k counters per thread. That is indeed far too much for either registers or shared memory.
Does every thread contribute to all 60k counters? Or does each hit only a few scattered counters? For scattered writes, you might actually get decent performance by storing only one instance of the counters in device memory and having the threads update them with atomicAdd. Atomics are fast on Fermi and even faster on Kepler. If you have too many collisions for that to be a viable solution, you could store one set of counters per block in shared memory, with the threads in that block using shared-memory atomics to update them. Unfortunately, 60k counters will not fit in 48 KB of shared memory, so you would need to run multiple passes to collect all the results.
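A minimal sketch of both ideas, assuming the counters are 32-bit and that each thread processes one "event" that names the counter it increments (the kernel and parameter names are illustrative, not from the original post):

```cuda
// Single-instance approach: all threads share one counter array in
// device memory; the hardware serializes colliding atomicAdd updates.
__global__ void count_global(unsigned int *counters, const int *events, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        atomicAdd(&counters[events[idx]], 1u);
}

// Per-block variant: stage a slice of the counters in shared memory and
// merge it into device memory at the end. With 60k counters and 48 KB of
// shared memory, the data must be swept once per slice (multiple passes).
__global__ void count_shared(unsigned int *counters, const int *events, int n,
                             int sliceBegin, int sliceEnd)
{
    extern __shared__ unsigned int local[];   // sliceEnd - sliceBegin entries
    for (int i = threadIdx.x; i < sliceEnd - sliceBegin; i += blockDim.x)
        local[i] = 0;
    __syncthreads();

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        int c = events[idx];
        if (c >= sliceBegin && c < sliceEnd)   // only this pass's slice
            atomicAdd(&local[c - sliceBegin], 1u);
    }
    __syncthreads();

    // Merge this block's partial counts into the global counters.
    for (int i = threadIdx.x; i < sliceEnd - sliceBegin; i += blockDim.x)
        atomicAdd(&counters[sliceBegin + i], local[i]);
}
```

The shared-memory version trades extra passes over the input for cheaper atomics; which one wins depends on how often threads collide on the same counter.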
This is doing exactly what I wanted. I just needed to pass -arch compute_20 to nvcc and it worked without any problem. On my card (Quadro 1000M) the range is 0–3071. (Note: 3071 = (number of multiprocessors * max threads per multiprocessor) - 1.)
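For anyone finding this thread later: an identifier like this can be built from the special registers exposed through inline PTX. The following is only a sketch of one possible scheme (the exact code the poster used is not shown in the thread, and the 1536 threads-per-SM constant is Fermi-specific):

```cuda
// Combines the multiprocessor id (%smid), the warp's slot on that SM
// (%warpid), and the lane within the warp (%laneid) into an index in
// 0 .. nSM * maxThreadsPerSM - 1. On a 2-SM Fermi part such as the
// Quadro 1000M that is 2 * 1536 = 3072 slots, i.e. 0..3071.
// Caveat: %warpid identifies a hardware slot, not the logical warp, and
// PTX documents it as potentially changing over a thread's lifetime, so
// read it once and reuse the value.
__device__ unsigned int hw_thread_id()
{
    unsigned int smid, warpid, laneid;
    asm("mov.u32 %0, %%smid;"   : "=r"(smid));
    asm("mov.u32 %0, %%warpid;" : "=r"(warpid));
    asm("mov.u32 %0, %%laneid;" : "=r"(laneid));
    return smid * 1536u + warpid * 32u + laneid;  // 1536 = max threads/SM (Fermi)
}
```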
That is a great suggestion. I’m planning to divide the work, but I wanted a “simple” version to start with, as it will be easier to explain to the rest of the team. Once the team validates my work, I will start on optimizations like this one, as well as the ones described in the Best Practices Guide.
Sorry if my explanation wasn’t clear enough. It’s not easy to be very clear :-)
Not every thread will contribute to every counter. I had never heard of atomicAdd before. Do you know how it works internally? Is there some kind of CUDA mutex? I will make some measurements, then implement both versions and see which is faster. I will look into this function, thanks for the tip!