How many threads are physically running on an SM at any given moment?

Hi All,

Let’s assume that we have to build a histogram, like the 64-bin one in the SDK example. The manual states that in this case each thread has to build its own histogram to avoid shared memory access collisions, which sounds reasonable. So the amount of shared memory needed is (number of bins) x (number of threads).
But let’s assume we have W active warps on one SM. Of these W warps, only one is executed at any given moment while the others stay idle. Does this mean that the amount of shared memory can be decreased to (number of bins) x 32, or even (number of bins) x 16, since each warp is processed in halves? Let’s forget about bank conflicts for now…
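To put numbers on the three layouts, here is a minimal sketch that computes the shared memory each one would need and prints it next to what the device reports. The helper names `histogram_smem_bytes` and `print_histogram_budget` are made up for illustration, and 4-byte bins are an assumption.

```cuda
#include <cstdio>
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical helper: bytes of shared memory needed for `copies` private
// histograms of `bins` bins each, assuming 4-byte (float) bins.
constexpr size_t histogram_smem_bytes(size_t bins, size_t copies)
{
    return bins * copies * sizeof(float);
}

// Prints the three layouts from the question next to the per-block shared
// memory that device 0 reports. Returns false if the device cannot be queried.
bool print_histogram_budget(size_t bins, size_t threads_per_block)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess)
        return false;

    std::printf("one copy per thread    : %zu bytes\n",
                histogram_smem_bytes(bins, threads_per_block));
    std::printf("one copy per warp lane : %zu bytes\n",
                histogram_smem_bytes(bins, 32));
    std::printf("one copy per half-warp : %zu bytes\n",
                histogram_smem_bytes(bins, 16));
    std::printf("available per block    : %zu bytes\n",
                prop.sharedMemPerBlock);
    return true;
}
```

For example, `print_histogram_budget(64, 192)` compares 64 bins at an arbitrarily chosen block size of 192 threads. The numbers only show what would be saved; whether sharing a copy between warps is actually safe is a separate question.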

Just to clarify things:
The only goal is to avoid shared memory collisions (when different threads write to the same address at the same moment).

The number of bins is potentially much larger than the number of warps, which is why I can’t afford to have each thread build its own histogram: there isn’t enough shared memory. (More bins decrease the collision probability, but it is still not zero.)

I have to operate on floats (each bin accumulates floats, not integers).

I appreciate your suggestions

Indeed, there should be only one warp per block physically running at any moment.
However, there’s no way to prevent time slicing from happening between ld.shared and st.shared. It seems time slicing always happens after each instruction.

Yep, I see that

asadafaq is correct about time-slicing being possible between smem read and write. While time-slicing isn’t guaranteed to happen after each instruction, it can and does happen.
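Because a warp can be switched out between the shared memory load and the store, a plain `s_hist[bin] += w` can lose updates when several warps share one histogram. One way around this is a single atomic read-modify-write. This is only a sketch and assumes hardware that supports `atomicAdd` on `float` in shared memory (compute capability 2.0 or later), which may not match the hardware discussed in this thread. `NUM_BINS`, `histogram_atomic`, `histogram_on_device`, and the launch configuration are illustrative names and values, not taken from the SDK sample.

```cuda
#include <cassert>
#include <cuda_runtime.h>

#define NUM_BINS 64  // assumed bin count, matching the 64-bin example

// One float histogram per block in shared memory. Every bin_index[i] is
// assumed to lie in [0, NUM_BINS).
__global__ void histogram_atomic(const int *bin_index, const float *weight,
                                 int n, float *global_hist)
{
    __shared__ float s_hist[NUM_BINS];

    // Zero the block's histogram cooperatively.
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        s_hist[b] = 0.0f;
    __syncthreads();

    // Grid-stride loop over the input.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) {
        // Racy alternative: s_hist[bin_index[i]] += weight[i];
        // The atomic makes the load, add, and store indivisible.
        atomicAdd(&s_hist[bin_index[i]], weight[i]);
    }
    __syncthreads();

    // Merge this block's partial result into the global histogram.
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        atomicAdd(&global_hist[b], s_hist[b]);
}

// Hypothetical host wrapper: copies inputs to the device, runs the kernel,
// and copies the NUM_BINS results back. Returns false on any CUDA error.
bool histogram_on_device(const int *h_bin, const float *h_w, int n,
                         float *h_hist)
{
    assert(h_bin && h_w && h_hist && n > 0);
    int *d_bin = nullptr;
    float *d_w = nullptr, *d_hist = nullptr;
    bool ok =
        cudaMalloc(&d_bin, n * sizeof(int)) == cudaSuccess &&
        cudaMalloc(&d_w, n * sizeof(float)) == cudaSuccess &&
        cudaMalloc(&d_hist, NUM_BINS * sizeof(float)) == cudaSuccess &&
        cudaMemcpy(d_bin, h_bin, n * sizeof(int),
                   cudaMemcpyHostToDevice) == cudaSuccess &&
        cudaMemcpy(d_w, h_w, n * sizeof(float),
                   cudaMemcpyHostToDevice) == cudaSuccess &&
        cudaMemset(d_hist, 0, NUM_BINS * sizeof(float)) == cudaSuccess;
    if (ok) {
        histogram_atomic<<<4, 128>>>(d_bin, d_w, n, d_hist);  // arbitrary launch size
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(h_hist, d_hist, NUM_BINS * sizeof(float),
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_bin);
    cudaFree(d_w);
    cudaFree(d_hist);
    return ok;
}
```

This needs only (number of bins) floats of shared memory per block, at the cost of serializing colliding updates. Note that float addition is not associative, so the totals can differ slightly from run to run depending on the order in which the atomics land.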