Each thread is creating a one bit answer.
At present I am trying to use code based on Harris’ reduction_kernel.cu
which uses shared memory. My code uses 5 union ( |= ) operations
to pack 32 bits into a word and then 1 thread in 32 writes the
unsigned int to global memory.
Seems like the sort of thing that lots of people have done before.
Is there a better way?
Does anyone have a working example they could share?
ps: Harris’ code reads shared memory up to 16 past the current block
Is this ok?
Will the hardware complain if that is past the end of shared memory?
Dr. W. B. Langdon,
Department of Computer Science,
University College London
Gower Street, London WC1E 6BT, UK
If you are using Fermi or later, lookup the __ballot() method in the CUDA Programming Guide, Section B.12. This lets you quickly combine a bit from each thread in a warp without shared memory, and then you can follow your strategy of having the first thread in the warp write to global memory. The global memory write might take long enough that this doesn’t help, but it would certainly shorten your code.
This is a slightly different question. But is it sorta related so I thought I would
continue this topic rather than start a new one. I hope this is ok.
I am now doing another reduction which involves counting the number of flags set
for an odd number of threads. (Actually 20 and 71). The calculating threads
are not (at present) placed on warp boundaries but packed tightly together,
to give 1001 per block.
At present the counting uses a reduction sum in shared memory,
happily giving 22 totals per block, pairs of which are then combined
(each has its own weighting) and the 11 weighted totals are written to global memory.
The reduction uses __syncthreads(), which forces the whole block to syncronise.
Given that the flags only span 2 or 3 warps, can anyone recommend an alternative
to __syncthreads() which only syncronises the affected warps. (Atomics??)
Also do you think I am paying a big penalty by forcing all 11 x (20+71) calculations
to complete before I can write any answers to global memory.
ps: do you think I should have started another topic in the forum?
The main cost of __syncthreads() appears to be in a ~40 cycle latency. Syncing only parts of a block is particularly effective if that prevents the SM from running (partially) idle (e.g. if only one block fits per SM).
Great. Thank you very much for your helpful reply
(inc pointer, which I missed on my first reading).
If I put things on warp boundaries (ie use 91 threads out 96)
I guess I could simply reduce the block size to 96?
I think I am getting some benefits from lots of threads per block
as some of the data reads should coincide, so if read by one warp,
others in the block should find their data is already in the cache.
I guess I cannot use grid dimensions to force or encourage nearby
3-warp blocks to be run on the same multi-processor (at nearly the same time)?
Is there any other downside to reducing the block size?
Or perhaps I am wrong and the downside is not big?
I am guessing that in the current wide block (1001 threads) scheme, the most likely
reason for __syncthreads() delaying warps and so making multi-processor idle
is because the further apart my 1001 threads are the more often they read different data
which has not yet arrived.
“The main cost of __syncthreads() appears to be in a ~40 cycle latency.”
–ahha, I have been assuming __syncthreads() does not cost much (computation),
I seem to remember reading it was only a few (4?) clock cycles before sm_20.
I’d also think that 96 threads per block is too small, particularly as even numbers of warps per block are preferred (even multiples of 4 for Kepler devices apparently). And I don’t know any way to force adjacent blocks onto the same multiprocessor either. I don’t see any downside to a large blocksize, so I’d also recommend to go with that.
As to whether it is better to align groups of 91 threads on warp boundaries or not is difficult to predict. Seems like you need to benchmark both versions to decide.
Note that __syncthreads() also creates a boundary beyond which the compiler cannot move reads to mitigate the effects of latency. This may be another reason why they are detrimental to performance. Thread-level parallelism (i.e. switching to different warps while waiting for the data to arrive) alone is not sufficient to hide global memory latency.
The Programming Guide in general only mentions latency in very few places (unfortunately). This is not limited to __syncthreads(). Almost all instructions have a latency of ~20 cycles where the Programming Guide only mentions a throughput of one/two (warp-wide) instructions every 1/2/4 cycles. Shared memory has ~30 cycles latency. So __syncthreads() isn’t that much different from other instructions. Although of course if would be nice if this were documented better.
The main performance impact when using __syncthreads() is that it limits parallelism. Fewer and fewer warps are active, until just before the barrier releases there is only a single active warp left in the thread block. This means you would want to run at least two thread blocks per multiprocessor to mitigate this effect, so that other thread blocks “pick up the slack” caused by a barrier in one of the thread blocks.
I just checked and the PTX documentation doesn’t say anything about bar.sync behavior in the case that the first (barrier number) argument doesn’t evaluate identically among a warp. So I guess if you want to divide your block into subblocks that each synchronize at the same time (using a different barrier number), but not between the whole block, you are pretty much forced to align subgroups on warp boundaries. (Would be surprising anyway if Nvidia had invested the silicon needed to overcome this limitation.)
If however you want a certain subgroup to sync while the rest of the threads stay unsynchronized, warp alignment wouldn’t be required as additional synchronization doesn’t harm (barring potential deadlocks).
Some more info. As an experiment I removed all __syncthreads()
(I expect that the kernel will now calculate some wrong answers).
On a C2050 the kernel took 49.764ms less (on 477 102 080 additions).
This save 8% on the whole kernel.
On sm_20 (which is what the C2050 uses) there is a hard limit of 1536 threads per multiprocessor. So two thread blocks of 1001 threads cannot be scheduled onto the same multiprocessor simultaneously, regarddless of other resource constraints. For reasonable granularity that allows code to achieve high occupancy, I generally recommend thread block sizes in the range of 128 to 256 threads for Fermi parts. Application driven limitations, as well as resource constraints, in particular register and shared memory use, will usually push the choice to either the high or the low end of that desirable range.
Please note that the hardware imposes fairly coarse granularity on the register allocation for thread blocks, thus in many cases the actual number of registers used will be higher than a simple multiplication of registers per threads by threads per block would suggest. The occupancy calculator that ships with CUDA incorporates these granularity rules and gives an accurate prediction of occupancy.