Best way to pack bits into words for global memory: better than a reduction in shared memory?

Each thread is creating a one bit answer.
At present I am trying to use code based on Harris’
which uses shared memory. My code uses 5 OR ( |= ) operations
to pack 32 bits into a word and then 1 thread in 32 writes the
unsigned int to global memory.

Seems like the sort of thing that lots of people have done before.
Is there a better way?
Does anyone have a working example they could share?

Many thanks

ps: Harris’ code reads shared memory up to 16 past the current block
Is this ok?
Will the hardware complain if that is past the end of shared memory?

    Dr. W. B. Langdon,
    Department of Computer Science,
    University College London
    Gower Street, London WC1E 6BT, UK


If you are using Fermi or later, look up the __ballot() function in the CUDA Programming Guide, Section B.12. This lets you quickly combine a bit from each thread in a warp without shared memory, and then you can follow your strategy of having the first thread in the warp write to global memory. The global memory write might take long enough that this doesn’t help, but it would certainly shorten your code.

Dear Stan,

     Thank you for your fast and helpful reply.



In case anyone else is interested, here is a fragment of my reduce code

__device__ void reduce_pack32(unsigned int *d_Output,
                              const unsigned int thread_value) {
  volatile unsigned int *sdata = (unsigned int*) shared_array;
  const unsigned int tid = threadIdx.x;
  sdata[tid] = thread_value;
  // __syncthreads() not needed as we operate wholly within a single warp
  sdata[tid] |= sdata[tid + 16];
  sdata[tid] |= sdata[tid +  8];
  sdata[tid] |= sdata[tid +  4];
  sdata[tid] |= sdata[tid +  2];
  sdata[tid] |= sdata[tid +  1];
  // one thread per warp writes its 32 packed bits to global memory
  if((tid & 31) == 0) *d_Output = sdata[tid];
}


Calling code

  const int lid = threadIdx.x & 31;
  // pack and write 32 results to global memory
  reduce_pack32(d_Output, bit << lid);

The use of volatile is essential.

With compute capability 2.0 or higher, this can be replaced by

  const int lid = threadIdx.x & 31;
  const unsigned int packed = __ballot(bit << lid); // requires compute 2.0
  if(lid == 0) *d_Output = packed;

However I’m currently having problems compiling __ballot with nvcc -arch sm_20 (see next post?)

As a minor nitpick you can drop the <<lid inside the __ballot(), since it just tests its argument for != 0.
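In case a self-contained illustration helps, here is a sketch (the kernel name, d_In flag array and launch shape are all invented) of why the shift can be dropped: __ballot() only tests each thread's argument for non-zero, and it already places each thread's bit at its lane position.

```cuda
// Sketch (CC >= 2.0): pack one bit per thread into words with __ballot().
// d_In, d_Out and the kernel name are hypothetical; one bit per thread,
// one packed word per warp.
__global__ void pack_bits(unsigned int *d_Out, const unsigned int *d_In)
{
  const unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
  const int lid = threadIdx.x & 31;        // lane within the warp
  const int bit = (d_In[tid] != 0);        // each thread's one-bit answer
  // __ballot() sets bit `lid` of the result for every lane whose
  // predicate is non-zero, so no << lid shift is needed.
  const unsigned int packed = __ballot(bit);
  if (lid == 0) d_Out[tid / 32] = packed;  // 1 thread in 32 writes the word
}
```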

Thank you for reporting this.


This is a slightly different question, but it is sort of related so I thought I would
continue this topic rather than start a new one. I hope this is ok.

I am now doing another reduction which involves counting the number of flags set
for an odd number of threads. (Actually 20 and 71). The calculating threads
are not (at present) placed on warp boundaries but packed tightly together,
to give 1001 per block.

At present the counting uses a reduction sum in shared memory,
happily giving 22 totals per block, pairs of which are then combined
(each has its own weighting) and the 11 weighted totals are written to global memory.
The reduction uses __syncthreads(), which forces the whole block to synchronise.

Given that the flags only span 2 or 3 warps, can anyone recommend an alternative
to __syncthreads() which only synchronises the affected warps? (Atomics??)
Also, do you think I am paying a big penalty by forcing all 11 x (20+71) calculations
to complete before I can write any answers to global memory?

Many thanks
ps: do you think I should have started another topic in the forum?

You can sync just part of your warps (assuming you can figure out how many warps will be involved) from PTX assembly using the optional second argument to bar.sync.

Although it doesn’t seem to exactly match your purpose, you may also have a look at whether you can put the new __syncthreads_count() function to good use.
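In case it is useful, here is a sketch of what that might look like (kernel and buffer names are invented; requires CC 2.0, and note that it still synchronises the whole block, so it does not avoid the full-block barrier):

```cuda
// Sketch: __syncthreads_count(pred) acts as a block-wide barrier and
// returns, to every thread, the number of threads in the block whose
// predicate was non-zero -- a barrier and a population count in one step.
__global__ void count_flags(unsigned int *d_Count, const unsigned int *d_Flags)
{
  const unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
  const int set = (d_Flags[tid] != 0);            // this thread's flag
  const int total = __syncthreads_count(set);     // barrier + count
  if (threadIdx.x == 0) d_Count[blockIdx.x] = total;
}
```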

(Both of these require CC 2.0 or higher, as __ballot() does already)

Dear Tera, Thank you.
(I need to write PTX to use bar.sync?
Or can I drop into PTX from CUDA?)

Any thoughts on how much syncing all threads in the block might cost?


You can use inline PTX assembly from CUDA C. In the simplest case, where e.g. a fixed number of three warps participate in the sync, it would be something like

asm volatile("bar.sync    1, 96;");
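Wrapped in a small helper this might look like the following (a sketch for the three-warp case; the barrier number 1 and the 96-thread count are just the example values, and only the threads of warps 0-2 may call it):

```cuda
// Sketch: synchronise only the first three warps of a block using a
// named barrier. bar.sync ID, COUNT blocks until COUNT threads have
// arrived at barrier ID; COUNT must be a multiple of the warp size (32).
// Threads outside warps 0-2 must NOT call this, or the count never fills.
__device__ __forceinline__ void sync_first_three_warps()
{
  asm volatile("bar.sync 1, 96;" ::: "memory");  // barrier 1, 3 * 32 threads
}
```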

The main cost of __syncthreads() appears to be in a ~40 cycle latency. Syncing only parts of a block is particularly effective if that prevents the SM from running (partially) idle (e.g. if only one block fits per SM).

Great. Thank you very much for your helpful reply
(inc pointer, which I missed on my first reading).

If I put things on warp boundaries (ie use 91 threads out of 96)
I guess I could simply reduce the block size to 96?
I think I am getting some benefits from lots of threads per block
as some of the data reads should coincide, so if read by one warp,
others in the block should find their data is already in the cache.
I guess I cannot use grid dimensions to force or encourage nearby
3-warp blocks to be run on the same multi-processor (at nearly the same time)?
Is there any other downside to reducing the block size?
Or perhaps I am wrong and the downside is not big?

I am guessing that in the current wide-block (1001 threads) scheme, the most likely
reason for __syncthreads() delaying warps, and so making the multiprocessor idle,
is that the further apart my 1001 threads are, the more often they read different data
which has not yet arrived.

“The main cost of __syncthreads() appears to be in a ~40 cycle latency.”
–ahha, I have been assuming __syncthreads() does not cost much (computation),
I seem to remember reading it was only a few (4?) clock cycles before sm_20.

Once again many thanks

I’d also think that 96 threads per block is too small, particularly as even numbers of warps per block are preferred (even multiples of 4 for Kepler devices, apparently). And I don’t know of any way to force adjacent blocks onto the same multiprocessor either. I don’t see any downside to a large block size, so I’d also recommend going with that.

Whether it is better to align groups of 91 threads on warp boundaries or not is difficult to predict. It seems like you need to benchmark both versions to decide.

Note that __syncthreads() also creates a boundary beyond which the compiler cannot move reads to mitigate the effects of latency. This may be another reason why they are detrimental to performance. Thread-level parallelism (i.e. switching to different warps while waiting for the data to arrive) alone is not sufficient to hide global memory latency.

The Programming Guide in general only mentions latency in very few places (unfortunately). This is not limited to __syncthreads(). Almost all instructions have a latency of ~20 cycles where the Programming Guide only mentions a throughput of one/two (warp-wide) instructions every 1/2/4 cycles. Shared memory has ~30 cycles latency. So __syncthreads() isn’t that much different from other instructions. Although of course it would be nice if this were documented better.

The main performance impact when using __syncthreads() is that it limits parallelism. Fewer and fewer warps are active, until just before the barrier releases there is only a single active warp left in the thread block. This means you would want to run at least two thread blocks per multiprocessor to mitigate this effect, so that other thread blocks “pick up the slack” caused by a barrier in one of the thread blocks.

I just checked and the PTX documentation doesn’t say anything about bar.sync behavior in the case that the first (barrier number) argument doesn’t evaluate identically among a warp. So I guess if you want to divide your block into subblocks that each synchronize at the same time (using a different barrier number), but not between the whole block, you are pretty much forced to align subgroups on warp boundaries. (Would be surprising anyway if Nvidia had invested the silicon needed to overcome this limitation.)

If however you want a certain subgroup to sync while the rest of the threads stay unsynchronized, warp alignment wouldn’t be required as additional synchronization doesn’t harm (barring potential deadlocks).

Dear Njuffa,

       Thank you for your reply.


So I guess a block size of 1001 threads is a mistake?

Compiling with --ptxas-options=-v says “ptxas info : Used 24 registers”,
so I guess there are only 32k - 24*1001 registers (on a C2050) left over and
this will not be enough to run a second block on the same multiprocessor?

Would two blocks be enough? I guess the advantage of three is marginal?
(If I went for only 96 threads (3 warps) there would be enough registers for 14,
but before that I would hit the 1024 thread limit?)

Thanks again


Some more info. As an experiment I removed all __syncthreads()
(I expect that the kernel will now calculate some wrong answers).
On a C2050 the kernel took 49.764 ms less (on 477 102 080 additions).
This saved 8% on the whole kernel.

I’d say you should go for big blocks (1001 threads) only if you get the PTX based partial synchronization to work. Otherwise, use small blocks to allow more than one running per SM.

8% overhead for synchronization seems like a plausible number.

Of course, unrelated considerations might account for more than 8% difference in speed, so that in the end you choose a different blocksize.

On sm_20 (which is what the C2050 uses) there is a hard limit of 1536 threads per multiprocessor. So two thread blocks of 1001 threads cannot be scheduled onto the same multiprocessor simultaneously, regardless of other resource constraints. For reasonable granularity that allows code to achieve high occupancy, I generally recommend thread block sizes in the range of 128 to 256 threads for Fermi parts. Application driven limitations, as well as resource constraints, in particular register and shared memory use, will usually push the choice to either the high or the low end of that desirable range.

Please note that the hardware imposes fairly coarse granularity on the register allocation for thread blocks, thus in many cases the actual number of registers used will be higher than a simple multiplication of registers per threads by threads per block would suggest. The occupancy calculator that ships with CUDA incorporates these granularity rules and gives an accurate prediction of occupancy.

Dear Tera, What do you think of the idea of replacing __syncthreads()
by doing the reduction summation only within each warp and using atomics
to update (cross-warp) totals held in shared memory? In addition to the
totals, I think I would need a count of the number of threads which have
contributed to the totals. The first thread to increase this to 91 would
complete the calculation, write the final answer to global memory and
clean up (ie zero the totals ready for next time).

Hmm, having thought of this, I’m not sure it could be made watertight, eg
what if the next calculation starts using a total before the previous one
has zeroed it?
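For what it is worth, here is a minimal sketch of the atomic scheme described above (all names invented; one group of 91 contributing threads per block assumed). It sidesteps the zeroing race by initialising the shared counters at the start of each kernel launch rather than reusing them within a launch, at the price of one __syncthreads() to publish the zeros:

```cuda
#define CONTRIBUTORS 91  // 20 + 71 flag-computing threads per group

// Sketch: count set flags with shared-memory atomics instead of a
// reduction tree. atomicAdd() returns the old value, so exactly one
// thread observes arrived == CONTRIBUTORS - 1 and knows it is last.
__global__ void count_flags_atomic(unsigned int *d_Out,
                                   const unsigned int *d_Flags)
{
  __shared__ unsigned int total;    // number of flags set
  __shared__ unsigned int arrived;  // number of threads that contributed
  if (threadIdx.x == 0) { total = 0; arrived = 0; }
  __syncthreads();                  // publish the zeroed counters

  if (threadIdx.x < CONTRIBUTORS) {
    const unsigned int flag =
        (d_Flags[blockIdx.x * CONTRIBUTORS + threadIdx.x] != 0);
    if (flag) atomicAdd(&total, 1u);
    __threadfence_block();          // order the total update before counting in
    // The last thread to arrive writes the final answer to global memory.
    if (atomicAdd(&arrived, 1u) == CONTRIBUTORS - 1)
      d_Out[blockIdx.x] = total;
  }
}
```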

Thank you very much