Implicit Warp Synchronization prevents hiding of memory latency

Hello there,

Suppose we’re dealing with vectors of 32 elements (i.e. 1 warp). Then there is implicit synchronization within the warp, which avoids the need for the syncthreads() function.

For example:

[indent]r = smem[tid];

// __syncthreads();

smem[tid] = r;[/indent]

Here we can avoid the syncthreads() call entirely if tid < 32.

So, warp synchronization mimics syncthreads() behaviour.
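
To make that concrete, here is a minimal sketch of the kind of kernel I mean (the kernel name and setup are made up for illustration, not my actual benchmark code):

[indent][font=“Courier New”]
// Sketch only: one block of 32 threads = a single warp, so on this hardware
// the barrier between the smem read and the smem write can be dropped.
__global__ void warp_copy(float *g_data)
{
    // volatile so the compiler really issues the shared-memory traffic
    __shared__ volatile float smem[32];
    int tid = threadIdx.x;

    smem[tid] = g_data[tid];   // fill shared memory

    float r = smem[tid];       // smem -> register
    // __syncthreads();        // not needed: tid < 32, all threads in one warp
    smem[tid] = r;             // register -> smem

    g_data[tid] = smem[tid];   // write back so the work isn't optimized away
}

// launched as: warp_copy<<<1, 32>>>(d_data);
[/font][/indent]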

Therefore, if we coded the left side (below), the right side would be equivalent:

[image: side-by-side code, without syncthreads() on the left and with syncthreads() on the right]

BUT the right side shows that memory latency is not hidden!

(i.e. syncthreads forces read-after-write latencies where they are not needed)

I measured the above code (except I used tid < 64, i.e. 2 warps) and found that the latency does not get hidden.

(I also checked in decuda to make sure there weren’t additional instructions sneaking in.)

One workaround is to run multiple blocks per MP (I’ve tested it and it works; this is also what Volkov did in his paper and code).

…however, multiple blocks per MP is not viable for my algorithm :(
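
(For reference, by “multiple blocks per MP” I just mean a launch along these lines; the kernel, SM count, and blocks-per-SM numbers are illustrative, not tuned:)

[indent][font=“Courier New”]
// Illustrative only: a trivial kernel plus a launch with enough small blocks
// that each multiprocessor holds several of them, so warps from different
// blocks can interleave and hide each other's latencies.
__global__ void touch(float *g_data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    g_data[i] += 1.0f;
}

void launch_example(float *d_data)   // d_data must hold grid.x * block.x floats
{
    int numSMs      = 30;   // e.g. a GT200 has 30 SMs; adjust for your card
    int blocksPerSM = 4;    // aim for more than one resident block per SM

    dim3 grid(numSMs * blocksPerSM);
    dim3 block(64);         // small blocks, as in my case

    touch<<<grid, block>>>(d_data);
}
[/font][/indent]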

Does anyone know any other workarounds, or have additional info/comments?

Ideally I would like to prevent the damn implicit synchronization.

Your example isn’t well chosen for a couple of reasons.

First, shared memory has no latency. Potential bank conflicts can cause some serialization, yes, but there’s no latency.
Your example would be more interesting with device memory access.

Second, your example doesn’t particularly need syncthreads() at all. If any particular thread accesses a location but no other thread does, there’s no need to sync at any time even if you do multiple reads or writes.
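
For contrast, here is the kind of pattern that really does need the barrier, because each thread reads an element that a different thread wrote (just a sketch; the name and sizes are made up):

[indent][font=“Courier New”]
// Each thread writes its own element, then reads a *different* thread's
// element, so the block-wide barrier is required before the read.
__global__ void reverse_in_block(float *g_data)
{
    __shared__ float smem[128];   // assumes blockDim.x == 128
    int tid = threadIdx.x;

    smem[tid] = g_data[tid];
    __syncthreads();              // make every write visible to every thread

    g_data[tid] = smem[blockDim.x - 1 - tid];   // read another thread's element
}
[/font][/indent]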

Now you didn’t mention what you were measuring or how… but if you noticed your syncthreads() kernel is slower, that’s not surprising. The syncthreads causes your active thread count to drop, leaving just a few threads alive… the rest are all waiting. If your number of active threads per SM (not per block) drops below 192, you can have (usually minor) idle clocks and less efficiency.

But in your kernel above, even if you have plenty of active threads, the syncthreads() will cause overhead. The read and write of shared memory is a one-clock op, and syncthreads() is two clocks if I recall correctly, so even with full thread counts and multiple blocks your throughput will be lower than the non-syncthreads version. Usually syncthreads() is used sparingly and a couple of clocks is trivial. If you’re using it after every simple assignment, though, that overhead is noticeable.

If you have a specific piece of code you’d like to reorganize, it may be better to post it. Your current example doesn’t need syncthreads() at all even for larger tids, since there’s no syncing actually needed.

The best code to look at to understand the careful dance of syncthreads() is the great reduction code in the SDK, and Mark Harris’s terrific PDF docs for it (but read the code too!).
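
From memory, the pattern in that reduction looks roughly like this (a simplified sketch, not the actual SDK source; assumes blockSize >= 64):

[indent][font=“Courier New”]
// Sketch of the SDK-style reduction: syncthreads() guards every step where
// threads read other threads' partial sums; the last steps fit inside one
// warp, so the barrier is dropped and a volatile pointer is used instead.
template <unsigned int blockSize>
__global__ void reduce_sketch(const float *g_in, float *g_out)
{
    __shared__ float sdata[blockSize];
    unsigned int tid = threadIdx.x;

    sdata[tid] = g_in[blockIdx.x * blockSize + tid];
    __syncthreads();

    for (unsigned int s = blockSize / 2; s > 32; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();          // cross-warp communication needs the barrier
    }

    if (tid < 32) {               // one warp left: warp-synchronous, no barrier
        volatile float *v = sdata;
        v[tid] += v[tid + 32];
        v[tid] += v[tid + 16];
        v[tid] += v[tid + 8];
        v[tid] += v[tid + 4];
        v[tid] += v[tid + 2];
        v[tid] += v[tid + 1];
    }

    if (tid == 0)
        g_out[blockIdx.x] = sdata[0];
}

// e.g. reduce_sketch<128><<<numBlocks, 128>>>(d_in, d_out);
[/font][/indent]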

Thanks for the reply!

No latency for shared memory? In Volkov’s benchmarking paper, he finds a 36-clock-cycle latency to go from smem to a register.
Here’s an excerpt:
[image: excerpt from Volkov’s benchmarking paper]

Furthermore, if we’re just looking at registers, there is always a read-after-write latency of ~22-24 clock cycles,
which is why it’s recommended to use a block size of 192 (i.e. 6 warps; instructions take 4 cc each, so 6 × 4 = 24 cc) to hide this latency.

Right?

In my example I’m transferring data from shared memory to registers and then back again.
The goal is to achieve peak bandwidth throughput (i.e. 4 cc per warp, or 32 bytes/cc) using small thread blocks,
thus I must hide all the latencies.

The problem is that the implicit synchronization is being applied where it’s not needed.

Shared Memory is as fast as Registers - NVIDIA prog guide

With a bank conflict, the worst case could be (my guess) 32 cycles (31 threads reading from the same bank, 1 thread from another bank).

Thanks for the reply.

I think the worst case is a 16-way bank conflict: smem transfers are handled in half-warps (so 16 elements).

The bandwidth drops by a factor equal to the number of separate memory requests (so I think 16 times longer).

BUT I do not have bank conflicts or misaligned data transfers,

so this is not applicable to my little example.

Yes, that is what it says, but I remember reading cases where this is not always true.

For instance, if you do a MAD instruction with all registers as operands, it takes 4 cc;

if you do a MAD instruction with 1 operand from smem, it takes 6 cc.

But let’s forget about shared memory latency for a moment; there is still read-after-write latency. This is at the register level: think of it as the latency that is being exposed by the warp synchronization.

This is correct.

Current NVIDIA GPUs cannot overlap multiple requests to shared memory from the same warp. They hide the latency by interleaving execution between warps instead (from the same or another block).

So 64 threads or fewer per multiprocessor is not enough to cover the various execution latencies, which include smem latency. Hence the recommendation to run at least 192 threads per SM.

Thank you so much Sylvain, you always seem to come through on the tough ones.

I do, however, have an issue with part of your statement:

If this were the case, shouldn’t I theoretically be able to hide the latency using block_size = 64 by issuing “enough” independent instructions?

i.e.

r1 = smem1[tid];

r2 = smem2[tid];

r3 = smem3[tid];

r4 = smem4[tid];

I’ve found (empirically) that I couldn’t hide the latency using warps from other blocks… (however, warps within the same block work fine).

It’s like warp synchronization is being applied at the thread-block level (i.e. like the picture in my first post, but for tid > 32).

My timing results:

The percentage ‘%’ is w.r.t. theoretical bandwidth.

The number of transfers indicates how many independent transfers are issued;

i.e. 2 transfers is:

r1 = smem1[tid];

r2 = smem2[tid];

smem1[tid] = r1;

smem2[tid] = r2;
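
For completeness, the 2-transfer case as a whole kernel looks roughly like this (a sketch of the shape, not my exact benchmark; the loop count, array sizes, and dummy write are just there to make it self-contained):

[indent][font=“Courier New”]
// Sketch of the 2-transfer case: two independent smem -> reg loads followed
// by two independent reg -> smem stores, repeated many times. volatile keeps
// the compiler from caching the values in registers across iterations.
__global__ void two_transfers(float *g_dummy, int iters)
{
    __shared__ volatile float smem1[160];   // 160 covers the largest block tested
    __shared__ volatile float smem2[160];
    int tid = threadIdx.x;

    smem1[tid] = (float)tid;
    smem2[tid] = (float)tid;

    for (int i = 0; i < iters; ++i) {
        float r1 = smem1[tid];   // independent load 1
        float r2 = smem2[tid];   // independent load 2

        smem1[tid] = r1;         // independent store 1
        smem2[tid] = r2;         // independent store 2
    }

    if (tid == 0)
        g_dummy[blockIdx.x] = smem1[0] + smem2[0];   // keep the result live
}
[/font][/indent]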

The theoretical values assume a 21 cc latency (just because it fits my results nicely) and are computed for the 1-transfer case (i.e. if block size = 64, it takes 8 cc to move 2 warps, and 8/21 → 38%).

[indent][font=“Courier New”]
block size   theoretical   1 transfer   2 transfers   3 transfers
 64          38%           38%          44%           44%
 96          57%           56%          63%           66%
128          76%           75%          75%           75%
160          95%           94%          ...
[/font][/indent]

Despite having 2 or 3 independent memory transfers, they don’t seem to help hide the latency! It’s as if there is a syncthreads() after each instruction (like the picture in my first post, but for tid > 32).

Should I be seeing this kind of behaviour, or am I probably doing something wrong? If anyone has suggestions or comments, I would be happy to hear from you :)