Shared memory broadcast and doubles

I just received a T10P board in the mail today, and was playing with the native double support in one of our kernels. While looking at profiler output, I noticed that the shared memory broadcast mechanism only works for 32-bit types. When I flipped one of our shared arrays from float to double, suddenly the warp_serialized counter was non-zero.

(Reading through the programming guide section on shared memory, it clearly states that broadcast applies to 32-bit words. So, this is documented behavior even though it surprised me.)

Is broadcasting a double to an entire half-warp equivalent to a 16-way bank conflict?

Shared memory technically only supports 32-bit words, but broadcasting a double to an entire half-warp should be equivalent to broadcasting two floats, so there should be no bank conflicts.
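For example, a read pattern like this (just a sketch, the names and sizes are made up) should be conflict-free, since each 32-bit half of the 64-bit word is served by the broadcast mechanism:

__global__ void broadcastRead(const double *in, double *out)
{
    __shared__ double s[16];

    if (threadIdx.x == 0)   // one thread populates the shared value
        s[0] = in[0];
    __syncthreads();

    // Every thread of the half-warp reads the same 64-bit word.
    // Each of its two 32-bit halves is broadcast, so no warp
    // serialization is expected here.
    out[threadIdx.x] = s[0];
}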

Are you sure the warp-serialize is not occurring when you write the doubles into shared memory? If threads write sequential doubles to shared memory, you will incur 2-way bank conflicts.
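Something like the following store pattern (again just a sketch, not your code) would do it: each 64-bit store touches two consecutive 32-bit banks, so over the half-warp every bank is hit by two threads:

__global__ void storeDoubles(const double *in, double *out)
{
    __shared__ double s[64];

    // Thread k stores to s[k]. The 8-byte word spans banks 2k and 2k+1
    // (mod 16), so each bank is written by two threads of the half-warp:
    // a 2-way conflict, which shows up in the warp_serialized counter.
    s[threadIdx.x] = in[threadIdx.x];
    __syncthreads();

    out[threadIdx.x] = s[threadIdx.x];
}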

You could avoid this by splitting the high and low words of the doubles into separate arrays (much as you would with a struct with two floats, such as a complex type). However, this extra work may be more trouble than it is worth, since 2-way bank conflicts are only 2x slower than no conflicts…

Mark

This is a good point about the writes… And yes, bank conflicts are certainly nowhere close to the performance bottleneck in this kernel, but I was curious about the change.

Regarding splitting a double into two 32-bit words and rejoining them, does this require using the __double2hiint()/__double2loint() and __hiloint2double() functions? These functions aren't discussed in section B.2.2 of the Guide.

First, let me say this: I highly recommend that you not split the doubles in software you expect to still be running on hardware 12 to 24 months from now. Hardware in that timeframe may, and should, make the workaround unnecessary, at which point splitting would perform worse than not splitting.

That said, here’s a relevant snippet from our current DGEMM code that does something similar.

__shared__ int AA_lo[(TILE_DIM+1)*TILE_DIM];
__shared__ int BB_lo[(TILE_DIM+1)*TILE_DIM];
__shared__ int AA_hi[(TILE_DIM+1)*TILE_DIM];
__shared__ int BB_hi[(TILE_DIM+1)*TILE_DIM];

// split each double into its 32-bit halves on the way into shared memory
AA_lo[idxAA] = __double2loint(parms.A[addrA]);
AA_hi[idxAA] = __double2hiint(parms.A[addrA]);

// inner-product loop: reassemble the doubles and accumulate with fused multiply-add
do {
    dp = fma(__hiloint2double(AA_hi[li], AA_lo[li]),
             __hiloint2double(BB_hi[lj], BB_lo[lj]),
             dp);
    li++;
    lj++;
    ll--;
} while (ll);
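And to answer your question directly: yes, those are the intrinsics. They only move bits around, so the round trip is exact; for example (not from the DGEMM code, just an illustration):

__device__ double splitAndRejoin(double x)
{
    int lo = __double2loint(x);        // low  32 bits of x
    int hi = __double2hiint(x);        // high 32 bits of x
    return __hiloint2double(hi, lo);   // bit-identical to x
}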

OK, that makes sense. Double precision sincos() takes all the time in this kernel anyway, so I won’t fret about shared memory.

Thanks!