Migrating to Double Precision: Increased smem Footprint

I see that migrating to DP would not just mean changing "floats" to "doubles". It would also mean tuning your kernel or changing the algorithm to get the best performance on DP hardware.

For example, the shared memory footprint (if you are using a double array) would simply double in size, changing kernel occupancy. This can cause a lot of performance problems for many kernels.

Also, how is a double stored in shared memory? Is a double stored in 2 consecutive banks, or in just one bank? Will a smem load/store of a double result in 2 separate accesses to shared memory, or just one?

Does DP hardware have increased shared memory per multiprocessor?

Any inputs on how to design a CUDA kernel to work on both SP and DP efficiently?

Thanks in Advance,

Best Regards,
Sarnath

With double precision, you get the same amount of constant and shared memory, which effectively means you can only store half as many numbers.
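A minimal sketch of that effect (the kernel names and the 512-element buffer are made up for illustration): the identical declaration costs twice the bytes once the element type becomes double, so on a 16 KB-per-multiprocessor part only half as many such blocks can be resident.

__global__ void copySP(const float *in, float *out)
{
    __shared__ float buf[512];               // 512 * 4 B = 2 KB of smem per block
    buf[threadIdx.x] = in[threadIdx.x];
    out[threadIdx.x] = buf[threadIdx.x];
}

__global__ void copyDP(const double *in, double *out)
{
    __shared__ double buf[512];              // 512 * 8 B = 4 KB of smem per block
    buf[threadIdx.x] = in[threadIdx.x];
    out[threadIdx.x] = buf[threadIdx.x];
}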

Thanks…

but…

this is really bad news… All my kernels are going to get kicked you-know-where… CUDA occupancy is going to be affected badly… Hmm… One must run at least 384 (192*2) active threads to be good on SP as well as DP hardware without exposing latencies…

Hmm… Belated thinking from my side… Let others take this as a lesson…

Once upon a time, I used to advocate a warp-size number (32) of threads per block. But such kernels can be greatly affected by this DP thing… Probably the best config is to have 64 threads per block and 8 active blocks, to scale well on DP hardware as well. The manual also advocates 64 threads per block. And so did Mr. Anderson a long time ago. I should have listened…

I think that 32 threads is a bad idea mostly because it means you cannot have more than 256 threads per multiprocessor (max. 8 active blocks), which means at most 1/3 occupancy – which, from experience, I'd say is not enough to get good speed with memory-bandwidth-bound kernels.

Making 512 threads fit may be a bit of a pain (it means a maximum of 16 registers per thread) but is possible for many functions, at least with a bit of trickery.
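One hedged way to force that register budget (the kernel below is just a placeholder): cap registers at compile time with nvcc's -maxrregcount, or per kernel with __launch_bounds__ on toolkits that have it; anything beyond the cap spills to local memory.

// e.g.  nvcc -maxrregcount=16 kernel.cu
__global__ void __launch_bounds__(512) scaleKernel(float *data)
{
    // With <= 16 registers per thread, 512 threads fit in the 8192-register
    // file of a compute 1.x multiprocessor; extra pressure spills to lmem.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = data[i] * 2.0f;
}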

True… With 8 active blocks, you have less than 2 KB of smem per block… less than 512 floats (or ints) per block.

In certain cases, 32 threads have some hidden benefits – like no requirement for double-buffering (which helps reduce the smem footprint), no __syncthreads(), and so on. I have seen 112x speedups with just 32 threads per block. But with DP coming into the picture and the changed CUDA occupancy scenario, I would never recommend 32 threads to anyone. It is like shooting yourself in the foot.

IMHO, 256 threads is a good enough number to hide latencies! 192 threads are needed to hide register latencies; 256 active threads is good enough for global memory as well.

That is not quite right. That benefit comes from only 32 threads using the same shared memory; you can just merge two blocks of 32 threads into one block of 64 threads without changing the algorithm (you will probably need one or two additional registers, though), and there is no need for __syncthreads() then.

Of course this is pointless if your blocks already use more than 1/10th of the registers (i.e. about 25 per thread) or of the shared memory (ca. 1600 bytes).

32 threads do avoid races… For example, consider x[i] = x[i] + x[i+1].

If I do this operation on a shared memory array with 32 threads, I get the desired results.

If not, I will end up with a race. I need a double buffer to resolve it – like

y[i] = x[i] + x[i+1] and then swap pointers x and y.
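A hedged sketch of that double-buffered pattern for a 64-thread block (the buffer size, kernel name and wrap-around at the end are illustrative assumptions, not from the original code):

__global__ void shiftAddStep(float *data)
{
    __shared__ float bufA[64], bufB[64];
    float *x = bufA, *y = bufB;
    int i = threadIdx.x;                     // 64-thread block assumed

    x[i] = data[i];
    __syncthreads();                         // whole buffer must be loaded

    // Racy form:  x[i] = x[i] + x[i+1];  thread i could read x[i+1] after
    // the thread in the other warp has already overwritten it.
    // Double-buffered form: read only from x, write only to y.
    y[i] = x[i] + x[(i + 1) & 63];           // wrap-around is an assumption
    float *tmp = x; x = y; y = tmp;          // swap the pointers

    data[i] = x[i];                          // each thread reads its own element
}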

Because you are doing it wrong.

i = threadIdx.x & 31;
if (threadIdx.x < 32)
    x1[i] = x1[i] + x1[i + 1];
else
    x2[i] = x2[i] + x2[i + 1];

That is using the same algorithm. And yes, you probably should do it in a more intelligent way that does not need the "if", but you need neither a double buffer nor __syncthreads() like that.

If you just use x[i] = x[i] + x[i+1] for the 64-thread code, you are using a different algorithm.

No… I am doing it in the simplest and most elegant way possible. I would make 8 active blocks with 32 threads each – good enough to satisfy all latency conditions…

Even this is wrong. The race still exists… The 31st thread still accesses the 32nd element, which is the first element of the x2 array… There is still a race between warp 0 and warp 1. How will you avoid that?

The only way to avoid it is to do something like this:

local1 = x[i];
local2 = x[i + 1];
__syncthreads();
x[i] = local1 + local2;

at the cost of an extra __syncthreads()… Imagine that inside a for loop – you will be bleeding performance… Even the if statement inside the for loop (although your example is wrong) will bleed performance for large data-sets.

One intelligent thing we can do with 64 threads is to partition the data-set into two halves (instead of consecutive sub-sets) and apply the individual 32 threads to the individual halves. In this case, you will end up with only one race condition, occurring at the middle of the buffer. You can always cover that up…

And all this comes with some extra pain and code…
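A rough sketch of that partitioning idea (the names, the 64-element buffer, and the zero at the far boundary are assumptions): warp 0 updates the first half, warp 1 the second, and only the single element at the seam needs to be saved before it is overwritten. The intra-warp ordering here leans on the warp-synchronous assumption of that era.

__global__ void halvedShiftAdd(float *data)
{
    __shared__ float x[64];
    int lane = threadIdx.x & 31;             // lane within the warp
    int warp = threadIdx.x >> 5;             // warp 0 or warp 1
    int i    = threadIdx.x;                  // 64-thread block assumed

    x[i] = data[i];
    __syncthreads();                         // the other half must be loaded

    // The only cross-half dependency: lane 31 of warp 0 needs x[32],
    // which warp 1 is about to overwrite, so save it first.
    float seam = x[32];
    __syncthreads();

    // Each warp updates its own half. Within one warp, the old
    // warp-synchronous assumption orders the reads before the writes.
    float right = (lane == 31) ? (warp == 0 ? seam : 0.0f) : x[i + 1];
    x[i] = x[i] + right;

    data[i] = x[i];
}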

How did you avoid the race between block 0 and block 1 before?

To make it clear: the idea is that warp 0 does exactly what block 0 did before, and warp 1 does exactly what block 1 did before. If you have a race between warp 0 and warp 1, you had one between block 0 and block 1 (which means your code was wrong).

What I meant was: each block operates on a shared memory buffer in an x[i] = x[i] + x[i+1] fashion… Each block does some independent job. There is no relation between blocks. In such conditions, 32 threads might come in handy. Anyway, with DP around, it is better to stick with more threads.

Now I understand your point of view as well. Thanks.