Double/transcendental architecture behavior

Please help me understand how the SM 1.3 double hardware actually works.

There are 8 SPs that handle the single precision float computes and most logic.
Normally a warp is issued and each SP handles one thread. The SPs are invoked 4 times in a row, all with the same instruction, so the warp completes after 4 clocks.
After that, the next warp takes over. There’s pipelining latency involved as well, mostly for resolving the registers, but that doesn’t change the basic scheduling of “issue warp A over 4 clocks to cover its 32 threads, then go to the next warp and repeat.”

Now, my question, how does DOUBLE support work? There’s ONE double precision unit (not 8).

Does warp A take 32 clocks, each tick filling a thread, and then after this is done, the next warp comes in?
I suspect that’s what happens.

BUT what if there are other ready threads that don’t use the double precision unit? Do they get scheduled in parallel if they are only using the normal SPs?
That is, can there be simultaneous execution of single precision math in one warp and double precision math in another warp?
This seems like an obvious way to hide the bottleneck of the low double-precision throughput.

You can ask the same question about transcendental support. If I recall correctly there are TWO units that handle this, so that takes 16 clocks?
(Is that right? 8 single-precision SP workhorses, 2 transcendental units, and one double unit, in GT200 at least.) I could see that perhaps after 4 clocks of DP computes a new SP-only warp could also start (I understand there’s a 4-clock delay to let the instruction decoder prep the next warp’s instruction).
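For reference, here is the back-of-envelope arithmetic I am assuming (the unit counts are my guess at the hardware, not official numbers):

    /* Back-of-envelope issue clocks per 32-thread warp, assuming the unit
     * counts discussed above (8 SP, 2 SFU, 1 DP per SM). These counts are
     * my assumption, not an official specification. */
    #include <stdio.h>

    int main(void)
    {
        const int warp_size = 32;
        const int sp_units  = 8;  /* single-precision "workhorse" units      */
        const int sfu_units = 2;  /* transcendental (special function) units */
        const int dp_units  = 1;  /* double-precision unit                   */

        printf("SP  op: %d clocks/warp\n", warp_size / sp_units);  /* 4  */
        printf("SFU op: %d clocks/warp\n", warp_size / sfu_units); /* 16 */
        printf("DP  op: %d clocks/warp\n", warp_size / dp_units);  /* 32 */
        return 0;
    }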

I have code that needs to do occasional DP computes, and it’s useful to me to understand how it will affect my performance. If the hardware can “double run” both types of units, then it’s like some occasional DP computes are FREE, since their occasional use won’t interfere with the main execution pipe. But if using DP units actually just lets the single SP units idle, I’ll lose real compute power.

Thanks for any help!

At this time the double and single units do not work at the same time. Given the reactions of the NVIDIA people I asked about it, it looks like something they would like to change in the future, though.
And about the compute power lost: that is not so easy to determine.

  • Most kernels are bandwidth bound, so if you are reading & writing singles, you basically get the doubles for free (see the back-of-envelope check below).
  • Doubles may need more registers (depends very much on your kernel, of course), so you might end up with lower occupancy (which may or may not be a problem).

So in general, you cannot really say you will lose compute power, unless you are one of the happy people who are actually compute-bound ;)
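To make the bandwidth-bound point a little more concrete, here is the rough check I have in mind (the peak figures are just the commonly quoted ballpark numbers for a GTX 280, not measurements of mine):

    /* Rough arithmetic-intensity check for a GTX 280-class card.
     * Peak numbers are ballpark figures, not measured values. */
    #include <stdio.h>

    int main(void)
    {
        const double mem_bw_gbs     = 141.0; /* ~141 GB/s global memory bandwidth  */
        const double dp_peak_gflops = 78.0;  /* ~78 GFLOP/s double-precision peak  */
        const double bytes_per_elem = 16.0;  /* read one double + write one double */

        /* DP flops you can "afford" per element before compute, rather than
         * memory, becomes the bottleneck. */
        double breakeven = bytes_per_elem * dp_peak_gflops / mem_bw_gbs;
        printf("~%.1f DP flops per 16 bytes moved come for free\n", breakeven);
        return 0;
    }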

Agreed, it was quite logical of NVIDIA to increase the number of registers to 16k after enabling double operations.

However, my experience quickly showed that this is not enough. Most 64-bit functions seem to be computed as subprograms (this is rather likely). This leads to a huge increase in register use compared to float computations. I tried changing one piece of code to use doubles, and immediately had to decrease the number of threads per block to 128 (which is almost always not enough to utilize the memory bandwidth).

My conclusion is that to make ‘double’ things work, NVIDIA will have to:

  1. Increase the number of registers even more than in sm_13;
  2. Compute double functions on-chip;
  3. And certainly, increase the shared memory size! In sm_13 they missed this important point. Maybe my case is not typical, but I constantly run into a lack of shared memory, which is the only fast read-write memory on board!

The current implementation of double does not seem quite usable, even for small computations.

That is compared to hardware with 8k registers? Also, why are you speaking of subprograms? The GT200 architecture has a double-precision unit in hardware; it is not a software solution where 4 float calculations are done to obtain 1 double result.

That is completely different from the presentations I have seen so far by people who are using double precision. They are seeing less than the 8x slowdown you would expect with 1 DP unit vs. 8 SP units per SM.

More shared memory would indeed always be welcome, but as far as I understood, it would require a large amount of die-space (and the GT200 is already huge!).

Indeed, but I don’t think the double precision unit supports the gamma function or error function in hardware…

Vijay Pande (of folding@home) happened to mention at NVISION 08 that one of the surprise benefits of CUDA was the speed of erfc().

That apparently is used a lot in some kinds of molecular simulations, and x86 CPUs were especially slow at it; a measurable share of the slowdown of their x86 code was that erfc() call.

That doesn’t tell us whether erfc() is a fundamental hardware instruction on the GPU, but it was interesting nonetheless.
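For reference, erfc() is exposed like any other math function in device code, so it is easy to try out (a toy kernel of my own, nothing to do with the folding@home code):

    /* Toy kernel: evaluate the complementary error function per element.
     * erfcf() is the single-precision variant; erfc() is the double one. */
    __global__ void erfc_kernel(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = erfcf(in[i]);
    }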

What I noticed is a large (more than twofold) increase in register use when switching from float to double. This is not good. The idea that the functions are implemented as subprograms is only a guess to explain this behavior.

More shared memory should not require that much extra space. The current amount is 16384 bytes (128 Kbit) per MP; say 30 × 128 Kbit for a GTX 280, which at roughly 6 transistors per SRAM bit comes to something like 24 million transistors. That is only about 2% of all the transistors (and, correspondingly, area) on the chip. They could easily have spent more area on fast memory, and for me this would be great. Doubling the shared memory would add only a few percent to chip area, chip price, and heat dissipation - wouldn’t you pay that?

Another approach is creating at least a small cache for global memory. Even a small amount of cache per MP would greatly help with (at least) continuous read patterns (keeping in mind that a read-only cache is a much easier thing to build for a multiprocessor than a read-write cache). Without a cache, we have to implement the same scenario (read into shared memory across threads, distribute the data, do the work) again and again, which greatly decreases program efficiency and maintainability and makes the kernel code less understandable.
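The scenario I mean is roughly this boilerplate, repeated in kernel after kernel (just a sketch of mine; it assumes blockDim.x == 256 and an input size that is a multiple of the block size):

    /* The staging pattern: each block cooperatively loads a tile into
     * shared memory, syncs, then every thread works on data that may have
     * been fetched by a different thread. */
    __global__ void process_tiles(const float *in, float *out)
    {
        __shared__ float tile[256];          /* assumes blockDim.x == 256 */

        int i = blockIdx.x * blockDim.x + threadIdx.x;

        tile[threadIdx.x] = in[i];           /* cooperative, coalesced load */
        __syncthreads();

        /* "work": here just an average of the two tile neighbors */
        float left  = tile[(threadIdx.x + blockDim.x - 1) % blockDim.x];
        float right = tile[(threadIdx.x + 1) % blockDim.x];
        out[i] = 0.5f * (left + right);
    }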

Sounds rather pessimistic, huh? ;-)

Anyway, big thanks to NVIDIA for these quick devices, which are already usable to some degree :-)

I think he is referring to all of the math functions, like sin(), cos(), exp(), etc… The double precision versions of these functions do some argument reduction, then evaluate a polynomial expansion of the transcendental on a restricted range. These subprograms can inflate register usage a lot.

(Originally, I thought that single precision transcendentals used argument reduction, followed by a call to the intrinsic hardware function, like __sin(). However, now reading the math_functions.h header, I see that they too use a polynomial expansion in a limited range. The polynomial expansion in single precision is shorter, simply because fewer terms in the series are needed to reach the desired accuracy in this case.)
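Roughly, the structure looks like this (an illustrative sketch of my own with plain Taylor coefficients; the real math_functions.h code uses minimax coefficients and a much more careful reduction step):

    /* Illustrative only: crude argument reduction followed by a short
     * polynomial, which is the general shape of the library routines. */
    __device__ float my_sinf(float x)
    {
        const float two_pi = 6.2831853f;

        /* crude range reduction to roughly [-pi, pi] */
        x = x - two_pi * rintf(x / two_pi);

        /* odd polynomial in x (Taylor series of sin), Horner style */
        float x2 = x * x;
        return x * (1.0f + x2 * (-1.0f / 6.0f
                  + x2 * (1.0f / 120.0f
                  + x2 * (-1.0f / 5040.0f))));
    }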

Continuous read patterns should be coalesced, so I do not really understand why a cache would help. Doesn’t linear texture memory help in your case?

A more than twofold increase in registers is very interesting (and bad); can you see in the generated ptx what happened to cause that? Luckily, I have so far been able to rewrite my code to not require double precision, but I am sure the day will come…
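For the register counts themselves, ptxas will print them if you pass --ptxas-options=-v to nvcc; something like this makes the float/double comparison easy (the kernels are just placeholders of mine):

    /* Compile with:  nvcc -arch=sm_13 --ptxas-options=-v regs.cu
     * ptxas then reports registers (plus lmem/smem) per kernel, so the
     * float and double versions can be compared directly. */
    __global__ void axpy_float(float a, const float *x, float *y, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    __global__ void axpy_double(double a, const double *x, double *y, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }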

Actually, just a side question, but are fundamental DP ops like +, -, and * even 1-clock-throughput operations?
Is there an FMADD for DP?

Side side question, hopefully someone will make a nice test harness for us to learn the throughput clock counts for all the different math ops. I think we’ve all hacked our own tests many times but I never get great results where I am confident of exact clock timings.
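For what it’s worth, the kind of bare-bones probe I have in mind looks like this (a sketch of my own using the per-SM clock() counter; with a single warp it measures the latency of a dependent chain rather than throughput, so you would want enough warps per SM to saturate the unit):

    /* Time a long dependent chain of double-precision multiply-adds with
     * the on-chip clock() counter. On sm_13 the a*c+b should map to the
     * double fused multiply-add. */
    __global__ void dp_fma_timing(double *result, clock_t *cycles)
    {
        double a = 1.0000001, b = 0.9999999, c = 0.5;

        clock_t start = clock();
        #pragma unroll 16
        for (int i = 0; i < 4096; ++i)
            c = a * c + b;               /* dependent chain: no overlap */
        clock_t stop = clock();

        if (threadIdx.x == 0) {
            result[blockIdx.x] = c;      /* keep the work from being optimized away */
            cycles[blockIdx.x] = stop - start;
        }
    }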

By continuous read I mean reading chunks of data one by one by the same thread. This is not coalesced.

Sure, it’s a good idea to change the kernel so that neighboring reads are performed by neighboring threads. Still, it depends on the task and is not always possible.
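Just so we are talking about the same thing, here are the two patterns side by side (a sketch of mine with hypothetical kernel names; the coalesced version also implies a different data layout):

    /* "Continuous read": each thread walks its own contiguous chunk, so
     * neighboring threads touch addresses chunk elements apart and the
     * accesses are not coalesced on sm_13. */
    __global__ void per_thread_chunks(const float *in, float *out, int chunk)
    {
        int t = blockIdx.x * blockDim.x + threadIdx.x;
        float sum = 0.0f;
        for (int j = 0; j < chunk; ++j)
            sum += in[t * chunk + j];
        out[t] = sum;
    }

    /* Coalesced version: neighboring threads read neighboring addresses,
     * which requires the data to be laid out thread-interleaved. */
    __global__ void interleaved_chunks(const float *in, float *out, int chunk)
    {
        int t       = blockIdx.x * blockDim.x + threadIdx.x;
        int threads = gridDim.x * blockDim.x;
        float sum = 0.0f;
        for (int j = 0; j < chunk; ++j)
            sum += in[j * threads + t];
        out[t] = sum;
    }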

Increasing shared mem is more than just adding 6 transistors per bit. There’s support hardware to facilitate dynamic addressing, and latency to think about. Increasing registers, I’m guessing, is much easier than increasing shared mem.

But obviously the main reason shared mem wasn’t increased is that there’d be no benefit to DirectX.

But that’s OK, because what you have to keep in mind is that the register file is also a powerful memory. In-register arrays can be used for many things, and they’re in fact twice as fast as shared memory (because you don’t need to do any address calculations, and also because registers are fundamentally somewhat faster).

I don’t know. Perhaps increasing shared mem is a good idea. But in general, we can’t keep wishing for the SMs to get bigger and bigger. After all, that takes away from the total power of the chip.

There’s fused multiply-add, which has higher precision than the single-precision MAD over and above the single vs. double difference (there’s no loss of precision between the multiply and the add).
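A tiny example of the difference (a sketch; __dmul_rn and __dadd_rn are the round-to-nearest double intrinsics, which are never contracted into an FMA, while fma() rounds only once at the end):

    /* Compare a separately rounded multiply+add against the fused version. */
    __global__ void fma_vs_separate(const double *a, const double *b,
                                    const double *c, double *out)
    {
        int i = threadIdx.x;
        double separate = __dadd_rn(__dmul_rn(a[i], b[i]), c[i]); /* two roundings */
        double fused    = fma(a[i], b[i], c[i]);                  /* one rounding  */
        out[i] = fused - separate;  /* nonzero whenever the intermediate rounding mattered */
    }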

Sure, it’s quite obvious that the current graphics card architecture is tuned for graphics. So all of the above are just my wishes, which could increase the suitability of these devices for a wider range of computational tasks.

Btw, what do you mean by mentioning “in-register arrays”? As far as I know registers do not support indexed addressing. Or am I mistaken?

That’s where #pragma unroll comes in. When it works (and doesn’t silently tell you to f off… thanks, CUDA implementors!), it turns dynamic indexing into static indexing and your array is placed into registers. (Also, you may need -maxrregcount.)
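For example (a sketch of mine; whether the array really ends up in registers is worth verifying with --ptxas-options=-v, since any index ptxas cannot resolve statically spills the whole array to local memory):

    /* Small per-thread array kept in registers: with the loops fully
     * unrolled, every index into acc[] is a compile-time constant.
     * Assumes the grid exactly covers the input (8 elements per thread). */
    __global__ void register_array_example(const float *in, float *out)
    {
        float acc[8];
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        #pragma unroll
        for (int k = 0; k < 8; ++k)
            acc[k] = in[i * 8 + k];

        float sum = 0.0f;
        #pragma unroll
        for (int k = 0; k < 8; ++k)
            sum += acc[k];

        out[i] = sum;
    }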