I have a double precision kernel and, in order to minimize register usage… I have a lot of RAW (read-after-write) dependencies in the main computation function.
I read in the programming guide that once we have 192 threads per block, the GPU hides this RAW latency.
Is that also valid for double precision register variables, or do we need more threads per block? (Each double requires two registers.)
Do any NVIDIA guys know this … please?
Thanks very much
I would not be surprised if hiding read-after-write dependencies requires fewer threads with double precision. Since there is only one DP unit per multiprocessor, the threads in your warp have to propagate through it serially when performing double precision operations. If one warp is more than enough to fill that pipeline, then read-after-write dependencies would no longer matter.
It would be nice if someone with a direct understanding of the hardware could comment, but I wouldn’t worry about it for now.
Yo! I too have the same feeling as Seibert…
But I think it is better to have 192 threads running, just to be on the safe side.