Programming Guide 2.0 -- feedback

As a CUDA newbie poring over CUDA Programming Guide Version 2.0 (dated 6/7/2008), I’d like to offer some feedback. This is just my own take as a reader, not necessarily errata. But I hope it might be a little useful for the tech writers.


It’s not clear to me on first or second reading why the broadcast mechanism described in Fig. 5-8 does not solve the 8-way bank conflict in the right side of Fig. 5-7. Perhaps some minor clarification would make this more obvious.

Section 5.4

But reads for constants are cached too, right, and locality should apply? So is there an implied difference in the constant/texture cache, or did this bullet refer to a difference from global reads only?

Section 5.5

Referring to this paragraph:

My brain has to double-parse this because global memory is “device memory” as described in 3.1. The paragraph is correct as written, but the double meaning trips me up. Rewording “data transfers between the device and global memory” as “data transfers between the device and its global memory”, or something similar, would keep me from reading it as “device- and global-memory” the first time.

In fact, throughout the whole document I have to remind myself that “device memory” is the big, high-latency, off-chip memory, which includes local/global/texture/constant memory, but “device memory” does not include the on-device registers, shared memory, or constant/texture caches. So “device memory” is viewed as being on the device from the point of view of the host, but not from the point of view of a “transfer between the device and global memory”. It makes me wish there was a better name for “device memory” but I can’t suggest one.

Section A.1.3

The specifications for Compute Capability 1.2 make no mention of the relaxed criteria for coalescing described elsewhere in the guide. As a programmer, it looks like a big win not to have to worry as much about the coalescing criteria as in CC 1.0/1.1.

Section B.2.2

I wasn’t sure how that jibes with this from A.2:

I’m guessing that both are true, but that for doubles FMADs don’t truncate the intermediate results of the mul. Maybe that could be clarified.
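To make the distinction concrete, here is a minimal host-side sketch in Python (not CUDA device code) of why it matters whether the intermediate product is rounded before the add. The function names `unfused_mad` and `fused_mad` are hypothetical, and the unfused version rounds the product to nearest rather than truncating it, so this only models the general effect, not the exact behavior of any GPU.

```python
from fractions import Fraction


def unfused_mad(a, b, c):
    """Multiply, round the product to double, then add (two roundings)."""
    product = a * b  # the product is rounded to double precision here
    return product + c


def fused_mad(a, b, c):
    """Model a fused multiply-add: compute a*b + c exactly, round once."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)  # exact rational arithmetic
    return float(exact)  # a single rounding at the very end


# Inputs chosen so the exact product is 1 - 2**-60, which rounds to 1.0.
a = 1.0 + 2.0 ** -30
b = 1.0 - 2.0 ** -30
c = -1.0
```

With these inputs, `unfused_mad(a, b, c)` returns `0.0` because the tiny term is lost when the product is rounded, while `fused_mad(a, b, c)` keeps it and returns `-2.0 ** -60`.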


What I know off the top of my head:

  1. The broadcast mechanism only solves bank conflicts if every thread is accessing the same element. It is literally one-bank-to-16-threads.

  2. There is locality with constants as well, but I believe this section is primarily comparing global memory accesses to texture fetches.

  3. Coalescing is separate from Compute specifications. To me, coalescing seems like an implementation detail, while the others are inextricably tied to how programs will run on the GPU. Adding coalescing to the specifications could potentially necessitate a different spec for each card.

tmurray, thanks for looking that over.

Then I personally just don’t understand the no-conflicts example in the right half of Fig. 5-8. I’m not saying I’m not a bonehead, but I felt I understood bank conflicts pretty easily until I hit the broadcast section.

Yet coalescing criteria are defined specifically by Compute Capability number, and App. A is the first place I look to know what’s different based on CC.

Ahha! I forgot about this. You get one broadcast. So you can have N-threads-to-one-bank per broadcast, but you only get one broadcast. More than one and it’s a bank conflict.
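The "one broadcast per step" rule can be sketched as a small host-side model in Python (not CUDA device code). It assumes 16 banks of 32-bit words and one half-warp of 16 threads, and it services a request in steps: each step serves every thread reading one chosen broadcast word, plus one address for each other bank. The hardware's choice of broadcast word is unspecified; picking the most-requested word is an assumption of this sketch, and `serialized_steps` is a hypothetical name.

```python
from collections import Counter

NUM_BANKS = 16  # assumed: shared-memory banks on compute capability 1.x


def bank_of(word_index):
    """Successive 32-bit words map to successive banks."""
    return word_index % NUM_BANKS


def serialized_steps(words):
    """Count the steps needed to service one half-warp read request.

    `words` holds the 32-bit word index read by each thread.
    """
    remaining = list(words)
    steps = 0
    while remaining:
        counts = Counter(remaining)
        # Assumption: pick the most-requested word (lowest index on ties);
        # real hardware does not specify which word it broadcasts.
        broadcast = max(counts, key=lambda w: (counts[w], -w))
        busy_banks = {bank_of(broadcast)}
        leftover = []
        for w in remaining:
            if w == broadcast:
                continue  # every thread reading the broadcast word is served
            bank = bank_of(w)
            if bank in busy_banks:
                leftover.append(w)  # bank already used this step, so wait
            else:
                busy_banks.add(bank)  # one address served for this bank
        remaining = leftover
        steps += 1
    return steps
```

Under this model, `serialized_steps(range(16))` and `serialized_steps([5] * 16)` both take 1 step, while a stride of eight words, `serialized_steps([t * 8 for t in range(16)])`, takes 8 steps: the threads hit only two banks but sixteen different words, so the broadcast cannot help. Two words in the same bank, each read by eight threads (`[0] * 8 + [16] * 8`), take 2 steps, one broadcast each.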

As for coalescing, I’m assuming that’s just for brevity in the coalescing section (“all devices with these capabilities operate like this”) and not that coalescing is a required part of the spec. But, I’ll look into it and try to get that clarified in the future.

Thanks for your feedback.


Truly appreciate your time and effort on making others’ lives easier!

It’s always good to give such feedback! I remember starting a big thread on “what is device memory” almost a year ago! It was too confusing for a beginner!

I have found NVIDIA to be receptive to constructive feedback! I hope they will take this up seriously!!

Continue the good job!

Best Regards,

This seems like a good place to reiterate my request for hyperlinks in the PDFs. It would save a lot of scrolling and clicking around to switch between places.

It was also never clear to me from reading the programming guide that coalesced global memory reads are faster than texture lookups. In practice, this seems to be the case, but section 5.4 seems to imply that there is really no reason to ever use global memory instead of textures. A little more clarity about this would be nice.

It’s also confusing how the functions are currently split up between the programming guide and the reference manual. Plus, to see the cost of various math functions in the programming guide (section B.1), one must return to section 5.1.1. It would be nice to have an extra column in the table giving the number of cycles each requires.