is there a document, manual or even book that describes the device/architecture hardware at a level lower and more comprehensive than the programming guide, the architecture whitepaper or the ptx document?
for instance, both intel and amd generally publish documents that describe their chipsets in detail, per functional block/unit, and so forth
Given the highly competitive nature of the graphics business, Nvidia considers that information proprietary. Which is rather at odds with being an HPC company as well: understanding the low-level architecture is a requirement for writing the highest-performing code.
If you’re curious and have the time you can use my Maxwell assembler to write clever little probes of their hardware and discover just about any detail you like. I’ve done this to some extent but haven’t had the time to be thorough about it.
“Given the highly competitive nature of the graphics business, Nvidia considers that information proprietary”
i suppose that then makes you a cuda villain…
perhaps you can write a ‘villain’ book, disclosing such information
black market is a beautiful place
i really need to understand the load/store unit(s) better, to understand global memory reads better, to write my kernels better; every second turn i run into a global memory read
That’s an area I’d love to understand better as well (especially the texture cache). Wish I had the time to construct some proper probes of that.
you really work for intel/amd, don’t you; ‘probing’ nvidia’s prized architecture
have you given thought to load/store unit probes to date? what do you think would be good probes in that regard?
share some thought, would you?
Well I just posted some latency and throughput numbers in another thread. But that’s the easy stuff to measure. What I’m more interested in is the exact behavior of the caches under a range of access patterns. So a probe would be to set up those access patterns and see when you’re hitting cache (texture or L2) and when you’re not.
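A minimal sketch of what such a probe could look like (all names and constants here are illustrative, not from an actual probe): a pointer-chasing chain of dependent loads timed with clock64(), so the elapsed cycles approximate the per-load latency for whatever access pattern the data array encodes.

```cuda
// Hypothetical cache probe: each load depends on the previous one, so the
// measured time approximates per-load latency under the chosen access
// pattern. The index pattern stored in data[] would be varied per experiment
// (sequential, strided, random) to see where the texture/L2 caches hit.
__global__ void latency_probe(const int * __restrict__ data, long long *out)
{
    int idx = 0;
    long long start = clock64();
    #pragma unroll 1                      // keep the dependent chain intact
    for (int i = 0; i < 256; ++i)
        idx = __ldg(&data[idx]);          // route the load via the texture cache
    long long stop = clock64();
    // fold idx into the result so the compiler cannot eliminate the loads
    out[0] = (stop - start) + (idx == -1);
}
```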
As for general advice about accessing global memory: the compiler tries to do this for you somewhat, but do unroll your loops a bit so that you’re loading several global values at a time (ideally with a vector load) before you proceed to process them.
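As a sketch of that pattern (the kernel and its names are illustrative): one 128-bit vector load pulls in four floats at once, which are then all processed before the store.

```cuda
// Illustrative batched-load pattern: one vector load instead of four scalar
// loads, then process the whole batch. n4 is the element count in float4s.
__global__ void scale(const float4 *in, float4 *out, float s, int n4)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) {
        float4 v = in[i];     // single 128-bit load
        v.x *= s; v.y *= s; v.z *= s; v.w *= s;
        out[i] = v;           // single 128-bit store
    }
}
```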
Unrolling inner loops to allow loads to be batched and issued early is a strategy the compiler uses quite often, so before starting on manual code manipulations it is a good idea to inspect the generated machine code to see whether the desired code structure is already present. It is also a good idea to convey the maximum amount of information to the compiler via source code, by using the __restrict__ modifier wherever appropriate. The latter in particular allows the compiler to schedule loads more freely. Use volatile as sparingly as possible, as it interferes with the compiler’s ability to move variables into registers and avoid loads.
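A minimal sketch of those qualifiers in use (the kernel itself is just a generic example): marking both pointers __restrict__, and the input const, promises the compiler they do not alias, so it is free to hoist and batch the loads ahead of the stores.

```cuda
// const + __restrict__ tell the compiler x and y do not alias, allowing it
// to schedule the loads of x[i] and y[i] freely relative to the store.
__global__ void axpy(int n, float a,
                     const float * __restrict__ x,
                     float       * __restrict__ y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}
```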
Note that load batching and early issue of loads, while beneficial in helping to lessen the impact of load latency, can easily increase register pressure, possibly by very significant amounts. This is where the CUDA compiler frequently got into trouble in the past, in particular with double-precision code on the Fermi architecture. Higher register pressure led to decreased occupancy and/or register spilling, actually reducing the performance. This is much less of a problem with architecture sm_35 and later which provide more copious register resources, and my observation is that the compiler adapts to that by using schedule-ahead more aggressively on recent architectures.
GPUs require natural alignment of all loads. Since there is often insufficient information about data alignment, the compiler cannot easily auto-vectorize loads, as a wider vector load with its more stringent alignment requirements could fail (i.e. invoke undefined behavior) after a simple pointer conversion, say from float* to float4*. This is best addressed by using appropriate vector types in the source code, meaning the programmer takes responsibility for maintaining proper alignment for all loads.
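To illustrate the responsibility that comes with the vector types (a hypothetical host-side snippet, not from any particular codebase): a float4 load requires 16-byte alignment, so a pointer reinterpretation is only valid when the address satisfies it.

```cuda
#include <cuda_runtime.h>

void alignment_example(void)
{
    float *buf = 0;
    // cudaMalloc returns pointers aligned to at least 256 bytes, so the
    // base address satisfies float4's 16-byte alignment requirement.
    cudaMalloc((void **)&buf, 1024 * sizeof(float));
    float4 *vec = reinterpret_cast<float4 *>(buf);        // OK: base aligned
    // A 4-byte offset breaks 16-byte alignment; a vector load through this
    // pointer would invoke undefined behavior:
    // float4 *bad = reinterpret_cast<float4 *>(buf + 1);
    (void)vec;
    cudaFree(buf);
}
```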
The GPU has finite-depth queues for tracking loads and stores. Once a queue is full, a stall will result. The total amount of data tracked by pending operations in such a queue increases with the width of each access. Assuming most of the accessed data is actually consumed, memory throughput is therefore optimized by using loads that are as wide as possible. In particular one should strive to avoid loads that are narrower than 32 bits.
That’s all excellent advice. Which reminds me… I wanted to construct a probe to measure the depth of those memory load queues. I’m guessing the size is about 4. I know the unload rate at least is 1 per 8 clocks (this is the throughput of LDG).
I am reasonably sure everything I have covered above is also covered in NVIDIA’s official documentation, but I am too lazy to track down chapter and verse :-)
While I agree that there can be legitimate needs and uses for detailed hardware specifications, especially for ninja programmers, I will also opine that the vast majority of CUDA programmers will never desire or need this information. That said, I have no knowledge about the depth of any of the queues in current GPUs.
Over the past nine years of programming with CUDA I have increasingly abstracted my optimization process from detailed hardware specifics. A new hardware generation rolls around every two years or so, usually with significantly different hardware characteristics compared to the previous architecture. This makes many hardware details throwaway knowledge that I consider a luxury to acquire. In addition, much of the code I was involved with needs to run well across two or three generations of GPUs, which involves some level of compromise due to architectural differences. My focus has therefore been on how to use existing tools and libraries well, and to provide feedback to improve them, in the spirit of “a rising tide lifts all boats”.
I would claim that my “broad strokes” optimization approach has served me well, although I have on occasion been “scooped” by colleagues who did take the time to understand a lot of the details of specific GPU hardware. In addition, in the big picture of things I have found that wide knowledge in the relevant application domain is a bigger factor in a successful CUDA implementation than loads of CUDA knowledge, much of which can be acquired relatively quickly with the help of existing documentation augmented by some amount of experimentation.
My own take on optimization is that I’m just not satisfied with tools that behave in unpredictable ways (nvcc, ptxas). I’m ok with using high level abstractions, I just want to fully understand how they work under the hood, and I want the ability to bypass them for those edge cases where the abstraction just doesn’t fit.
If I had enough time I’d like to build a new language on top of my assembler that does exactly that. It would manage all the tedious aspects of programming but give you full control over the overall layout of your code, both at the macro and micro level. It would not try any fancy but unpredictable optimization techniques (unless explicitly requested to do so), and would instead trust that the programmer understands the hardware and knows what she’s doing.
Alas, work is currently keeping me too busy to tackle this challenge for now.
Does “tools that behave in unpredictable ways” include all compilers, regardless of platform? Today’s sophisticated tool chains incorporate myriad heuristics, and no matter what tool chain I have used up till now, I have always run into unexpected or undesirable code generation issues somewhere. In the olden days, I did quite a bit of Z80, x86, SPARC, and ARM assembly language programming for optimal control and performance, and I studied the micro architecture of x86 processors in detail (sometimes excruciating detail!). It often was fun, too. But over the years I came to the conclusion that from a software engineering and business perspective, such low-level programming does not make sense, beyond some unavoidable minimum.
Dual goals of ease of programming and full control down to the micro level seem contradictory to me. It may be possible to achieve that in a domain-specific language (proof pending), but for a general-purpose programming environment it seems akin to squaring the circle to me.
In practical terms, you could consider lobbying NVIDIA for the release of more detailed hardware documentation, and/or an assembler. PTXAS obviously is, despite what the name may suggest, a compiler, with all the loss of control that typically entails.
"trust that the programmer understands the hardware and knows what she’s doing. "
the programmer is a she?
nevertheless, well said
“Well I just posted some latency and throughput numbers in another thread”
which thread, crouching_tiger_hidden_probe?
also, do these measurements reflect a) data known to be cached (a.i in L1; a.ii in L2) and b) data known not to be cached?
secondly, can you perhaps tie ptx instructions like prefetch and ld in with the mentioned queues of the load/store units? what is the difference between prefetch and load - does prefetch retire when the data is in cache, and load only when the data is in/on the sm?
thirdly, do you think loads - anything in the ld/store unit queue - can retire out of order; for example, can it happen that a common load (local/shared memory) is preceded and succeeded by global reads? such a load would theoretically be able to conclude quicker, not so?
at least in my case, a common occurrence is something like: read global memory, do some work, read more global memory, continue the work - if the queues are that finite, this may very well lead to stalling the load/ store unit (it commences with too many global reads), whilst it could have serviced other functional units of the sm in the meantime…?
njuffa: I think GPU programming (or at least the kind that I’m interested in) is highly compatible with the kind of tool I’d like to build. My field is AI, where the intelligence lives in the data. The GPU is just there to stream it through in the most optimal way. I don’t want or need a huge amount of complexity in the code I write. And trying to write optimal streaming computational code is just not possible with the “Will It Blend™” ptxas. Nvidia knows this and hand-assembles their gemm kernels (and is working on hand-assembled cuDNN convolution code as well).
jimmy: I don’t know why the English language doesn’t have a gender-neutral pronoun, so I like to mix it up. Look at the recent __shfl thread for those numbers. I have more if you’re curious about other latencies (cached and uncached).
I haven’t played with prefetch. I’m not sure there’s a good use case for it. You’re probably better off supplementing TLP with enough ILP to hide your latencies.
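A sketch of the ILP idea (illustrative kernel, not a benchmark): issue several independent loads before any of them are consumed, so their latencies overlap within a single thread rather than relying on warp switching alone.

```cuda
// Four independent loads are in flight at once before the arithmetic that
// consumes them, hiding part of the memory latency through ILP.
__global__ void sum4(const float *in, float *out, int n)
{
    int i = 4 * (blockIdx.x * blockDim.x + threadIdx.x);
    if (i + 3 < n) {
        float a = in[i],     b = in[i + 1];   // independent loads:
        float c = in[i + 2], d = in[i + 3];   // no load depends on another
        out[i / 4] = (a + b) + (c + d);       // work begins after all issue
    }
}
```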
As for loading out of order, I know for a fact this happens with shared memory loads at least. This happens when you overflow the queue. At this point you can no longer count on barriers to synchronize access. Only a shared store will guarantee that. I think Pascal may fix this issue (I talked to the lead of Nvidia’s arch group about this).
As far as mixed shared/global loads, yes by design those can happen out of order. Why would you want a 25 clock request to have to wait for a 200+ clock request?
Otherwise it’s all about balancing the load and compute to hide as much latency as possible. And things are generally more efficient in batches.
" I don’t know why the english language doesn’t have a gender neutral pronoun"
otherwise it would be German…
“As far as mixed shared/global loads, yes by design those can happen out of order”
if a load retires out of order, would this free up space in the ld/store unit queue, or not?
“Otherwise it’s all about balancing the load and compute to hide as much latency as possible”
and the compiler manages this for you…? at the very least, i suppose it is trying to
are you also saying that the compiler knows it cannot overflow the queue?
i suppose prefetch has some use when you know the data is to be reused, and would not change (the instruction-level equivalent of constant or restrict perhaps…?)
Obviously completely off topic, but there is no functional difference between German and English in this regard. German has er/sie/es where English has he/she/it. One workaround used in English is to use “they”, but purists would argue this is incorrect in grammatical terms (although from what I understand this singular “they” has some historical precedent going back several hundred years).
I think there may be separate queues for shared and global, though I’m not sure. It’s clear they share much of the same hardware resources, I’m just not sure how much. Globals may stall when the queue is full whereas it’s clear shared is allowed to overflow.
The compiler tries to manage the balance for you, but it is often far from optimal.
Again, I just don’t see a use for prefetch. It just seems like a waste of an instruction to me. The hardware is designed to hide latencies. If one warp stalls because of memory access, another takes its place. Maybe there’s a use case for it, but I can’t think of one at the moment. I’ve never seen the compiler generate it (though my experience with that is limited).
For what it is worth, I have never been able to find a use for prefetch on the GPU either. Thinking back to my x86 days, it was pretty much the same story: once CPU hardware acquired the capability of doing its own prefetching based on stride patterns, prefetch instructions pretty much became redundant.
I would love to hear of any evidence to the contrary, i.e. use cases where it can be demonstrated that prefetch instructions provide useful performance benefits with modern GPUs (Kepler, Maxwell) or x86 CPUs (IvyBridge, Haswell).
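For anyone wanting to experiment along these lines, one plausible way to issue a prefetch from CUDA C is via inline PTX (a hedged sketch; whether the instruction survives ptxas, or helps at all, is exactly the open question here):

```cuda
// Issue a PTX prefetch of a global address into the L1 cache. The helper
// name is illustrative; prefetch.global.L2 is the corresponding L2 variant.
__device__ __forceinline__ void prefetch_l1(const void *p)
{
    asm volatile("prefetch.global.L1 [%0];" :: "l"(p));
}
```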
I just thought of a case where it might be useful and tried it, but ptxas optimized that instruction away. Even tried converting the address to the “generic” space. And I can’t use my assembler to test it because I need to disassemble an example before I can extract the opcode. Might be that maxwell doesn’t support it.
Does the instruction get optimized out even when you specify -Xptxas -O0 to disable backend optimizations? Your hypothesis that it might have been removed because it is a no-op on Maxwell sounds plausible to me.