I wrote a small kernel in PTX to try to measure global memory access latency using the %clock register. It appears that even for coalescable loads, the latency is as much as 1200 cycles. Now, the official party line seems to be that global memory latency is about 400 cycles. This leaves me a little confused. Am I doing something wrong here? Can somebody show me how this number of 400 cycles was arrived at?
PTX instructions can be arbitrarily reordered by the ptxas assembler. You can check with decuda what the final assembly code is. There is no real advantage in using PTX over CUDA, as most optimizations occur during the ptxas phase.
The trick to prevent reordering is to introduce false dependencies. For example:
volatile unsigned int start_time;
volatile unsigned int end_time;
start_time = clock();
value = data[j];
start_time ^= value;   // make the second clock() depend on the load
end_time = clock();
start_time ^= value;   // XOR again to recover the original timestamp
You might actually get large latencies (up to 750 instead of 450 cycles) due to TLB misses.
Coalescing does not significantly affect latency, only throughput.
Thanks for the answers. I’ll try the trick that introduces the dependencies.
I didn’t realize these things had address translation. This is interesting - do you have any pointers to more information on this?
I tried a kernel that read one element from a large array in 8x8 blocks, and one that read it in 16x16 blocks; the 8x8 version was almost exactly 9 times slower. My explanation is that the 8x8 version issues 9 memory transactions for each transaction issued by the 16x16 version. This seems consistent with your statement that coalescing affects throughput, not latency.
Well, this is mostly speculation in trying to explain the results I get. When repeatedly going through a big block in global memory with a stride of 4K or more, and increasing the size of the block, at some point latencies almost double.
Global memory is not cached, and such a huge variation cannot be caused by DRAM bank conflicts only, so I assume it is due to a second memory access to load a page translation entry.
Also, buggy CUDA code that writes to random locations in memory always(?) causes an Unspecified Launch Failure instead of crashing the machine or altering the framebuffer contents.
BTW, if the GPU didn’t feature address translation, it would open security vulnerabilities - for example, a remotely-logged-in unprivileged user reading another user’s screen by dumping the framebuffer contents through CUDA…
The only source I found that mentioned a MMU is this article by Hiroshige Goto (in Japanese, but figures are in English…)
It seems to claim there is an MMU in the crossbar between TPs and ROPs.
CUDA programs can certainly crash the machine and corrupt the framebuffer. I don’t know whether it is because of a shortcoming in the supposed MMU, bugs in the CUDA driver, or something that only happens when the watchdog kicks in, but it certainly occurs. I may try later to see if I can get any data out when reading non-cudaMalloc’d addresses.
Btw, there is an alternate explanation for variable latencies. The DRAM is organized into channels, and how you access them (for example, sending all accesses to one channel versus spreading them out) affects performance by a large amount.
I think that to test the latency empirically, a kernel should be made that launches only 1 thread and reads an array using dependent operations (the data from one fetch is used to calculate the address of the next). To test the TLB hypothesis, a stride of 256*(# of DRAM channels – either 2, 4, 5, 6, 7, or 8, depending on hardware) should be used to consistently hit the same memory controller, with the size of the memory region varied.
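For concreteness, here is a minimal host-side sketch in plain C of the dependent-access pattern (on the GPU, the chase loop would sit inside a single-thread kernel; the function names and sizes are mine, for illustration). Each element stores the index of the next element to read, so consecutive loads cannot be overlapped:

```c
#include <assert.h>
#include <stdlib.h>

/* Build a pointer-chase ring: a[i] holds the next index to visit.
 * With stride_elems >= one page of elements, every hop lands on a
 * different page, which is what stresses the TLB. */
static unsigned int *make_chain(size_t region_elems, size_t stride_elems)
{
    unsigned int *a = malloc(region_elems * sizeof *a);
    for (size_t i = 0; i < region_elems; i++)
        a[i] = (unsigned int)((i + stride_elems) % region_elems);
    return a;
}

/* Follow the chain for n steps; every load depends on the previous one,
 * so the total time divided by n approximates the per-load latency. */
static unsigned int chase(const unsigned int *a, size_t n)
{
    unsigned int j = 0;
    for (size_t i = 0; i < n; i++)
        j = a[j];   /* dependent load: address comes from the prior load */
    return j;
}
```

Varying region_elems while holding stride_elems fixed reproduces the "vary the size of the memory region" part of the experiment.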
That’s right, but this shouldn’t affect latency. I performed my latency test with 1 block of 1 warp performing dependent coalesced loads. 1 half warp or 1 thread gave similar results. I also ran the code inside a loop to warm instruction caches and TLBs. I measure the latency of the last access I do.
Actually, my results show three latency levels, not two. On a 9800GX2, when varying the data size and keeping a stride of 4K:
from 4K to 64K: 320 ns
from 128K to 8MB: 350 ns
16MB and more: 500 ns
With strides smaller than 4K, latency always stays at 320 ns, which suggests 4K pages.
Larger strides (8K, 16K…) proportionally increase the threshold between the 1st and 2nd levels (128K, 256K…), but the 3rd level is always reached at 16MB.
So I assume there is a first TLB of 16 entries, either fully associative or with a non-trivial address hashing function.
Then, either a (huge!) direct-mapped second TLB of 2048 entries, or more likely a cache containing page table entries. Its latency is 30ns, or around 16 cycles at core frequency.
My personal guess is that PTEs are cached in the L2 instruction/constant cache in each TP, but this needs much more additional testing to confirm (fill the L2 constant cache at the same time to cause conflicts…)
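The thresholds line up with simple coverage arithmetic (the entry counts being the guesses above): a TLB with E entries over P-byte pages covers E*P bytes of working set before it starts missing, once the stride is at least P. A quick check of the numbers:

```c
#include <assert.h>
#include <stddef.h>

/* Working-set size (bytes) at which a TLB with `entries` entries and
 * `page`-byte pages starts missing, assuming stride >= page so each
 * access touches a distinct page. */
static size_t tlb_coverage(size_t entries, size_t page)
{
    return entries * page;
}
```

With 4K pages, 16 entries cover 64K (the first measured breakpoint) and 2048 entries cover 8MB (the second), matching the levels reported above.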
I finally had the time to try this out again, and my measurements on an 8800 GTS 512 indicate that there is a 16-entry TLB, probably fully associative, with page sizes of 512 KB. The TLB miss penalty seems to be about 300 cycles, while the memory read latency is about 450 cycles on average. The only difference in latency between these two levels that was not explained by TLB misses was a slightly higher latency for an array size of 2M with strides around 64K; those accesses seemed to be about 10 or so cycles slower.
I’d appreciate any comments on this.
PS: Apologies for the delay in responding. I was unfortunately very busy with other things during the last few weeks.
I’m going to be busy again over the next week [lots of exams coming up :-(], but I’ll try to take a closer look at what’s happening and let you know about the results as soon as I find some time.
Do you mean conflicts within the DRAM chip or conflicts across the 4 memory banks connected to the GPU? In either case, won’t they be repeatable for specific strides, and be relatively insensitive to array sizes?
So what’s the bottom line? What do we learn about the relationship between latency, occupancy, and bandwidth? Does it change between architectures? Do we have to worry about it at all, besides having, e.g., “at least 256 threads”?
I have another experiment which shows that latency is directly proportional to the number of threads resident on the SMs when all threads are simultaneously making requests. I thought this result was encouraging from the point of view of analytically modelling GPU performance. From a performance-optimization point of view, the implication is rather obvious: fewer memory requests translate directly into reduced execution times.
The moral of the TLB story seems to be that a lot of disparate reads and writes to far-off locations are a lot more expensive, and so should be avoided. There’s a paper by He, Govindaraju et al in SC’07 which addresses this. They attribute a lower value for what they call ‘random bandwidth’ to caching effects. I suspect that they may have been running into TLB effects. They suggest that scatter/gather type operations which are random in nature should be done in multiple passes with each pass accessing a smaller subset of total region for improved locality.
If the page sizes are indeed as big as they seem to be, it looks like the TLB effects can be ignored for the most of the common cases.
I guess most of this is rather obvious from a qualitative point of view. The interesting part for me is the exact quantification of how these factors affect each other.
Oops. A friend of mine had borrowed this machine for a little while. I didn’t realize that he’d logged in to the forums and left the “remember me” option turned on. I should clarify that all errors made in the previous post are mine alone.
So accessing a large dataset randomly hurts bandwidth, and this cannot be hidden by occupancy? At what dataset size is this felt? Or if it can be hidden by occupancy, what’s the “golden threadcount” for small datasets, and for large ones?
My interpretation of their results on random bandwidth is that when their array size is greater than the page size × TLB size (which seems to be 128K × 16 = 2M), the uncoalesced access combined with a large number of TLB misses is what causes their slowdown. Their solution is to perform the same accesses in sequential passes: the first pass writes 0-256K, the second writes 256K-512K, and so on.
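A sketch of that multi-pass idea in plain C (array sizes and names are mine, for illustration): instead of scattering over the whole destination in one go, each pass performs only the writes that land in one window, so the set of pages touched per pass stays small.

```c
#include <assert.h>
#include <stddef.h>

/* Scatter src[i] to dst[idx[i]] in `passes` sequential passes.
 * Pass p performs only the writes whose destination falls in window p,
 * improving locality (and TLB reuse) within each pass. */
static void scatter_multipass(float *dst, const float *src,
                              const unsigned int *idx, size_t n,
                              size_t dst_len, size_t passes)
{
    size_t window = (dst_len + passes - 1) / passes;
    for (size_t p = 0; p < passes; p++) {
        size_t lo = p * window, hi = lo + window;
        for (size_t i = 0; i < n; i++)
            if (idx[i] >= lo && idx[i] < hi)
                dst[idx[i]] = src[i];
    }
}
```

The result is identical to a single-pass scatter; the trade-off is reading the index array once per pass in exchange for better write locality.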
Oops, it looks like the compiler moved one clock() call before or after the load, despite the volatile attribute. Are you using 2.1 Beta?
I should use a couple __syncthreads() to really enforce the execution order, as you do in your program.
Unfortunately I am currently as busy with other things as you are, I guess :(
They should… But on second thought, and looking at your plot, I think it’s actually the same “1st TLB level” as in my results.
From a theoretical standpoint, you just need to multiply the average latency with the bandwidth you want to achieve to compute how much data you need to have in flight, which is directly related to the threadcount.
For example on a 8800 GTX:
Without any TLB miss: 300 ns times 70 GB/s gives about 21 KB of data in flight. If each thread reads 4 bytes at a time, you need 5250 threads, so roughly 330 threads/SM (assuming coalesced accesses).
With only TLB misses, latency is around 500ns, so you will need ~550 threads/SM to achieve the same bandwidth.
Even then, the max achievable bandwidth will be reduced anyway, as it will be consumed by page table entries traffic.
All this is assuming the latency does not increase as pressure on the memory subsystem increases…
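That back-of-the-envelope rule is just Little’s law (bytes in flight = latency × bandwidth, then divide by bytes outstanding per thread). A quick sanity check of the figures above, assuming 16 SMs on the 8800 GTX:

```c
#include <assert.h>

/* Little's law: threads needed to keep enough bytes in flight to
 * sustain `bw_gbps` GB/s at `lat_ns` ns of latency, if each thread
 * keeps `bytes_per_thread` bytes outstanding. (ns * GB/s = bytes) */
static unsigned int threads_needed(double lat_ns, double bw_gbps,
                                   unsigned int bytes_per_thread)
{
    double bytes_in_flight = lat_ns * bw_gbps;
    return (unsigned int)(bytes_in_flight / bytes_per_thread);
}
```

300 ns × 70 GB/s = 21000 bytes in flight, i.e. 5250 threads at 4 bytes each, or about 330 per SM over 16 SMs; at 500 ns it becomes 8750 threads, about 550 per SM, matching the numbers quoted above.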