I have written a “huge” kernel program, just below the maximum allowed size by the compiler.
It has a lot of “if” statements and for loops. I use almost 16 Kb of shared memory. Each block uses only one thread. If I can compress the shared memory to 8K, I will be able to use 2 threads ( I am right? ).
I know it is actually “not done” to use if statements and loops on a GPU because it slows down but still I expect some descent performance.
My GTS is running at 600 Mhz. So I roughly expect the GPU to be 6 times slower than the CPU when I run the same algorithm. But I notice a factor 100.
So I am searching and try to find ways to optimize.
Does somebody have some general hints to make it faster.
You are using only of a small fraction of the available processors on the G80 and probably spending most of the time waiting for data (the processor has no other threads to run while is waiting for memory reads/write).
Massimiliano, I was thinking about starting a topic on “rules of thumb” to assist people who are starting off. One is the naive port - what sort of comparison in speed between a current say core2/athlon core and a G80 core single threaded? I know it depends upon memory access, however if the G80 device memory bus is mostly idle you are not going to have much latency. Or is there a significant latency even if the device memory bus is idle? (factor of 100 sounds much worse than my guess)
Thanks Massimiliano but there is not much there that is not in the guide and samples, as I found last time I looked at this lecture series. Tree scanning is better explained and that is helpful. Hieronymus can run 1 thread per multiprocessor (I think of these as G80 cores not the individual ALUs). This issue of device latency is the only game in town when it comes to porting performance (easy to minimise shared mem bank conflicts, nothing you can do about register conflicts). There appears to be no reason for the quoted 200-300 cycles IF there is one thread running on a G80, or even 1 on each core as they are unlikely to clash. Hence my question regarding device memory access latency when the device memory bus is idle. If the implementation is perfect then it can be as fast as register access in this situation (there are 3 memory clock times to get the resulti - it might take 4 for DDR3 random access but I am not that familiar with DDR3).
ed: hmmm… no answer so I presume your answer still stands, I can see one of my fundamental assumptions being blown out of the water here. What you seem to be saying is that device memory cycles are allocated on a fixed round robin basis for all warps (24x16 = 384 on an 8800GTX), even if the warp is not processing, or even included in the current kernel runtime configuration. Please advise!
ed: make that half warps as coalescing does not work over 1/2 warp boundaries, so 768 memclocks to get around to each 1/2 warp in turn…
I do not think it is even physically possible for a value to be obtained from the gpu memory in 3 clock cycles based solely on the distance the information must travel. It is actually a pretty long way to the memory chips and back and even at “light speed” (not nearly as fast in the wires as in a vacuum) information can’t actually go that far in a nanosecond (one clock).
Most of the 200-300 cycles are probably due to simple physics, not whatever the memory controller is doing.
Registers are fast because they are close to the processor. RAM is slow because it is far away.
Your argument is bogus. All DRAM (including GDDR3) has significant access latency, some of which is due to the memory itself, some due to the tasks the memory controller must perform to address the data, and very little due to physics. Light (and voltage potential along a wire) travels approximately one foot in one nanosecond. The physics of the voltage potential propagation does not account for most of the 200-300 cycle latency.
Notice the distance indicated that information travels in one clock. I do not know how this distance was calculated, but these are Ian’s slides and Bill Dally’s picture, both people I would tend to trust. If this picture is to be believed, in fact, information is moving only a fraction of a chip’s length every clock, not one foot as a naive calculation using the speed of light would suggest.
The slide may well be correct about propagation across a silicon chip, but as you hinted, we’d need to know how they decided on the length of the “1 clock” line. In any case, the slide doesn’t back-up your earlier statement.
The reason why the slide doesn’t matter is you claimed memory latency is due to the distance between the GPU and RAM - a distance which is not connected by a continuous line of these chips. The connection is made by copper conductors in the circuit board of the graphics card, where the voltage potential propagates near the speed of light.
The latency is not mostly due to physical distance. DRAM requires quite a bit of work to address and access. See the following:
My statement of the latency being MOSTLY due to this effect was incorrect, but my original point, that it would be impossible to get a value in 3 or 4 cycles from RAM based on physical considerations alone stands - which was my original point in bringing this up. Information has to get from somewhere inside the GPU to the pin and then across the wire to the RAM, from somewhere in the RAM to the pin, across a wire to the GPU and then from the pin of the GPU to a register. So probably 10-20 clocks are due to purely physical considerations. Even if it only takes one clock to go across the wire between the two chips.
Even if the device memory bus is idle, there is still the memory latency, and the memory controller, so it is never going to be as quick as on-chip registers.
I believe GDDR3 (Wikipedia GDDR3) is similar in timing to DDR2 (Wikipedia DDR2) . The ‘top-level’ clock advertised e.g. 1800-MHz is the peak transfer rate, and not the access time or latency. I believe the latency is variable, depending on the addresses of recent accesses.
Based on an example from DDR2 memory may be quoted as 5-4-3-6-7, where the numbers stand for:
5 = CAS Latency (CL) - Delay between activation of row and reading of row
4 = RAS to CAS (or Row to Column Delay) tRCD - Activates row
Latency = Bank Activate (tRCD), the Read (CAS) delay and the Precharge Delay (tRP) = 4+5+3
= 12 * 4 = 48 memory cycles.
Then the data has to get clocked in and fed through the memory controller, which may hold it until all of the burst has arrived.
A few more cycles may get eaten if the G80 memory controller goes for longer transfer bursts (I think this is an option, but I’m really on thin ice here).
Presumably, there is also contention for memory access for program code, and maybe to video memory (or is GDD3 dual-ported? I’ve never discovered that.)
I am not suggesting that this accounts for ‘200-300’ cycles on the G80, but I think it could explain a chunk of it. (Memory latency on a traditional CPU like an Athlon 64 can be 100’s of processor clocks, so I’m ready to believe 200+ is the actual figure, I just don’t see how yet.)
Where was this mentioned? I would like to know if that is the case too. It sounds simple to implement in hardware, would guarantee ‘liveness’ on a heavily loaded system (and a lightly loaded system would care less), but may impact some of my thinking.
As you have pointed out before, we are working in the dark. It would be easier with full Instruction set and architecture manuals for the G80.
Thanks Garry for your thorough analysis. The 48 memclocks latency translates to 16 GPU clocks which is a fraction of the 200-300 quoted. It has not been said anywhere that this is how the G80 cards work, just me trying to put all the pieces of information together so that I can understand and make sensible design tradeoff decisions. I was going to start a new topic on the device memory subsystem and coalescing as I believe the manual’s treatment leaves much to the imagination (while it is complete for shared memory). This is more important than shared memory in larger applications. I am slowly coming to the conclusion that each G80 core time slices 2*24 ways, irrespective of the number of warps running, in a fixed manner and that the memory subsystem probably does the same. When you make this assumption things seem to add up. Certainly time slicing the memory subsystem avoids what would be a difficult metastable state problem.
Example: a float read is 512 bits in a 1/2 warp, that is two bites of the cherry on a 384 bit bus, if it gets round robin scheduled then it will take on average 1.5x cycle time in latency = 768 * 575 / 1800 * 1.5 = 368 GPU clocks, now this is a bit long so perhaps the cycling is per warp and each of the slots could be stretched arbitrarily depending on whether the mem controller has already fetched or is about to fetch inpage or perhaps a whole new cycle is required. Other questions come up - is it one 384 bit bus or 3 * 128 bits with separate address buses …
Just guessing, seem to have to do a lot of that here. Todate I have been stonewalled whenever anything gets close to the implementation. We are having to program close to the metal here, without the necessary detail.
Sorry to be pedantic, but could you please give me a pointer to nVidia saying it is 200-300 GPU clocks? I have always read this as memory clocks.
Just to avoid misleading anyone, the numbers in the latency estimate were just a plausible order of magnitude, chosen to make it easy to see where the calculation comes from.
I have taken some numbers for GDDR3 from Samsung spec sheets; I haven’t been very diligent and worked through Hynix, infineon, etc, and the spec’s I looked at couldn’t reach 1800MHz, so this is still only representative, and not definitive for nVidia cards.
Add a few clocks delay for the memory controller, and it still comes in around 1/2-1/4 of the ‘200-300’ cycle delay.
Sun’s Niagra (I haven’t read the details) handles the mismatch between CPU speed and memory speed by simply putting threads ‘to sleep’ (Hyper-Threading) and then waking them up when the data is ready (this is in hardware, at CPU clock speed). I must admit, I like this approach. I had assumed this is the same approach used on the G80, but we still haven’t enough information to explain where the cycles go; the data should be ready much faster than the 200-300 cycles.
I assumed at least 3 x 128 bit memory buses because nVidia never claim greater than 128 bit reads, and hence one 384 bit (320 bit) bus would waste bandwidth and power. It maybe more buses, i.e 6 x 64 bit buses, though that’s 3 more sets of address and control pins, whereas a 4 pin ‘chip select’ would give 32 bit reads anyway and keep power down.
8800GTX with 6x64 bits would make some sense, though, as my 8800GTS has only 320 bits, which would be 5x64 bits. I haven’t the nerve to take it apart to try to see :wacko: .
Again, it would be nice to know how many independent, concurrent, 64 bit or 32 bit memory accesses are supported by the G80.
I must admit that the manual is not explicit about GPU clocks, however that is the only version of clocks quoted from end to end of the manual so it seems reasonable and sensible to quote GPU clocks. When analysing your program everything is in GPU clocks. I made this statement elsewhere on this forum and it was not contradicted (that does not mean much).
My example calc was of course incorrect, after posting I noticed, but though it might encourage a reply from Nvidia. If we do have 768 memory access slots available at the memclock freq then the average latency would be just 1/2 * 768 * 575 / 1800 = 122 GPU clocks as no doubt once a 1/2 warp has the bus it does it’s current instruction’s worth of accesses in 1 go. Perhaps it gets thrown off if it goes out of page.
When you do the pure bandwidth calculation the access time for a memory bound app is 327 GPU clocks for a float on 8800GTX (assuming 100% occupancy - but that may not matter!)
All this is so important for porting as the difference between getting your memory access right (aligned and coalesced) and random float reads is nearly 2 orders of magnitude in memory bandwidth! (measured elsewhere on this forum) So read that as, this makes the difference between a port succeeding and failing.
If there is a fixed latency for device mem access than we all have to shoot for 100% occupancy and that means <= 10 registers which limits the complexity of your kernel… then you have to break up your app into lots of calls and then you are hit with kernel launch time latency. The tradeoffs are still not clear!
Nvidia can we have some guidance please.
ed: I must say I originally assumed that the hardware scheduler just went round robin from one warp to the next among its max 24 as soon as there was a stall (which could be a register conflict or device mem access) but it is not safe to make any assumptions.
further ed: measured float read performance on 8800GTX (fully aligned and coalesced) is about 70Gb/sec giving a real world float read cycle time @ 100% occupancy of 400 GPU clocks. So I think the number in the book is in GPU clocks. I guess that float3s should give best performance on a 384 bit bus but that might not be the case. All depends upon the coalescing and bus logic???
ed: the above measured performance of 70Gb/sec is inconsistent with the statement that coalescing does not happen across 1/2 warp boundaries (you only get a theoretical max throughput of 57Gb/sec - 2 mem cycles required to get 512 bits). However if you allow the mem controller to hang onto the last read and hand part of it back to the next 1/2 warp - like a prefetch then we are home OR indeed if there are 3*128 bit buses. The important thing here is that alignment is not nearly as critical as one might think and that sequential access is by far the most important. Again need details of the mem subsystem to make these decisions!
All this is relevant to the original topic. Performance of a naive port…
What has just now surprised me is that all G80 hardware slowed down by a factor of two with the 0.8.1 release and there has not been any comment! Section 6 of the 0.8.1 manual says global read latency is now 400-600 clocks and minimum instruction execution time is 4 clocks. Now are these the same clocks that we are used to (575MHz on 8800s?) or a new 2x clock? I missed any mention when I went looking…
Now this does agree better with my calculations but still we need to know what is the effect of reducing occupancy upon global read latency.
Some other numbers that would help are: out of page latency (for a new memory cycle as calculated by Garry, in GPU clocks) and also inpage latency (breaking a burst read but staying inpage) and then also what is the memory page size for 8800GTX, 8800GTS and 8600s?
Given this information and details of coalescing (current documentation is misleading) it should be possible to work out the various tradeoffs and actually do some system design.
Again can we please have some help - I am sure there are many interested members here.
ed: OK yes everything is spec’d @ 1.35Ghz now, so we are back to my calculated float read cycle time of 800 GPU clocks compared with the spec of 400-600.
ed: Actually I think it is 1.15GHz for GTX (& 1GHz for GTS) so Chapter 5 is wrong (as were the old manuals)
ed: Another query - there has been no change for the clock() function in the new manual - which manual is correct - the old one or the new one?
ed: Answer is both, just that the 0.8.1 manual does not apply to the current software :huh:
Bump - since last posting my topic http://forums.nvidia.com/index.php?showtopic=35026 answers hieronymus’s original question, one would expect a single thread on a G80 to be about 1/80th the speed of a single core core2/athlon when CPU bound in the cache (1/2.5 x clock and 1/32 for the hardware scheduler at this low occupancy). The rest is probably the lack of cache for global refs on the g80 - would have expected more of a hit here.
In the above mentioned topic the useful result that in a CPU bound app one can get close to 100% utilisation of a MP core with 33% occupancy, means here we have 32 registers to play with. My biggest problem is running out of registers.
This leaves the issue of how does global memory latency in the memory bound app depend upon occupancy?
If not memory bound we need to know whether it is a round robin system as it does mean that one has 200 free instructions between global references, can affect how one structures code. If a thread just misses its slot then does it have to wait another 800 clocks?
Does anyone care to decode the details of the memory subsystem using a technique similar to the instruction timing topic to get some useful results? Don’t need to know exactly how it works just build a model that is useful to predict, without having to suck it and see for every app.
I guess it is obvious that nvidia aren’t going to help here - odd since performance is the only reason for using GPGPU - everyone here should be interested in this topic. The core question is behavior of the memory subsystem at lower occupancies. It seems no-one took it up and ran the required tests, so now that I have running hardware I ran them and can give a short summary (of around 500 samples across varying occupancy, number of blocks/threads, alignment, 32-64-128 access, sequential, inpage and wide random). Some unexpected results that everyone should be aware of. Shows how much the guide is a marketing document. All these comments for a memory bound program:
G80 hardware is specially optimised for reading 32 bits items sequentially and using these tweaks requires warp wide coalescing alignment (align 128) NOT 1/2 warp as it says in the guide. All other modes of access only require 1/2 warp alignment for maximum performance. I believe this is the correct optimisation as it affects all local register reads, unfortunately writes do not benefit from whatever it is.
The real sustainable read rate is about 80% of the quoted marketing rate (70Gb vs 86Gb on GTX) as mentioned above.
Coalescing hardware is pretty simple as any misalignment destroys all coalescing instead of just causing one extra memory cycle for the warp wide access. The resulting transfer rate rate is pretty much the same as the inpage random rate. This really needs to be addressed in hardware before a stack becomes useful.
Memory pages are 32Kb on 8800 and one can expect 3x the transfer rate random inpage compared with wide random. Definitely worth planning for locality on this scale.
Writes are asynchronous, can’t tell if the buffer is more than 1 deep, does not seem any reason for any more. As a result of this write performance is relatively flat across a wide range or occupancy, thread and block configs.
Don’t use 64 and 128 bit operations for sequential read access. It’s busted somewhere.
64 and 128 bit write operations do give some measurable improvement for sequential access.
100% occupancy is only required for the operation of 1. above. Otherwise there are all sorts of odd behaviors - like write throughputs (except for 32 bit) tend to drop marginally as occupancy goes up above 50% - hardware bug here? The general rule seems to be 50% occupancy with 1 block per MP gives best performance, so 21 registers (or 20 given we can only set mod 4 - what a pain that one can’t set 10, even for testing).
OK a few measurements - if we set fully coalesced 32 bit reads as 100% then:
384 threads 1 block/MP (50% occ) = 70%
fully coalesced 32 bit writes are 63% from 12% occ to 100% occ and most thread block configurations (couple of dips in the middle)
misaligned sequential and inpage random 32 bit read = 11%
as above for writes = 7%
random out of page 32 bit reads and writes = 2.4% (1/2 that @ 100% occupancy)
inpage random 32 bit read 10%
inpage random 32 bit write 9%
more for 64 and 128 bit…
wide random 128 bit reads and writes 9% (not 33% as I was expecting)
Note on the previous posts the 32 bit fully coalesced read cycle time is actually 950 clocks - so round 1000 clocks for that special case only and much more for everything else.
ed: was slightly optimistic on a couple of numbers