Since this post covers the current state of the art of the toolset, overall implementation issues, and the architecture, all of interest to anyone embarking on a CUDA project, I thought it belonged here:
Having spent some frustrating time a few weeks ago porting code, I thought I should warn others to keep a VERY open mind about what might be causing a problem. I consider mine a small pilot project, so I expect there are a lot more problems than are detailed here. Sorry if they have been documented elsewhere.
Really the ULF (unspecified launch failure) is the new GPF, only worse, as there is zippo info. Certainly you are likely to get one if you write outside array bounds, but there are lots of other causes, and perhaps this topic should list the ones everyone has found.
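As a minimal illustration (a hypothetical toy kernel, using the standard runtime API of the day), an out-of-bounds write is one reliable way to provoke a ULF, and `cudaGetLastError()` after the launch is about all the diagnostic you get:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Toy kernel: one write per thread. If the grid launches more threads
// than n, the extra threads write past the end of the array, which
// typically surfaces as an "unspecified launch failure".
__global__ void fill(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = 1.0f;   // deliberately missing the `if (i < n)` guard
}

int main()
{
    const int n = 1000;
    float *d_out = 0;
    cudaMalloc((void **)&d_out, n * sizeof(float));

    // 8 blocks of 256 threads = 2048 threads, well past n = 1000.
    fill<<<8, 256>>>(d_out, n);
    cudaThreadSynchronize();   // CUDA 1.x API; later toolkits use cudaDeviceSynchronize()

    // This one string is all the information a ULF gives you.
    cudaError_t err = cudaGetLastError();
    printf("launch status: %s\n", cudaGetErrorString(err));

    cudaFree(d_out);
    return 0;
}
```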
I gave up on developing in 64 bit early on, as it is much more fragile (to be expected, as it is new) and you are much more likely to get a complete system freeze from bad code generation than in 32 bit, though I have seen just adding one "volatile" to an otherwise working kernel cause a power-off grade lockup in 32 bit mode: bad code generation in ptxas plus missing hardware protection. 64 bit is also slower if you are doing any global memory refs, and can use heaps more registers if you do any pointer arithmetic.
If you allocate too much memory before launching a kernel, even one that does not use any global or implied local memory, you will get a ULF. The doco does not say how much to leave free... perhaps the minimum should be allocated when the context is initialised.
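One defensive workaround, sketched below with the driver API's `cuMemGetInfo` (the headroom figure is pure guesswork on my part, since nothing is documented): query free device memory and deliberately leave a margin unallocated rather than grabbing everything up to the limit.

```cuda
#include <cstdio>
#include <cuda.h>   // driver API, for cuMemGetInfo

int main()
{
    cuInit(0);
    CUdevice dev;
    CUcontext ctx;
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    // CUDA 1.x driver API takes unsigned int*; later toolkits use size_t*.
    unsigned int freeMem = 0, totalMem = 0;
    cuMemGetInfo(&freeMem, &totalMem);

    // Leave 16 MB untouched - an assumed figure, not a documented one.
    const unsigned int headroom = 16u << 20;
    unsigned int budget = (freeMem > headroom) ? freeMem - headroom : 0;
    printf("free %u, total %u, allocating at most %u bytes\n",
           freeMem, totalMem, budget);

    cuCtxDetach(ctx);
    return 0;
}
```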
Don’t let your code get too big - I have one example where if I cut and paste a second copy of EXACTLY the same kernel into the .cu file and just change its name, the first copy runs fine and the second one ULFs. Bad code generation somewhere: oddly, the cubin has a bigger constant segment for the second copy than the first, and the second copy uses more registers (spills to lmem) and grows the bincode. That was a 30Kb bincode kernel. A later, smaller kernel is not broken, but putting a small one first in the .cu file can break a larger one. Looks like an initialisation problem in ptxas. I am pretty sure this was reported in version 0.8, when Nvidia did not take the user too seriously; I noted there was a bug in shared mem allocation that might have explained it then - definitely not now.
The whole nvcc process is quite slow and by my extrapolations it would take more than 3 minutes to compile a 2Mb bincode kernel! Assuming all temporary files fit in memory.
Using volatile shared to try to control CSE is dangerous and can result in bad code, a ULF or a system lockup; there cannot be much of it in Nvidia’s test codebase. This CSE thing is a real problem, and if Nvidia are not going to make bincode public then register allocation should be moved back into nvopencc (not inconsistent with the PTX spec). With the way it is set up at present, ptxas needs to be able to undo CSE that the compiler has done to get past the peak register usage point, and there is lots to go wrong. There is quite a lot of recalculation one can afford to do in the time it takes to write a register to local memory and read it back, and one should keep the memory bus as clear as possible. When trading CSE against local storage one needs to know the occupancy the code will be run with, which is currently ignored. When there is a problem in ptxas, one has to divine the answer from one’s application’s symptoms.
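For the record, this is the pattern being discussed (a hypothetical kernel, not code from my project): aliasing a shared array through a `volatile` pointer forces each use to be a fresh shared memory read instead of a CSE'd register copy. The warning above stands - this has also produced bad code, ULFs and lockups, so treat it as a last resort.

```cuda
__global__ void sum_neighbours(float *out)
{
    __shared__ float buf[256];
    int t = threadIdx.x;
    buf[t] = (float)t;
    __syncthreads();

    // Without `volatile`, the compiler may keep buf values cached in
    // registers across these uses; through vbuf, every read goes back
    // to shared memory.
    volatile float *vbuf = buf;
    float a = vbuf[t];
    float b = (t > 0) ? vbuf[t - 1] : 0.0f;
    out[t] = a + b;
}
```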
Don’t try to push registers down too far - it can send ptxas into an infinite loop. Also I note from other posts that clamping registers can cause a ULF - bad codegen in ptxas.
Ptxas uses too many registers. The solution sometimes is to try to think the way Nvidia do: look at the samples and code that way; it often helps, as this is what they have been testing with. It is quite easy to work out an upper bound on the minimum number of registers a given kernel requires (add up the autos at the deepest block nesting point and add 4 for expression evaluation if using MADD, in 32 bit mode), yet ptxas can use twice that through CSE it is not willing to undo - it is the address calculations you can’t see that can cause a problem, as well as what is obvious. That upper bound is really quite generous, and correct colouring should expose savings, as all we are concerned about is the minimum amount of state.
Just had another example (a new bug about every couple of hours I look at CUDA): switching some registers to/from shared in one block caused another, earlier, non-nested block to run 40% slower! Reminds me of the old programming metaphor: fix the taillight and the front bumper falls off. There is no way of working out what is happening with opaque bincode.
SIMD within a warp does not always work, due to code generation bugs. I have one example where one has to insert a __syncthreads() into a single-warp block to make it work on hardware (this example was not a problem with implied volatility of shared memory - I have seen a report of a problem there, back in the 0.8 days). Now I can see nvopencc has inserted a bunch of spurious register loads near the point of failure (the registers already held exactly the same values), and that must have thrown ptxas, as the symptoms looked like a failure in convergence. All this is undocumented and opaque, and it is the typical heisenbug, so it is very difficult to debug.
Also had the inverse, where inserting a __syncthreads() in a perfectly safe section of working code within a single-warp block caused a ULF. I cannot isolate this one; I am just lucky I don’t need the __syncthreads() there. Adding what should be an innocuous sync added 52 bytes to the kernel’s constant segment and over 3Kb to the bincode! In another build config (multi-warp versus the previous single-warp) adding the same sync increased the constant segment by the same amount but reduced the bincode by 1Kb! No wonder the result is busted (in both cases). Looks like the sync might have turned shared pointers into global pointers…
Parallel reduction within a warp sometimes works and sometimes does not. Normal code can also fail if it relies upon shared memory updates when using SIMD within a warp, and there is no doco to indicate when it might fail. When it does not, one needs to use volatile shared (which results in many superfluous shared loads) or insert otherwise unnecessary syncs (only possible if your code is not divergent, and slowing everything down). Nvopencc’s inter-thread dependency checking is broken. Perhaps for now (till it gets fixed) we need a Clayton’s sync: one that tells the compiler to throw out any shared or global values it holds in registers, but does not emit a bar.sync. It could be useful for working around other compiler bugs in divergent code. Normally one would use a null external call, but we can’t put anything that cannot be immediately inlined into a kernel.
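For anyone who has not hit this yet, here is the shape of the intra-warp reduction in question (a hypothetical kernel, one warp of 32 threads). Within one warp the threads run in lockstep, so in principle no __syncthreads() is needed between steps; in practice, as described above, it only works reliably when the shared pointer is `volatile`, so the compiler does not keep partial sums in registers across the implicit warp-level synchronisation.

```cuda
__global__ void warp_reduce(float *out)
{
    __shared__ float s[32];
    int t = threadIdx.x;        // assumes blockDim.x == 32 (one warp)
    s[t] = (float)t;

    // volatile defeats register caching of shared memory between steps;
    // without it the reduction can silently read stale values.
    volatile float *vs = s;
    if (t < 16) vs[t] += vs[t + 16];
    if (t < 8)  vs[t] += vs[t + 8];
    if (t < 4)  vs[t] += vs[t + 4];
    if (t < 2)  vs[t] += vs[t + 2];
    if (t < 1)  vs[t] += vs[t + 1];

    if (t == 0) out[0] = vs[0]; // sum of 0..31 = 496
}
```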
The order of execution of divergent code segments is undefined and undocumented. I had to insert logically redundant conditionals into my code to force the compiler to get the hardware to execute alternate segments in the right order. Not easy to debug; needless to say, the emulator did it the other way round. I have been waiting for months for a reply here regarding divergence and convergence algorithms. I even designed what I believe is the minimal algorithm for implementation in that topic, and simulated it to check: it works well. I have since updated it for correct operation in the presence of real subroutine calls within divergent code. We still desperately need a formal specification of exactly what SIMD means in divergent code; it is not obvious. Because it is not documented in the guide, anything goes, and one cannot tell what is a bug and what is a feature.
Found an occurrence where the compiler generated code assuming one order of execution of divergent segments and ptxas/hardware did the reverse, causing incorrect values to be calculated, as both segments referenced the same shared memory location (written in one and read in the other - not what I coded, just a result of its internal suboptimisations). Now doesn’t that make you nervous?
Then there is the issue of occupancy. The mantra says get it as high as possible; Nvidia don’t tell you that 33% is sufficient to get 96+% of peak performance, and that the best device memory performance is somewhere in the 33-50% range, EXCEPT for warp-coalesced 32 bit reads, which need 100% occupancy. So you can waste a lot of time finding this information out, and trying to reduce registers when it turns out not to be required.
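The register side of the occupancy sum is simple enough to do on the back of an envelope. A sketch, assuming G80-class figures (8192 registers and 24 resident warps per multiprocessor - check your part), reproducing the kind of arithmetic the occupancy spreadsheet does:

```cuda
#include <cstdio>

int main()
{
    // Assumed G80 per-multiprocessor limits.
    const int regsPerSM = 8192;
    const int maxWarpsPerSM = 24;   // 768 threads / 32
    const int warpSize = 32;

    // Occupancy as limited by register pressure alone
    // (shared memory and block-count limits ignored here).
    for (int regsPerThread = 10; regsPerThread <= 32; regsPerThread += 2) {
        int warps = regsPerSM / (regsPerThread * warpSize);
        if (warps > maxWarpsPerSM) warps = maxWarpsPerSM;
        printf("%2d regs/thread -> %2d warps resident = %3d%% occupancy\n",
               regsPerThread, warps, 100 * warps / maxWarpsPerSM);
    }
    return 0;
}
```

On these figures, 32 registers per thread lands at 8 warps, i.e. the 33% point discussed above; pushing below about 10 registers buys nothing, as the warp limit is already the binding constraint.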
The complexity of the interaction between instruction fetch, warp scheduling and arbitration for global memory access means that G80 performance is not predictable. It appears from posts here that Nvidia don’t understand how these things interact either, and nothing at all is documented. We are back to the unpredictable performance of a cached processor. The only way to find out is “suck it and see”, which is very expensive.
The overall message is “If something can go wrong, it will!” It is often not your fault, certainly don’t assume it is, and trial and error is the only way to find the problem (the CUDA shuffle): slow if you have to wait for a reboot each time, and frustrating because so much is undocumented and requests for fill-in information are ignored. So much time is wasted measuring and deciphering what should simply be documented.
PS: I am not posting any of these bugs until Nvidia gets an engineer who knows the hardware well to go through my posts, answer them, and point out where I have the wrong end of the stick and why (and what might be being done, to encourage us to wait). Nvidia’s loss, not mine. No credit for reporting bugs, or providing useful source code, here.