CUDA and Murphy's Law: Some things you may bump into...

Since this post is about the current state of the toolset, overall implementation issues, and the architecture (of interest to anyone embarking on a CUDA project), I thought it belonged here:

Having spent some frustrating time a few weeks ago porting code, I thought I should warn others to keep a VERY open mind about what might be causing a problem. I consider mine a small pilot project, so I expect there are many more problems than are detailed here. Sorry if they have been documented elsewhere.

Really, the ULF (unspecified launch failure) is the new GPF, only worse, as there is zippo info. You are certainly likely to get one if you write outside array bounds, but there are lots of causes, and perhaps this topic should list the ones everyone has found.
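For the record, here is a minimal sketch (not my real code, and the function names are made up) of the classic out-of-bounds write that produces a ULF, and of how you actually have to fish the error out on the host side:

#include <cstdio>

__global__ void oobKernel(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i + n] = 0.0f;   // deliberately writes past the end of the n-element allocation
}

void launchAndCheck(float *d_out, int n)
{
    oobKernel<<<n / 256, 256>>>(d_out, n);
    cudaThreadSynchronize();                 // launches are asynchronous
    cudaError_t err = cudaGetLastError();    // only now does the ULF surface
    if (err != cudaSuccess)
        printf("kernel failed: %s\n", cudaGetErrorString(err));
}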

I gave up on 64-bit development early on as it is much more fragile (to be expected, as it is new) and you are much more likely to get a complete system freeze from bad code generation than on 32 bit, though I have seen just adding one “volatile” to an otherwise working kernel cause a power-off-grade lockup in 32-bit mode. Bad code generation in ptxas plus missing hardware protection. 64 bit is also slower if you are doing any global memory refs, and it can use heaps more registers if you do any pointer arithmetic.

If you allocate too much memory before launching a kernel that does not even use any global or implied local memory, you will get a ULF. The doco does not say how much to leave free… perhaps the minimum should be allocated when the context is initialised.

Don’t let your code get too big - I have one example where, if I cut and paste a second copy of EXACTLY the same kernel into the .cu file and just change its name, the first copy runs fine and the second one ULFs. Bad code generation somewhere - oddly the cubin has a bigger constant segment for the second copy than the first, and the second copy uses more registers (spills to lmem) and a bigger bincode??? That was a 30Kb bincode kernel. A later, smaller kernel is not broken, but putting a small one first in the .cu file can break a larger one. Looks like an initialisation problem in ptxas. I am pretty sure this was reported back in version 0.8, when Nvidia did not take the user too seriously; I noted there was a bug in shared mem allocation that might have explained it then - definitely not now.

The whole nvcc process is quite slow, and by my extrapolation it would take more than 3 minutes to compile a 2Mb bincode kernel, assuming all temporary files fit in memory.

Using volatile shared to try to control CSE is dangerous and can result in bad code, a ULF or a system lockup - evidently there is not much of it in Nvidia’s test codebase. This CSE thing is a real problem, and if Nvidia are not going to make bincode public then register allocation should be moved back into nvopencc (not inconsistent with the PTX spec) - with the way it is set up at present, ptxas needs to be able to undo CSE that the compiler has done to get past the peak register usage point. Lots to go wrong. There is quite a lot of recalculation one can afford to do in the time it takes to write and read a register from local memory, and one should keep the memory bus as clear as possible. When trading CSE against local storage one needs to know the occupancy the code will be run with - currently ignored. And when there is a problem in ptxas, one has to divine the answer from your application’s symptoms.

Don’t try to push registers down too far - it can send ptxas into an infinite loop. Also I note from other posts that clamping registers can cause a ULF - bad codegen in ptxas.

Ptxas uses too many registers. The solution sometimes is to try to think the way Nvidia do - look at the samples and try coding that way; it often helps, as this is what they have been testing with. It is quite easy to work out an upper bound on the minimum number of registers a given kernel needs (add up the autos at the deepest block-nesting point and add 4 for expression evaluation if using MADD, in 32-bit mode), yet ptxas can use twice that because of CSE it is not willing to undo - it is the address calculations you can’t see that cause problems, as well as the obvious ones. That upper bound is really quite generous, and correct colouring should expose savings, as all we are concerned about is the minimum amount of state.
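A made-up worked example of that estimate: if the deepest block of a kernel has five 32-bit autos live at once, the rule of thumb gives 5 + 4 = 9 registers as a generous ceiling; per the above, ptxas (check its report with --ptxas-options=-v) may still come back with nearly twice that once its hidden address arithmetic and un-undone CSE are counted.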

Just had another example (a new bug about every couple of hours I spend looking at CUDA): switching some registers to/from shared in one block caused another, earlier, non-nested block to run 40% slower! Reminds me of the old programming metaphor - fix the taillight and the front bumper falls off. There is no way of working out what is happening with opaque bincode.

SIMD within a warp does not always work, due to code generation bugs. I have one example where one has to insert a __syncthreads() into a single-warp block to make it work on hardware (this example was not a problem with the implied volatility of shared - I have seen a report of a problem there, back in the 0.8 days). I can see nvopencc has inserted a bunch of spurious register loads near the point of failure (the registers already held exactly the same values), and that must have thrown ptxas, as the symptoms looked like a failure in convergence. All this is undocumented and opaque, and it is the typical heisenbug, so it is very difficult to debug.

Also had the inverse where inserting a __syncthreads() in a perfectly safe section of working code within a single warp block caused a ULF. I cannot isolate this one, just lucky I don’t need the __syncthreads() there. Adding what should be an innocuous sync added 52 bytes to the kernel’s constant segment and over 3Kb to the bincode! In another build config (multi warp vs previous single warp) adding the same sync increased the const segment the same but reduced the bincode by 1Kb! No wonder the result is busted (in both cases). Looks like the sync might have turned shared pointers into global pointers…

Parallel reduction within a warp sometimes works and sometimes does not. Normal code can also fail if it relies upon shared memory updates when using SIMD within a warp, and there is no doco to indicate when it might fail. When it does not work, one needs to use volatile shared (which results in many superfluous shared loads) or insert otherwise unnecessary syncs (only possible if your code is not divergent, and slowing everything down). Nvopencc’s inter-thread dependency checking is broken. Perhaps, until it gets fixed, we need a Clayton’s sync that tells the compiler to throw out any shared or globals it has in registers but does not insert a bar.sync. That could also be useful for working around other compiler bugs in divergent code. Normally one would use a null external call, but we can’t put anything that cannot be immediately inlined into a kernel. (See the sketch below for the volatile workaround I mean.)
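A minimal sketch of the warp-level reduction pattern in question, with the volatile workaround (it assumes blockDim.x == 32 and that lockstep within the warp actually holds, which is exactly the part that is not documented):

__global__ void warpReduce(const float *in, float *out)
{
    // One warp reduces 32 values. The volatile qualifier forces the compiler
    // to re-read shared memory each time instead of keeping values cached in
    // registers - the workaround for the broken inter-thread dependency checking.
    __shared__ volatile float s[32];
    int t = threadIdx.x;
    s[t] = in[t];

    // No __syncthreads(): this relies on SIMD lockstep within the single warp.
    if (t < 16) s[t] += s[t + 16];
    if (t <  8) s[t] += s[t +  8];
    if (t <  4) s[t] += s[t +  4];
    if (t <  2) s[t] += s[t +  2];
    if (t <  1) s[t] += s[t +  1];

    if (t == 0) out[0] = s[0];
}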

The order of execution of divergent code segments is undefined and not documented. I had to insert logically redundant conditionals into my code to force the compiler to get the hardware to execute alternate segments in the right order - not easy to debug, and needless to say the emulator did it the other way round. I have been waiting for months for a reply here regarding the divergence and convergence algorithms. I even designed what I believe is the minimal algorithm in that topic, simulated it to check, and it works well; I have since updated it for correct operation in the presence of real subroutine calls within divergent code. We still desperately need a formal specification of exactly what SIMD means in divergent code - it is not obvious. Because it is not documented in the guide, anything goes, and one cannot tell what is a bug and what is a feature. (A contrived sketch of the kind of order dependence I mean follows.)
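Not my actual code, just a contrived sketch of how the result can come to depend on which side of a divergent branch the hardware happens to run first:

__global__ void orderDependent(int *out)
{
    __shared__ int s;   // starts with whatever was left in shared memory

    if (threadIdx.x == 0)
        s = 42;         // one side of the divergent branch writes
    else if (threadIdx.x == 1)
        out[0] = s;     // the other side reads: 42 if the writer side happened
                        // to run first, stale data if the reader side did
}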

Found an occurrence where the compiler generated code assuming one order of execution for the divergent segments and ptxas/the hardware did the reverse, causing incorrect values to be calculated, as both segments referenced the same shared memory location (written in one and read in the other - not what I coded, just a result of its internal sub-optimisations). Now doesn’t that make you nervous?

Then there is the issue of occupancy - the mantra says get it as high as possible. Nvidia don’t tell you that 33% is sufficient to get 96+% of peak performance, and that the best device memory performance is somewhere in the 33-50% range, EXCEPT for warp-coalesced 32-bit reads, which need 100% occupancy. So you can waste a lot of time finding this information out, and trying to reduce registers when it turns out not to be required.
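To put a number on what 33% means here: a G80 multiprocessor can hold up to 768 resident threads (24 warps), so 33% occupancy is 8 warps, i.e. 256 threads per multiprocessor - for example two 128-thread blocks - provided the per-block register and shared memory budgets allow them all to be resident at once.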

The complexity of the interaction between instruction fetch, warp scheduling and arbitration for global memory access means that G80 performance is not predictable. It appears from posts here that Nvidia don’t understand how these things interact, and nothing at all is documented. We are back to the lack of predictability of performance of a cached processor. The only way to find out is “suck it and see”, which is very expensive.

The overall message is “if something can go wrong, it will!” It is often not your fault - certainly don’t assume it is - and trial and error is the only way to find the problem (the CUDA shuffle): slow if you have to wait for a reboot each time, and frustrating because so much is undocumented and requests for fill-in info are ignored. So much time is wasted having to measure and decipher what should be documented.

Eric

PS I am not posting any of these bugs until Nvidia gets an engineer who knows the hardware well to go through my posts, answer them, and point out where I have the wrong end of the stick and why (also what might be being done, to encourage us to wait). Nvidia’s loss, not mine. There is no credit for reporting bugs, or for providing useful source code here.

I just wanted to express my support for osiris.

I plan on using CUDA for my master’s degree in the coming months, but I get more and more apprehensive going through all the bugs and undocumented features that several people have found so far. I’m sure there are several others, academics and professionals, who face a similar dilemma.

It seems to me that Nvidia is working really hard to get the compiler (and driver?) right, but there is some lack of communication between those guys and the ones who really know the hardware. Even when reading (and listening to) the Illinois CUDA course, I’ve noticed some discrepancies and some shady, unexplained bits of information that Nvidia doesn’t like to go into in detail. Depending on whom you ask a question, you get similar (but discrepant) answers.

The attitude seems to be “we know there are strange things happening but we are not quite sure why; we are looking into this, and meanwhile you should follow our recipe for good code practice, which we know should work”. This is fine for getting started, but not so good for resolving complex problems later on. Of course they cannot predict all implementations, and that’s what robust compiler/driver technology is for.

People will get frustrated if they spend most of their coding time chasing bugs that are not their fault. One can only hope that this is a compiler/driver maturity issue that will be resolved very soon. Otherwise we will be flying blind most of the time, and some of us may give up even before take-off.

Therefore, I really hope someone from Nvidia can give some definitive insight and clarification on all these issues - not only those listed here but the others left unanswered throughout these forums. It would be reassuring to know that Nvidia is aware of these shortcomings and has people working on them. I guess a simple statement acknowledging the issues discussed here would be a first step towards encouraging everyone to really believe in CUDA.

I also support osiris.

It looks like nVidia is setting a higher goal in compilers than they’re capable of reaching, and they don’t give you enough information to even examine what’s wrong. The PTX spec is not really useful with a buggy ptxas. A not-optimizing-at-all compiler would be perfectly fine for us, and could even yield better performance, if there were enough documentation for the coder.
Also, their compiler group seems really arrogant. It’s impossible to turn off “optimization” in nvcc, or to silence its warnings, and ptxas goes wrong more often with -O0 than with -O4. Ironically, the first statement in the Illinois CUDA course is that one should place correctness before performance. I know they may not really be that arrogant, but it at least looks like it for now.
Maybe nVidia should learn something from Intel, who provide coders with both a detailed processor spec and a compiler that offers good optimization along with options to turn it off partially or completely.
As for the documents, they’re not objective enough for my taste. In the performance section they keep telling us what they THINK IS GOOD practice, not what their hardware actually DOES. Intel’s docs also give a lot of suggestions, but they give completely objective specs first.
The technology is really capable, but it’s impossible to harness it fully without exposing enough knowledge to the people.

I really believe this is just a maturity issue. Their Cg compiler has come a long way in the past years and has become fairly capable of producing faster code than GLSL, for instance. The thing is that G80 is totally new ground for them, as it is for us, and Nvidia is something of a newcomer to this general computing world. I wish we had some kind of progress report of bugs being solved and expected release dates for future versions.

I totally agree. We need a better scientific understanding of the hardware operation. The way CUDA is designed, we cannot overlook several hardware implementation issues that may hamper our programs, hoping that the compiler and driver will work their magic for us (which they don’t right now, and may never do).

For now, with no real feedback from Nvidia, I can only hope that all these things are being taken into consideration. I don’t know if this is still just a marketing/competitive issue, not revealing inside specs of their hardware/compiler/driver technology. All I know is that if Intel can do it, I believe Nvidia - a similar capable company - should spend an effort doing it as well. Again, perhaps it is just new grounds to them and they need some time to let the dust settle.

My greatest fear with CUDA (apart from being totally slowed down by bugs) is the lack of understanding and clarification on several issues at this time. It leads to us producing code that we learn to make work on the current generation of both compiler and hardware. If tomorrow Nvidia releases version 1.1 and fixes several issues, and/or the next generation of cards removes several hindrances that G80 has, we may have spent months fighting something that no longer needs to be fought. Our programs and algorithms would simply become outdated (and probably need a total rewrite).

Honestly, I would love to cut Nvidia some slack. They have been doing a wonderful job with their hardware (haven’t heard of better ATI cards for quite some time :) ). CUDA has certainly been a great step in the right direction, not only for them but for the industry and the entire research community. Being an engineer myself, I’m fully aware of the sheer amount of effort that goes into designing and manufacturing cards like the G8X - not to mention the drivers and toolkits, which are certainly the best out there (I really hate ATI’s shader bugs, crappy compiler and drivers, stupid control panel, etc). We know it takes time to perfect these things, and although the “customer is always right”, he does tend to be impatient as well :rolleyes: .

Unfortunately, in the end we - the community - need (and want) to fully grasp this new technology, for all its potential and proven capabilities. And for that, we need better documentation, hardware specification, drivers, compilers, and so on. Sorry Nvidia, right now we just need more support on that. That’s all.

Thanks for all the feedback!

All I can say here is that we are aware of the compiler issues and the software team is working hard on the next release.

I would encourage everybody to file bugs on any issues they find, making sure to include complete code that reproduces the problem. This is the only way things will get fixed.

If you are not a registered developer and think you have found a serious bug, please e-mail me or one of the other Nvidia engineers on this forum and we will try and help you out (most of us are reachable as firstInitial+lastName@nvidia.com).

Well… Thanks to Simon.

But a major problem is: mostly when I encounter a problem, I can’t reproduce it in a reasonably simple program. This is likely also osiris’ case.

For example, last time I wrote an infinite loop to test a condition in a kernel:

int cl, cr;
// blah
while (cl || cr);   // intended to hang the kernel whenever the condition holds

The condition held, but the kernel didn’t deadlock and fail at the 5-second limit - it just returned with “no error”. I spent an afternoon debugging elsewhere. After that, I failed to reproduce it outside that kernel, and that kernel won’t run without 10 other kernels and many megabytes of data, so it’s a no-no to report.

I’m familiar with compilers, if a spec of ptxas is available, I might be able to do something, or at least succeed in creating a small reproducing program to report. Would you please release something like that?

I would be very interested to know what platform you are talking about - linux or windows, AMD64 or IA64.

To second Simon, we are fixing the bugs that we are aware of. Code that reproduces problems is essential to the bug process. We can’t search for potential bugs by writing code from scratch based on general descriptions. Not to single anyone out, but if a problem cannot be isolated/reproduced with a smaller test program, there’s a good chance that the bug is in the app (we’ve seen this with several bug reports). Then again, the compiler isn’t perfect either.

To Osiris: withholding bug code doesn’t hurt just NVIDIA. It hurts the rest of the CUDA community, as it only delays possible fixes. It’s easy to list summaries of issues one has run into; it’s harder to create nice repro cases. So I think it would be great if you provided code that backs up the issues you observed, rather than resorting to “blackmail” tactics. As Simon mentioned, if you can’t post the code publicly for any reason, you can always email (or send a private forum message to) one of the NVIDIA people.

To asadafag: infinite loops are often optimized out if the compiler detects that the loop does not affect the output (whatever gets written to global memory). This might have been your case, which would explain why the kernel just ran and exited without an error. Another related case would be writing a program that doesn’t write anything to global memory - such programs are often compiled to an empty kernel.
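For example, a kernel shaped like this (a made-up sketch, not anyone’s real code) never writes to global memory, so the compiler is entitled to remove the loop entirely and the kernel returns immediately with no error:

__global__ void looksInfinite(int flag)
{
    int x = 0;
    while (flag)      // looks like it should hang whenever flag is non-zero...
        x++;          // ...but x never reaches global memory, so the whole loop
                      // is dead code and can legally be optimized away
}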

Paulius

Thanks for explaining that. Well, I’d be grateful if that “optimization” were removed in later versions (though that’s not likely). Basically, it can’t even be called an optimization, since it changes the program’s outcome and affects correctness (sometimes we do define a crash to be “correct”).

Besides, when someone writes an empty infinite loop, it is most likely intentional. An empty infinite loop is a great debugging device when only a small number of distinct errors can be returned, as in CUDA or ACM/ICPC.

I think you’re right about the optimization removal not happening in the future. The thing is that for release code, removing instructions that do not contribute to the output is a valid and useful optimization. It’s not unique to CUDA. For example, Intel’s C++ compiler optimizes away code if it doesn’t affect the output (one example would be a loop that computes the same value over and over, for timing purposes), so you have to “trick” it by calling functions and disabling cross-block optimizations.

Out of curiosity - for what purpose are you using the infinite loop? Maybe there’s an equivalent workaround we can come up with.

Paulius

All the issues mentioned are device problems and were seen using 32 bit mode (32 bit PTX). My platform is Linux x86_64 AMD but that is not relevant to these issues. Perhaps some ULFs are due to the 64 bit driver specifically, but that is not likely.

I would not call my request blackmail - I estimate that it will take more than a man-day to prepare repro cases for all these bugs, and I want to see Nvidia willing to spend an equivalent amount of time helping me. I have already spent a lot of time, and have shelved the project for now as the return in increased performance for the time invested is very poor. None of the bugs found were showstoppers, and I am confident I have attributed them correctly as I have workarounds for them all. After the above I vectorised all my code using the Intel compiler (recoding the bottom-level loops in Intel’s canonical form) and attained better than a 2x overall improvement in performance within a few days - orders of magnitude better productivity. Yes, the Intel compiler is buggy too (bad code, vector libs), but it was easy to work around. My particular app is logically complex and comfortably fits into L2 cache (over minutes) on an x86 processor, so it does not show off CUDA that well. All I can expect is that a quad core running vectorised code is half a GTX - not worth the torture of redesigning, recoding, debugging and maintaining two source implementations. I will look at it again when the G92 is released, as the G80 is not commercially useful to me, and likely there will be significant changes that invalidate what has been done already. We need information about upcoming releases to avoid wasting time.

On the issue of the amount of time spent - I do think Nvidia is arrogant in valuing its engineers’ time more than its customers’. If Nvidia is aware of these compiler issues then there should be an errata web page for CUDA that lists all the bugs found or reported and qualified, with suggested workarounds. It would save everyone time, and save having to prepare repro cases for problems that are already known. The strong marketing drive at NV makes me think this would be deemed damaging - I think it is more damaging not to have it.

From my perspective the hardware needs fewer threads that run faster and have more resources (on chip shared & registers) to make it more useful for general computing. This week I noticed Intel Research talking about incorporating 3rd party cores into their processors in the future. Seems NV MPs are a contender and since the buyout of Nvidia by Intel did not materialise after the market rumors earlier this year - is anything happening? How soon?

The NV MP core certainly is so much more flexible than the Intel SSE/MMX facilities and development could be so much more productive if it was integrated this way.

NV should take a look at the coprocessor communication facilities (DMA) in the Cell to see how they should be organised. NV are held back by the poor IBM PC architecture, still so much more could have been done.

Eric

Well… the thing is, an infinite loop does affect the output (i.e. it prevents any output from being generated at all). Even the Intel compiler doesn’t optimize that out, I think.

My use of infinite loops is, well, something of a tradition. Come to think of it, CUDA is surprisingly like the online judge systems used in ACM/ICPC training: you submit a bunch of code and get an obscure result - Accepted (correct result), Wrong Answer (incorrect result), or Time Limit Exceeded (the launch timed out). Since once you get an “Accepted” you mostly no longer need debugging, the ability to choose between WA and TLE (which usually involves infinite loops) becomes really useful. Back in the old days, some crazy guy even downloaded an entire dataset via infinite loops. So at least to me, “optimizing” them out is rather a pity for a compiler.

Just one more thing: has nVidia tested whether these “optimizations” make anything run faster at all?

I see. You could probably write an infinite loop that tricks the compiler; it just can’t be something as simple as “while(1);”. If you really need it, you could try something like:

volatile int x = 0;            // volatile stops the compiler from eliding the loads
while (x != 1)
    x = global_array[0];       // spins as long as global_array[0] is never set to 1

You just have to make sure that global_array[0] is not set to 1. Though I must say I am still skeptical about the applicability of infinite loops in practice. If you don’t want output, there are other ways to achieve that (such as not writing anything to global memory).

Yes, optimizations are tested. For example, one optimization reduced execution time for one of my kernels nearly in half.

As a side note, good to see references to the ACM ICPC - I was both a team member (UCF) and later a coach. To be fair to the contest, TLE often indicated that a brute force solution wasn’t satisfactory.

Paulius

Well, thanks for that…
The thing is, what you’ve described is your case, and unfortunately I may be of exactly the opposite style. For example, I like to use WA/TLE to extract information about the judge’s machine and the test data (in desperate situations), I like to write uber-hand-optimized kernels, and I like to exploit strict floating-point arithmetic.
I know I’m a strange person, perhaps a lunatic. But at the very least, including uber-hand-optimized kernels in optimization testing may be worth the effort. Optimization tends to stand in the way of such kernels, and they can achieve more performance than even the processor vendor’s compiler can extract from a plain kernel.
By the way, a quick and dirty way around all these issues would be to support inline PTX code and make the -O0 option of ptxas actually work. That way, everyone would be happy.

I understand. Unfortunately, hand-optimizing is practical for only the smallest pieces of code, due to its low productivity. It also doesn’t accomplish as much as it used to (back in the day), since the majority of CPUs now execute instructions out of order - a P4 or C2, for example, further optimizes your hand-optimized code by reordering instructions.

Paulius

Yes, but for P4 or C2 we have a reasonable understanding of what’s available at the hardware level and a model of why things work the way that they do - so recently, I managed to get a 3-4x improvement through a careful combination of unrolling, pipelining and operation selection (as well as carefully choosing algorithms to fit what was available at the low level). It helped a lot to have Appendix C (Instruction Latency and Throughput) of the “Intel 64 and IA-32 Architectures Optimization Reference Manual” up on the other monitor - not just for final tweaks, but during the design process as well.

This kind of tweaking (in this case) never involved any hand-written assembly language, just source-level optimizations and a lot of reading of assembly and plenty of performance measurement. It fell much closer to the ‘coding in C’ level of difficulty/productivity than it did to the ‘coding in asm’ level.

I hope that at some point NVIDIA management decides that a similar level of documentation is appropriate, or at least reduces the level of secrecy about hardware details. One wonders how well received a new general purpose processor would be if its documentation told us that ‘you should try to achieve temporal locality in the L1 cache, which is 16K’, without providing any other information about associativity, line size, latency, miss policy, etc.

At risk of sounding like a broken record on this, even a simple reenabling of the text-format ptxas output (cubin in ‘assembly-like form’) would go a long way to helping us extract better performance from G8x and its successors (and it’s not like we’re going to storm NVIDIA’s headquarters with torches when details change on subsequent architectures, we’re all grownups, sorta).

Geoff.

I agree with the sentiments above. The documentation for CUDA at present really really holds it back for GPGPU.

From a beginner’s perspective, there is no proper description of exactly how threads work; a full worked tutorial with a bit more depth than MatrixMul is needed to let new users quickly learn how to program CUDA.

From a more advanced user’s point of view, we need to know what is going on in the background: what the architecture we are developing on is actually doing, and how our code is being optimised. Without that we will never achieve the fastest code. People are used to programming for x86, and we now expect the documentation to be as good as the x86 documentation and the documentation for something like gcc.

It would also be helpful, when a function has been deprecated from one version to the next, if that were documented, so you could work out why your code wasn’t working (I’m thinking of CUDA_DEVICE_INIT, or whatever it was, between 0.8 and 1.0).

Chris