Strange CUDA build problem [SOLVED]

I have some code that I’ve been developing on a Linux machine with a 780M (compute capability 3.0), using CUDA 8.0 on Fedora 23 with the 375.39 driver. On this dev machine, all the code in the project runs perfectly.

I’ve just come across a strange issue: one particular kernel in the project fails with no errors whatsoever when running on a GTX 1070 (compute capability 6.1, Pascal) on Ubuntu 16.04 with the 375.39 driver, if I specify a compute capability > 5.3 or if I do not generate PTX.

I’d really like anyone’s input as to what on earth could be going on here. What am I doing to make this one kernel (of many) fail?

This seems to be a contradiction in itself. What constitutes failure, if no errors occur (either error status reported by CUDA or erroneous results in output data)?

Based on the information provided, there are two possibilities here:

(1) There is a latent bug in your code that is exposed by building for different architectures
(2) There is a bug in NVIDIA-provided software, e.g. compiler or driver

Out of the two possibilities, (1) is more likely. To narrow it down, rigorously check the status of all API calls and kernel launches. Run the application under cuda-memcheck. Use the debugger, code instrumentation, code bisection, and other standard debugging techniques to track down the source of the problem. Since we do not know what the code is, there is also a possibility that the root cause is outside of device code, i.e. somewhere in the host code.

While I greatly appreciate the coding 101 lesson, I have indeed, as I always do, run every check and debug tool under the sun, and ALL CUDA API calls in the code are checked for their return status. cuda-gdb has been used to meticulously step through every line of code, and cuda-memcheck et al. have all been run. The same is true of the host code with gdb, valgrind et al. There are ZERO errors, memory leaks etc. I hope that puts an end to any further lessons on basic software development.

I’ll try to be a little more specific so I don’t sound so much like I have no idea what I’m doing.

Regarding the “will fail with no errors” comment: only when running on a 1070, if I compile for > 5.3, not one single line of code in the kernel in question is run, but a CUDA check after the kernel launch always returns success. The same is true if I don’t compile with PTX.

The kernel in question runs an image processing filter. The filter runs perfectly on a 780M (Compute Capability 3.0) and a Titan X (Compute Capability 5.2).

Regarding point 2, I’m not presuming or asking anyone if they think there is a bug in the compiler or driver. I’m trying to figure out what the problem might be conceptually.

With regard to hardware/architecture compatibility, a 6.1 device has at least as many resources as a 780M or Titan X: more shared memory, more registers per thread, etc. The 1070 in question also has more global memory (which is irrelevant, since the kernel uses a constant hundred or so MB regardless of image size). The number of registers used, shared memory per block, etc. are all within stated limits.

In case anybody else has a similar issue: the problem turns out to be that for any compute capability > 3.x, nvcc (the version I have with Nsight 8.0) decides to assign 121 registers per thread to this particular kernel, which is launched with 1024 threads per block. That is clearly impossible on any current architecture (121 × 1024 = 123,904 registers, against a limit of 64K registers per block).

Using launch bounds directives, nvcc assigns 32 registers to that kernel, and it now runs just fine on the 1070.

I have to say that I’m surprised nvcc couldn’t figure out that the number of registers per block was impossible when the kernel is launched with the number of threads per block given as a const.
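
For anyone hitting the same thing, the arithmetic and the fix can be sketched roughly as below. This is only a minimal sketch, not the actual kernel from this thread: `scaleImage`, `kThreadsPerBlock`, and `fitsRegisterFile` are hypothetical names, and the 65536 default assumes the 64K registers-per-block limit quoted elsewhere in this thread.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical block size, matching the 1024 threads per block mentioned in the thread.
constexpr int kThreadsPerBlock = 1024;

// Host-side arithmetic only: does regsPerThread * threadsPerBlock fit in the
// per-block register budget? The 65536 default is an assumption (64K registers per block).
constexpr bool fitsRegisterFile(int regsPerThread, int threadsPerBlock,
                                int regsPerBlockLimit = 65536) {
    return regsPerThread * threadsPerBlock <= regsPerBlockLimit;
}

// __launch_bounds__ tells the compiler the largest block size the kernel will be
// launched with, so it has to keep register usage low enough for that block size.
__global__ void __launch_bounds__(kThreadsPerBlock)
scaleImage(const float* in, float* out, int n, float gain) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = gain * in[i];
}

int main() {
    // 121 * 1024 = 123904 registers, which exceeds 65536; 32 * 1024 = 32768 fits.
    printf("121 regs/thread fits: %d\n", fitsRegisterFile(121, kThreadsPerBlock));
    printf(" 32 regs/thread fits: %d\n", fitsRegisterFile(32, kThreadsPerBlock));

    const int n = 1 << 20;
    float* in = nullptr;
    float* out = nullptr;
    if (cudaMalloc(&in, n * sizeof(float)) != cudaSuccess ||
        cudaMalloc(&out, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }
    cudaMemset(in, 0, n * sizeof(float));

    const int blocks = (n + kThreadsPerBlock - 1) / kThreadsPerBlock;
    scaleImage<<<blocks, kThreadsPerBlock>>>(in, out, n, 2.0f);

    // Check the launch itself first, then the execution.
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    printf("kernel status: %s\n", cudaGetErrorString(err));

    cudaFree(in);
    cudaFree(out);
    return err == cudaSuccess ? 0 : 1;
}
```

Building with `nvcc -Xptxas -v` prints the per-kernel register count chosen by ptxas, which makes it easy to compare the count with and without the launch bounds.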

If you didn’t get a launch configuration error, that would still make it a bug in the driver.

nvcc can’t figure it out because of its independent compilation approach: host code doesn’t know the number of registers required (which could differ between architectures), and device code doesn’t know the number of threads per block for the launch. On the plus side, fat binaries allow for easy support of multiple architectures without JIT-compiling the code each time.

Definitely no launch configuration error, but I’m reluctant to conclude that’s a bug in the driver, because I think I must be doing some very strange things.

With the launch bounds now specified, I’m getting a completely UNspecified build failure if device debugging is turned on and I specify a particular PTX or device architecture. If I don’t have device debugging or I don’t specify the architecture, it compiles and runs just fine. But the compiler tells me that if I don’t specify an architecture for device or PTX then it defaults to SASS 2.0, and when I specifically compile for 2.0 it says I’m using too many threads per block, which is true.

If I have to put guards around the launch bounds just to allow device debugging, I think that would be pretty odd. But then it is apparent that I understand all this stuff far less than I thought. Things just aren’t making much sense to me anymore as to how it all works.

Bear in mind that compiling to PTX and relying on the driver for runtime compilation may lead to different register counts (which you won’t see) than when you compile to device-specific code. Compiling to PTX only will also hide any error messages normally emitted by ptxas. Both of these can be rather confusing.
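
One way to see what actually got loaded, whether it came from precompiled device code or from JIT-compiled PTX, is to query the kernel at runtime with `cudaFuncGetAttributes`. The sketch below is illustrative only: `filterKernel` and `blockSizeAllowed` are hypothetical names, and the kernel body is a placeholder rather than the filter discussed in this thread.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for the real image filter.
__global__ void filterKernel(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 0.5f * in[i];
}

// Hypothetical helper: compare an intended block size against the limit the
// runtime reports for this kernel on the current device.
bool blockSizeAllowed(const cudaFuncAttributes& attr, int threadsPerBlock) {
    return threadsPerBlock <= attr.maxThreadsPerBlock;
}

int main() {
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, filterKernel);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // numRegs and maxThreadsPerBlock describe the code as loaded on this device,
    // so they reflect the result of JIT compilation when only PTX was embedded.
    printf("registers per thread:  %d\n", attr.numRegs);
    printf("max threads per block: %d\n", attr.maxThreadsPerBlock);
    printf("PTX version:           %d\n", attr.ptxVersion);
    printf("binary version:        %d\n", attr.binaryVersion);

    const int wanted = 1024;  // block size used in this thread
    printf("launch with %d threads per block allowed: %s\n", wanted,
           blockSizeAllowed(attr, wanted) ? "yes" : "no");
    return 0;
}
```

If `maxThreadsPerBlock` comes back smaller than the block size used at the launch site, the launch cannot succeed on that device regardless of what the build output suggested.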

I did not know that. Thanks for the advice. Seems like I have some reading to do.

As for the device debug issue, does it seem sensible that with device debugging turned on the build would fail if I specify a minimum number of blocks in the launch bounds greater than 1?

Turning device debugging on will disable device-side optimization. Because nvcc builds with full device-side optimization by default, a debug build exercises code paths in the toolchain that are less well tested. Also, register pressure may be higher (although it is often lower) without optimization. In that sense, yes, it seems reasonable that device-side debugging may expose problems otherwise hidden.

OK. Thanks tera. It doesn’t make sense to me that the build would fail with no actual error specified, just a generic ‘build failed for some_module.o’. Particularly when I can turn off optimisation and, as long as I don’t have device debugging turned on, the build completes and the kernel runs, and when device debugging is turned on the number of registers per thread and per block are still within device limits. Famous last words, I guess. I didn’t notice the total number of threads per block was too high last time, which led to the original issue, so it’s likely something similar I’m not seeing.

It is fairly apparent that my understanding of the compilation process is terrible and all my problems have been related to that so thanks for your advice. I’ll go enlighten myself on the details of nvcc et al.

I’d be willing to bet that your post-kernel error checking methodology is flawed, if you ran into a registers-per-thread issue that prevented kernel launch and yet, as you claim, “a cuda check after the kernel launch always returns success.”

Proper post-kernel-launch error checking methodology is covered here:

The type of error associated with insufficient resources for a launch is a non-sticky error. Therefore, to capture it, it is necessary to use either cudaPeekAtLastError or cudaGetLastError before any additional CUDA runtime API calls are made.

Thanks, txbob, for making, as njuffa did, incorrect and completely unhelpful presumptions. No, my post-kernel error checking was not flawed. I can read, and I have read the CUDA documentation, including the parts on error checking, and I’m fully aware of all the ways in which CUDA API calls can and should be checked. I not only used the correct error checking, but I also re-worked the kernel in question countless different ways, with different types of streaming, blocking and non-blocking, etc.
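
A minimal sketch of that checking pattern is shown below, for readers who want something concrete. `CUDA_CHECK` and `fillKernel` are hypothetical names chosen for illustration; only the runtime calls themselves (`cudaGetLastError`, `cudaDeviceSynchronize`, and so on) are part of the CUDA runtime API.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: abort with file/line information on any CUDA error.
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err_ = (call);                                        \
        if (err_ != cudaSuccess) {                                        \
            fprintf(stderr, "%s:%d: %s (%s)\n", __FILE__, __LINE__,       \
                    cudaGetErrorName(err_), cudaGetErrorString(err_));    \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)

// Placeholder kernel: writes a constant into every element.
__global__ void fillKernel(int* out, int n, int value) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = value;
}

int main() {
    const int n = 4096;
    int* d_out = nullptr;
    CUDA_CHECK(cudaMalloc(&d_out, n * sizeof(int)));

    fillKernel<<<(n + 255) / 256, 256>>>(d_out, n, 7);

    // Launch errors (for example cudaErrorLaunchOutOfResources when the block
    // needs more registers than are available) are reported here, and must be
    // read before any other runtime call is made.
    CUDA_CHECK(cudaGetLastError());

    // Errors that occur while the kernel is executing surface at the next
    // synchronizing call.
    CUDA_CHECK(cudaDeviceSynchronize());

    int first = 0;
    CUDA_CHECK(cudaMemcpy(&first, d_out, sizeof(int), cudaMemcpyDeviceToHost));
    printf("first element: %d\n", first);

    CUDA_CHECK(cudaFree(d_out));
    return 0;
}
```

The ordering matters: checking only the return value of a later `cudaMemcpy` or `cudaDeviceSynchronize` can miss a failed launch, because the launch error is not sticky.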

Regarding your statement that 3.7 can support 128K per block, maybe we are reading different documents? The version I’m reading says 128K per SM and 64K per block…

Please people, no more posts with presumptions about my code being wrong.

I thought it was quite clear that tera has 100% pinpointed the problem, and that my misunderstanding of nvcc in particular and the CUDA compilation process in general made me PRESUME loads of things that just were not correct.

So yes, the problem was entirely my own, everything has now been rectified, the kernel compiles and runs perfectly again with exactly the same code I had before, error checking and all.

Funny how presumptions have a way of causing problems.

Yes, you’re correct, it’s 64K registers per block, not 128K as I previously said (which I’ve edited to remove).

My initial answer was based on the minimal amount of information provided. What little information was provided wasn’t actionable. That required me to start with some very basic suggestions.

There are an infinite number of ‘answers’ that you could have provided. Personally, I don’t see how your presumption that I don’t know how to debug etc. is the logical conclusion from my original post, or in general from ‘minimal information’. In fact, now that I have a better understanding of the CUDA compilation process, it seems to me that my original post about needing PTX etc. was a very significant indication of the problem.

Thank you for reminding me that I should have immediately asked for a minimal, complete, and verifiable example before offering any comments.