C++ CUDA DLL works on C2050, but fails on GeForce GTX Titan

I have a CUDA DLL that is launched from Excel. It works fine on a C2050 card. However, when I switched to a GeForce GTX Titan card, the kernel execution fails (no error messages, but no results are calculated). The card itself should be OK, since it runs all the other Nvidia samples just fine. I tried to debug the kernel code with Nsight, but the breakpoints were never hit. Again, I can use Nsight to debug the same kernel code on the C2050.

I did use “compute_20,sm_20;compute_30,sm_30;compute_35,sm_35;compute_37,sm_37;compute_50,sm_50;compute_52,sm_52;” for the code generation.

Any help or suggestions will be highly appreciated.

Your kernel is not running for some reason. Have you used proper CUDA error checking throughout your CUDA code? That means testing the return value of every CUDA API call, and testing for errors after every kernel launch. If you’re not sure how to do proper CUDA error checking, google “proper cuda error checking” and take the first hit.

I’m pretty confident that with proper error checking and display of such reported errors, you’ll have a better idea of why the code is not running.
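For reference, here is a minimal sketch of what that kind of error checking usually looks like. The macro name CUDA_CHECK and the kernel scaleKernel are placeholders invented for this illustration, not anything from the original DLL. The sketch exits on the first error, which is fine for a standalone test program.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro (the name is an assumption, not a CUDA API).
// Wrap every runtime API call in it so no error goes unnoticed.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error %s at %s:%d: %s\n",           \
                    cudaGetErrorName(err_), __FILE__, __LINE__,       \
                    cudaGetErrorString(err_));                        \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Placeholder kernel standing in for the real one.
__global__ void scaleKernel(double *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0;
}

int main() {
    const int n = 1024;
    double *d_data = nullptr;
    CUDA_CHECK(cudaMalloc(&d_data, n * sizeof(double)));
    CUDA_CHECK(cudaMemset(d_data, 0, n * sizeof(double)));

    scaleKernel<<<n / 512, 512>>>(d_data, n);
    // A kernel launch returns nothing, so query the launch status explicitly.
    // Launch-time failures such as "too many resources requested for launch"
    // are reported here.
    CUDA_CHECK(cudaGetLastError());
    // Errors raised while the kernel executes are reported at the next
    // synchronizing call.
    CUDA_CHECK(cudaDeviceSynchronize());

    CUDA_CHECK(cudaFree(d_data));
    return 0;
}
```

Inside a DLL hosted by Excel you would not want to call exit(); returning the error code or message to the caller would be the safer variant of the same idea.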

The error message I got is “too many resources requested for launch” or “Invalid argument”. Somehow I got the DLL to run by lowering the number of threads per block from 512 to 128. I don’t quite understand why the thread count is an issue for the GeForce Titan and K80, but not for the C2050. In my case, I launch 1024 threads in total. With 128 threads per block, 8 blocks are now launched.

The PTXAS compiling information is:

1> ptxas info : 1560 bytes gmem, 464 bytes cmem[3]
1> ptxas info : Compiling entry function ‘_Z6gpuRunPN7NumeriX19SimulationMethods_IEPdS2_PNS_15SimulationIndexEPNS_20SimulationIndexArrayEPiS7_PbS8_S7_’ for ‘sm_35’
1> ptxas info : Function properties for _Z6gpuRunPN7NumeriX19SimulationMethods_IEPdS2_PNS_15SimulationIndexEPNS_20SimulationIndexArrayEPiS7_PbS8_S7_
1> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
1> ptxas info : Used 191 registers, 400 bytes cmem[0], 304 bytes cmem[2]
1> ptxas info : Function properties for __internal_accurate_pow
1> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads

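So the reason it fails on the Titan is that I used more than 65536 registers per block: 191 registers per thread × 512 threads per block = 97792 registers. With 128 threads per block it is only 24448.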
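The arithmetic behind that conclusion can be sketched as follows. The function and constant names are made up for this illustration. The per-block register limits (32768 for sm_20, 65536 for sm_35) are the documented values for those compute capabilities, and the sketch ignores register allocation granularity, so the real limit can be slightly tighter.

```cpp
// Rough register budget check for a kernel launch.
// Assumption: allocation granularity is ignored; the hardware rounds
// register counts up, so actual limits can be a little stricter.

// Maximum number of 32-bit registers available to one thread block.
constexpr int kRegsPerBlockSm20 = 32768;  // Fermi, e.g. Tesla C2050
constexpr int kRegsPerBlockSm35 = 65536;  // Kepler GK110, e.g. GTX Titan

// Registers one block needs for a given per-thread count and block size.
constexpr int registersPerBlock(int regsPerThread, int threadsPerBlock) {
    return regsPerThread * threadsPerBlock;
}

// True if a block of this size stays within the per-block register limit.
constexpr bool blockFits(int regsPerThread, int threadsPerBlock,
                         int regsPerBlockLimit) {
    return registersPerBlock(regsPerThread, threadsPerBlock) <= regsPerBlockLimit;
}

// Largest per-thread register count that still allows this block size.
constexpr int maxRegsPerThread(int threadsPerBlock, int regsPerBlockLimit) {
    return regsPerBlockLimit / threadsPerBlock;
}

// Numbers from this thread, checked at compile time.
static_assert(!blockFits(191, 512, kRegsPerBlockSm35),
              "191 regs x 512 threads = 97792 > 65536: launch fails on sm_35");
static_assert(blockFits(191, 128, kRegsPerBlockSm35),
              "191 regs x 128 threads = 24448: launch succeeds on sm_35");
static_assert(blockFits(63, 512, kRegsPerBlockSm20),
              "63 regs x 512 threads = 32256 <= 32768: launch succeeds on sm_20");
```

This also shows why the C2050 was unaffected: sm_20 caps a thread at 63 registers, so a 512-thread block just fits into its smaller register file.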

That sure is a lot of registers being used per thread. I assume this is double-precision code (where each ‘double’ variable requires two 32-bit registers)? You could try using __launch_bounds__() on the relevant kernel(s) to reduce register use per thread somewhat. If you can post the kernel code, it might be possible to recommend a few strategies for reducing register pressure. You might also want to check the Best Practices Guide.

What CUDA version are you using? At this point, I would recommend version 7.5, as there do not seem to be any significant issues with it. Version 6.5 is fine but does not include the latest improvements. Version 7 had a number of compiler and library issues associated with it, so my recommendation is to avoid it.

I use CUDA 7.5. For sm_20, the compiler uses 63 registers. However, with sm_35, it uses 191 registers. Not sure which variables were moved into registers, though.

My code was not originally written for the GPU (lots of user-defined classes with pointer members, for example), so it won’t make much sense to post it at this point. But it’s good to know that register use can vary wildly between compute capabilities. I could set a maximum number of registers, though.

Thanks for the great help!

A typical fix would be to put launch bounds (__launch_bounds__) in your code, as njuffa has already indicated, specifically in front of whatever kernel this is.
If you want to run 1024 threads per block on that kernel, you should use the launch bounds to limit the compiler to 63 registers per thread, or something like that.
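As a rough illustration of the idea, the sketch below attaches __launch_bounds__ to a kernel and then asks the runtime how many registers the compiled kernel actually uses. The kernel boundedKernel is a placeholder invented for this example, and 512 is simply the block size mentioned earlier in the thread.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel. __launch_bounds__(512) promises the compiler that the
// kernel is never launched with more than 512 threads per block, so it must
// keep register use low enough for such a block to fit on the target GPU.
__global__ void __launch_bounds__(512)
boundedKernel(double *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * data[i] + 1.0;
}

int main() {
    // Query the resource usage of the kernel as compiled for the current device.
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, boundedKernel);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFuncGetAttributes failed: %s\n",
                cudaGetErrorString(err));
        return 1;
    }
    printf("registers per thread: %d\n", attr.numRegs);
    printf("max threads per block for this kernel: %d\n",
           attr.maxThreadsPerBlock);
    return 0;
}
```

Checking attr.maxThreadsPerBlock before choosing a block size is one way to catch this class of failure at run time, whatever GPU the DLL lands on. The nvcc option -maxrregcount is a coarser alternative that caps registers for every kernel in the compilation unit rather than for one kernel.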

Yes, adding __launch_bounds__ in front of the kernel is a great idea. That guarantees the maximum number of registers is not exceeded, no matter what the compute capability is. Thank you both for the great help!

It is entirely possible for the same high-level language (HLL) source code to result in machine code with very different register usage, depending on the machine architecture. CUDA source code is initially translated into an intermediate, platform-independent code called PTX. This code is then further compiled to machine code with a tool chain component called, somewhat misleadingly, ptxas.

The compilation of PTX to machine code uses machine (architecture) specific code transformations, in particular for register allocation, instruction selection, and instruction scheduling. For GPU architectures with increased register file size, the relevant heuristics will enable code transformations that generally will improve performance at the cost of additional registers used.

Overall, that is a good thing, as available hardware resources are utilized fully to maximize performance. Occasionally, though, this can backfire, leading to a register use explosion. This could constitute a bug, or could simply be a limitation of the chosen heuristic. It is impossible to tell which it is here without knowledge of the code. All heuristics generally have the property that they deliver good or acceptable results for the large majority of use cases, but sub-optimal results for a small number of use cases.