Strange behavior on 2080 Ti

I am seeing very strange behavior on a 2080 Ti.
In my project, I use LLVM to generate IR, emit a PTX file through the NVPTX backend, load the PTX file with the CUDA API, and launch the kernel.
The problem is that my program produces incorrect results only on the 2080 Ti; it works well on the P100 and the 1080.
To debug the program, I inserted a printf instruction into the LLVM IR. Then something strange happened: with the printf instruction inserted, the program works correctly on the 2080 Ti.

OS : Ubuntu 16.04
CUDA driver : 415.27
CUDA runtime : 9.0
SM arch : sm_60

Looking forward to any reply.


GPUs do not provide binary compatibility like x86 CPUs. You need to build for the correct GPU architecture.

P100 and 1080 are in the Pascal family with architecture sm_6x. The 2080 Ti is in the Turing family, which would be architecture sm_7x. Consult documentation to find out the exact architecture (i.e. the correct value of ‘x’).

[Later:] Wikipedia’s list says RTX 2080 Ti is compute capability 7.5, so sm_75 is what you would want to specify for compilation.

Thanks for your reply.
Yes, GPUs do not provide binary compatibility, but a PTX file should be compatible across them.
My program provides a PTX file, so I don't think this is the problem.

PTX is an intermediate compiler representation and a virtual ISA. PTX code is compiled to machine code, at which point the target architecture needs to be specified. It isn’t clear to me how that works in your setup, but I did notice that you specified an architecture (sm_60) in your question which does not match your GPU.
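
As a quick sanity check on the architecture question, the `.target` directive near the top of the generated PTX file shows which architecture the PTX was written for. The following is a minimal Python sketch, not part of the original setup. The helper names (`ptx_header`, `arch_name`, `can_jit`) and the sample PTX header are made up for illustration. The compatibility rule assumes the usual behavior that the driver can JIT-compile PTX for a device whose compute capability is at least the PTX target; targets with a letter suffix are deliberately treated as not portable.

```python
import re

# Hypothetical sample header; a real file comes from the NVPTX backend.
SAMPLE_PTX = """\
//
// Generated by LLVM NVPTX Back-End
//

.version 6.0
.target sm_60
.address_size 64
"""


def ptx_header(ptx_text):
    """Return (ptx_version, target) read from the .version and .target
    directives; a missing directive yields None in that position."""
    version = re.search(r"^\s*\.version\s+(\d+\.\d+)", ptx_text, re.MULTILINE)
    target = re.search(r"^\s*\.target\s+(sm_\d+[a-z]?)", ptx_text, re.MULTILINE)
    return (
        version.group(1) if version else None,
        target.group(1) if target else None,
    )


def arch_name(major, minor):
    """Turn a compute capability such as 7.5 into an arch name such as sm_75."""
    return f"sm_{major}{minor}"


def can_jit(target, major, minor):
    """Assumed rule: PTX for sm_XY can be JIT-compiled for a device whose
    compute capability is at least X.Y. Suffixed targets (for example a
    trailing letter) are treated as not portable in this sketch."""
    match = re.fullmatch(r"sm_(\d+)", target)
    if match is None:
        return False
    return int(match.group(1)) <= major * 10 + minor


if __name__ == "__main__":
    version, target = ptx_header(SAMPLE_PTX)
    print("PTX version:", version, "target:", target)
    print("device arch:", arch_name(7, 5))
    print("JIT possible on 7.5 device:", can_jit(target, 7, 5))
```

Under that assumption, a `.target sm_60` header passes this check for a compute capability 7.5 device, so the mismatch by itself would not be expected to stop the driver from compiling the PTX. It is still worth confirming what the file actually declares.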

Issues that stop manifesting after the addition of printf() calls could be race conditions in the code, or access to uninitialized or out-of-bounds data. Adding printf() could also mask a compiler bug, but as compiler bugs are rare these days compared to the plethora of possible programmer bugs, you might want to check your code. It is possible for bugs to manifest only on one architecture but not on a different one.

My guess is that your code may be relying implicitly on synchronous execution of all threads in one warp. This is an assumption that no longer holds on Volta and Turing. The program counter of different threads belonging to the same warp may diverge here.

Thanks for your reply.
At first, I was also afraid there might be a hidden bug like the ones you described.
So I ran this test:

  1. Reduced the input data to just 10 rows.
  2. Set the grid size and the block size to 1.
    This means that CUDA runs the kernel in a single thread.
    I found that the program still produces incorrect output,
    but if I insert a printf instruction, the output becomes correct.

What a strange problem!

Looking forward to any reply.

If there is a bug in CUDA compilation, it’s unlikely to get fixed or worked on based on your discussion so far, unless you can provide a simple, complete, self-contained reproducer. If you can do that, you may wish to simply provide it here, or perhaps even better file a bug. Instructions for filing bugs are linked in a sticky post at the top of this forum.

Thanks for your reply.
I have fixed the problem by upgrading the CUDA driver from 415.27 to 418.56.