CUDA code working on <=Volta, not working on RTX Titan

This is an issue I don’t think I’ve ever had in all my years of CUDA computing.

I have CUDA code that I call through a custom Python framework, and it runs fine on every GPU I’ve tried except an RTX Titan. The code runs perfectly on a Titan V and on earlier GPUs like the GTX 1080 Ti, Quadro M1200, etc. On the RTX Titan, however, I see the GPU memory get loaded (so it at least transfers data to the GPU), and then it crashes within seconds. Here’s a summarized traceback of the error output:

From Python’s terminal:

An error ocurred while starting the kernel
GPUassert: 4 unspecified launch failure cu_sync.cu 11
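
For context, the GPUassert line comes from the standard CUDA error-checking macro I wrap calls with; a minimal sketch of the pattern in cu_sync.cu (simplified, not the exact source):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }

inline void gpuAssert(cudaError_t code, const char *file, int line)
{
    // Prints the numeric code, its string, and where the check fired,
    // e.g. "GPUassert: 4 unspecified launch failure cu_sync.cu 11".
    if (code != cudaSuccess) {
        fprintf(stderr, "GPUassert: %d %s %s %d\n",
                (int)code, cudaGetErrorString(code), file, line);
        exit((int)code);
    }
}

// The failing line is a synchronize-and-check after a kernel launch,
// something like:
// gpuErrchk(cudaDeviceSynchronize());

Since launch errors are asynchronous, the error surfaces at that post-launch synchronize rather than necessarily at the call that caused it.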

When I run CUDA-MEMCHECK on the app, I get the following output…

GPUassert: 13 CUBLAS_STATUS_EXECUTION_FAILED cublas_gemm.cu 233
========= Program hit cudaErrorInvalidValue (error 11) due to "invalid argument" on CUDA API call to cudaFuncSetAttribute.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:C:\Windows\SYSTEM32\nvcuda.dll (cuD3D9UnmapVertexBuffer + 0x2e2c85) [0x2f105b]
=========     Host Frame:C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0\bin\cublas64_100.dll (cublasGemmStridedBatchedEx + 0x20b41) [0x460651]
=========     Host Frame:C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0\bin\cublas64_100.dll (cublasGemmStridedBatchedEx + 0x56d5) [0x4451e5]
=========     Host Frame:C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0\bin\cublas64_100.dll (cublasCreate_v2 + 0x40e) [0x10253e]
=========     Host Frame:C:\Users\User\Desktop\..\pycu_interface\cublas_helpers\lib\cublas.dll (cublas_init + 0x27) [0x161f7]

And so on… it prints about 30 more of these. The last one looks like:

========= Program hit cudaErrorInvalidValue (error 11) due to "invalid argument" on CUDA API call to cudaGetLastError.
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:C:\Windows\SYSTEM32\nvcuda.dll (cuD3D9UnmapVertexBuffer + 0x2e2c85) [0x2f105b]
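
For what it’s worth, the CUBLAS_STATUS_EXECUTION_FAILED line at the top of that output comes from my own cuBLAS status-check macro; a simplified sketch of the pattern (helper names hypothetical, not my exact source):

#include <cstdio>
#include <cublas_v2.h>

// Hypothetical helper: map a cuBLAS status to a printable name,
// since CUDA 10 has no built-in status-to-string function.
static const char *cublas_status_str(cublasStatus_t s)
{
    switch (s) {
    case CUBLAS_STATUS_SUCCESS:          return "CUBLAS_STATUS_SUCCESS";
    case CUBLAS_STATUS_NOT_INITIALIZED:  return "CUBLAS_STATUS_NOT_INITIALIZED";
    case CUBLAS_STATUS_EXECUTION_FAILED: return "CUBLAS_STATUS_EXECUTION_FAILED";
    default:                             return "CUBLAS_STATUS_<other>";
    }
}

#define cublasErrchk(ans) { cublasAssert((ans), __FILE__, __LINE__); }

inline void cublasAssert(cublasStatus_t code, const char *file, int line)
{
    // Produces lines like
    // "GPUassert: 13 CUBLAS_STATUS_EXECUTION_FAILED cublas_gemm.cu 233".
    if (code != CUBLAS_STATUS_SUCCESS)
        fprintf(stderr, "GPUassert: %d %s %s %d\n",
                (int)code, cublas_status_str(code), file, line);
}

// cublas_gemm.cu line 233 wraps the batched GEMM call, roughly:
// cublasErrchk(cublasGemmStridedBatchedEx(handle, /* ... */));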

I don’t have any other RTX GPUs to check whether this issue is isolated to that generation of cards, but the Volta, Pascal, and Maxwell generations all work without error.

I’ve only been able to test the RTX Titan hardware itself in some gaming benchmarks, where it worked fine and performed as expected (slightly better than the Titan V). I will test an OpenCL app this weekend.

I’ve tried reinstalling the latest driver (417.35) and the latest CUDA Toolkit (10.0). I’ve also tried recompiling my libraries for every architecture from sm_30 up to sm_75.
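
Concretely, "recompiling from sm_30 to sm_75" meant building fat binaries with one gencode entry per architecture, along these lines (file names illustrative):

nvcc -c kernels.cu -o kernels.obj ^
  -gencode arch=compute_30,code=sm_30 ^
  -gencode arch=compute_52,code=sm_52 ^
  -gencode arch=compute_61,code=sm_61 ^
  -gencode arch=compute_70,code=sm_70 ^
  -gencode arch=compute_75,code=sm_75 ^
  -gencode arch=compute_75,code=compute_75

The final compute_75 entry embeds PTX so the driver can JIT-compile for anything newer, so a missing-SASS mismatch shouldn’t be the problem.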

I am running Win 10 Pro right now, and I’ll get around to testing Linux early next year to see whether this is another random CUDA+Windows issue. It’s unlike the ones I’ve encountered before, though: those errors were consistent across different GPUs, whereas this one only appears on the RTX Titan.

Any help much appreciated, thanks!

Often the best thing to do in these cases is to develop a small, self-contained reproducer code, and then file a bug at developer.nvidia.com.
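
Something as small as this is usually enough: a bare-bones sketch that creates a cuBLAS handle and runs one tiny GEMM with every status checked. Adapt it toward whichever call fails first for you.

#include <cstdio>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main(void)
{
    const int n = 4;
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, n * n * sizeof(float));
    cudaMalloc(&d_b, n * n * sizeof(float));
    cudaMalloc(&d_c, n * n * sizeof(float));
    cudaMemset(d_a, 0, n * n * sizeof(float));
    cudaMemset(d_b, 0, n * n * sizeof(float));

    // Handle creation: the first place your backtrace touches cuBLAS.
    cublasHandle_t handle;
    cublasStatus_t st = cublasCreate(&handle);
    printf("cublasCreate: %d\n", (int)st);

    // One small SGEMM; 13 here would match CUBLAS_STATUS_EXECUTION_FAILED.
    const float alpha = 1.0f, beta = 0.0f;
    st = cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                     &alpha, d_a, n, d_b, n, &beta, d_c, n);
    printf("cublasSgemm: %d\n", (int)st);

    // Force any asynchronous launch error to surface.
    cudaError_t err = cudaDeviceSynchronize();
    printf("cudaDeviceSynchronize: %d (%s)\n", (int)err, cudaGetErrorString(err));

    cublasDestroy(handle);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}

If that passes on the RTX Titan, grow it toward the cublasGemmStridedBatchedEx call from your backtrace until it fails, and attach the result to the bug report.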

Hi there, is there any follow-up on what’s causing this issue? I am also having trouble running CUDA (Toolkit v10) code on a Titan RTX, even though it runs without any hiccups on an RTX 2080.