Errors building CUDA 9.2 using Visual Studio 2013

Running Windows 7

I have followed the quick start instructions at


When I rebuild, I see errors.

In release mode, the errors are of the form:

Error 31 error LNK2038: mismatch detected for '_MSC_VER': value '1600' doesn't match value '1800' in C:\ProgramData\NVIDIA Corporation\CUDA Samples\v9.2\6_Advanced\cdpLUDecomposition\cublas_device.lib(sgemmEx.obj) cdpLUDecomposition

In debug mode, the errors are of the form:

14>cublas_device.lib(sgemmEx.obj) : error LNK2038: mismatch detected for '_MSC_VER': value '1600' doesn't match value '1800' in
14>cublas_device.lib(sgemmEx.obj) : error LNK2038: mismatch detected for '_ITERATOR_DEBUG_LEVEL': value '0' doesn't match value '2'
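
For context, LNK2038 is reported when object files record different values for the same linker-checked key, here `_MSC_VER` and `_ITERATOR_DEBUG_LEVEL`. The sketch below assumes the usual mapping of `_MSC_VER` values to Visual Studio releases (1600 for 2010, 1800 for 2013, 1900 for 2015). The helper names `toolset_name`, `current_msc_ver`, and `print_build_settings` are hypothetical and only illustrate how to check which settings a given project is actually compiled with.

```cpp
#include <cstdio>
#include <string>  // any standard header; on MSVC this also defines _ITERATOR_DEBUG_LEVEL

// Hypothetical helper: map an _MSC_VER value to a Visual Studio release.
const char *toolset_name(int msc_ver) {
    switch (msc_ver) {
        case 1600: return "Visual Studio 2010";
        case 1700: return "Visual Studio 2012";
        case 1800: return "Visual Studio 2013";
        case 1900: return "Visual Studio 2015";
        default:   return "unknown";
    }
}

// Compiler version of this translation unit, or 0 when not built with MSVC.
int current_msc_ver() {
#ifdef _MSC_VER
    return _MSC_VER;
#else
    return 0;
#endif
}

// Prints the two values that the linker compares in the errors above.
void print_build_settings() {
    std::printf("_MSC_VER = %d (%s)\n", current_msc_ver(),
                toolset_name(current_msc_ver()));
#ifdef _ITERATOR_DEBUG_LEVEL
    std::printf("_ITERATOR_DEBUG_LEVEL = %d\n", _ITERATOR_DEBUG_LEVEL);
#endif
}
```

Calling `print_build_settings()` from the sample's `main` in both the Debug and Release configurations shows which values the project's own object files carry, which can then be compared with the `1600` and `0` recorded in `cublas_device.lib`.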

Do I need to rebuild cublas_device.lib and, if so, how?

Howard Weiss

Running on Windows 10, using Visual Studio 2015 Update 3. Same problem here. It seems cublas_device.lib was compiled with _MSC_VER=1600 (Visual Studio 2010)? This is a weird bug.

Does anyone know how to resolve this? Otherwise, we cannot call cublas in kernels.


NVIDIA is aware of this issue.

It will not be fixed.

The cublas device functionality is deprecated in the CUDA 9.2 toolkit and will be removed from a future toolkit release.

It’s recommended that you begin modifying your code so that it does not depend on this functionality if you want to maintain it with future toolkits. It will not be possible to maintain cublas device functionality with future toolkits.

If you don’t wish to do that, then it’s suggested that you revert to CUDA 9.1 or switch to VS 2010.
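
One possible way to remove the dependency, assuming the matrix multiply can be moved out of the kernel, is to issue it from the host through the regular cuBLAS API. The following is only a sketch under that assumption: `host_sgemm` is a hypothetical helper, the matrices are assumed to be square, column-major, and of size `n * n`, and the program is assumed to be linked against cublas.lib (or `-lcublas`) rather than cublas_device.lib.

```cuda
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

// Hypothetical helper: computes C = A * B for n x n column-major matrices
// with the host-side cuBLAS API instead of a device-side cuBLAS call.
// Assumes a.size() == b.size() == n * n.
bool host_sgemm(const std::vector<float> &a, const std::vector<float> &b,
                std::vector<float> &c, int n) {
    const size_t count = static_cast<size_t>(n) * n;
    const size_t bytes = count * sizeof(float);
    float *d_a = nullptr, *d_b = nullptr, *d_c = nullptr;
    cublasHandle_t handle = nullptr;

    bool ok = cudaMalloc(&d_a, bytes) == cudaSuccess &&
              cudaMalloc(&d_b, bytes) == cudaSuccess &&
              cudaMalloc(&d_c, bytes) == cudaSuccess &&
              cublasCreate(&handle) == CUBLAS_STATUS_SUCCESS;
    if (ok) {
        cudaMemcpy(d_a, a.data(), bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, b.data(), bytes, cudaMemcpyHostToDevice);

        // C = alpha * A * B + beta * C, no transposes, leading dimension n.
        const float alpha = 1.0f, beta = 0.0f;
        ok = cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                         &alpha, d_a, n, d_b, n, &beta, d_c, n)
             == CUBLAS_STATUS_SUCCESS;

        // The blocking copy also waits for the GEMM on the default stream.
        c.resize(count);
        ok = ok && cudaMemcpy(c.data(), d_c, bytes,
                              cudaMemcpyDeviceToHost) == cudaSuccess;
    }

    if (handle) cublasDestroy(handle);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    return ok;
}
```

Whether this is practical depends on how tightly the cuBLAS call is interleaved with the rest of the kernel; code that calls cuBLAS from inside a loop on the device would need that loop restructured on the host.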

Thank you for the reply, txbob.

At first glance, dynamic parallelism sounds like a perfect way to move as much control logic as possible onto the GPU and to eliminate as many host-side kernel launches and GPU-CPU synchronizations as possible. In theory, we can move any single-threaded critical path into CUDA by launching it as <<<1, 1>>> and using that thread to launch second-level data-parallel kernels. In practice, however, we found that after enabling rdc, the second-level kernels become much slower. I am curious why this happens and what the main difficulty behind this otherwise attractive approach is.
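
To make the pattern concrete, here is a minimal sketch of a single-thread control kernel that launches a data-parallel child kernel. The names `child_scale`, `parent_control`, and `run_parent_child` are hypothetical, and the sketch assumes a device of compute capability 3.5 or higher and a build along the lines of `nvcc -arch=sm_35 -rdc=true example.cu -lcudadevrt`.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical second-level, data-parallel kernel: doubles each element.
__global__ void child_scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Hypothetical single-thread control kernel, launched as <<<1, 1>>>.
// It launches the child grid from the device (dynamic parallelism).
__global__ void parent_control(float *data, int n) {
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    child_scale<<<blocks, threads>>>(data, n);
    // No explicit device-side synchronization: the parent grid is not
    // considered complete until its child grids have completed.
}

// Host helper: returns true if every element was doubled.
bool run_parent_child(int n) {
    std::vector<float> host(n, 1.0f);
    const size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float *dev = nullptr;
    if (cudaMalloc(&dev, bytes) != cudaSuccess) return false;

    cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);
    parent_control<<<1, 1>>>(dev, n);              // single host-side launch
    bool ok = cudaDeviceSynchronize() == cudaSuccess;
    ok = ok && cudaMemcpy(host.data(), dev, bytes,
                          cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(dev);

    for (float v : host) ok = ok && (v == 2.0f);
    return ok;
}
```

Building the same `child_scale` kernel once with and once without `-rdc=true`, and launching it directly from the host in both cases, is one way to separate the cost of rdc itself from the cost of the device-side launch.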


rdc (relocatable device code) prevents the compiler from making certain optimizations it might otherwise make.

It’s not uncommon for code to run slower with rdc.

Beyond that, it would be necessary to inspect a specific case.

Thanks for the reply, txbob.