Running Windows 7
I have followed the quick start instructions at
file:///C:/Program%20Files/NVIDIA%20GPU%20Computing%20Toolkit/CUDA/v9.2/doc/html/cuda-quick-start-guide/index.html
When I rebuild, I see errors.
In release mode, the errors are of the form:
Error 31 error LNK2038: mismatch detected for '_MSC_VER': value '1600' doesn't match value '1800' in cdp_lu.cu.obj C:\ProgramData\NVIDIA Corporation\CUDA Samples\v9.2\6_Advanced\cdpLUDecomposition\cublas_device.lib(sgemmEx.obj) cdpLUDecomposition
In debug mode, the errors are of the form:
14>cublas_device.lib(sgemmEx.obj) : error LNK2038: mismatch detected for '_MSC_VER': value '1600' doesn't match value '1800' in cdp_lu.cu.obj
14>cublas_device.lib(sgemmEx.obj) : error LNK2038: mismatch detected for '_ITERATOR_DEBUG_LEVEL': value '0' doesn't match value '2'
Do I need to rebuild cublas_device.lib and, if so, how?
Howard Weiss
Running on Windows 10, using Visual Studio 2015 Update 3. Same problem here. It seems cublas_device.lib was compiled with _MSC_VER=1600 (Visual Studio 2010)? This is a weird bug.
Does anyone know how to resolve this? Otherwise, we cannot call cuBLAS from kernels.
Thanks,
Kaiwen
NVIDIA is aware of this issue.
It will not be fixed.
The cuBLAS device functionality is deprecated in the CUDA 9.2 toolkit and will be removed in a future toolkit release.
[url]https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#deprecated-features[/url]
It's recommended that you begin modifying your codes to remove the dependency on this functionality if you want to maintain them with future toolkits. It will not be possible to use cuBLAS device functionality with future toolkits.
If you don't wish to do that, then it's suggested that you revert to CUDA 9.1, or switch to VS 2010.
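To illustrate the kind of migration meant here, below is a minimal sketch of moving an SGEMM from the deprecated device-side cuBLAS library to the host-side cuBLAS API. The matrix size and variable names are illustrative, not taken from the cdpLUDecomposition sample:

```cuda
// Sketch: host-side cuBLAS SGEMM replacing a device-side cublas call.
// Build without -rdc=true once device-side cuBLAS calls are gone:
//   nvcc sgemm_host.cu -lcublas
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main() {
    const int n = 4;                       // illustrative size
    const float alpha = 1.0f, beta = 0.0f;
    float *dA, *dB, *dC;
    cudaMalloc(&dA, n * n * sizeof(float));
    cudaMalloc(&dB, n * n * sizeof(float));
    cudaMalloc(&dC, n * n * sizeof(float));

    cublasHandle_t handle;
    cublasCreate(&handle);

    // Previously this GEMM might have been issued from inside a kernel via
    // cublas_device.lib; with the host API the call is made here, and the
    // GPU work is still asynchronous with respect to the host.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &alpha, dA, n, dB, n, &beta, dC, n);

    cudaDeviceSynchronize();
    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```

The main cost of this migration is that control logic which used to run on the device (deciding what to multiply next) moves back to the host, with a kernel-launch/sync boundary at each step.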
Thank you for the reply, txbob.
At first glance, dynamic parallelism sounds like a perfect way to move as much control logic as possible to the GPU, eliminating host-launched kernels and GPU-CPU synchronization. In theory, we can move any single-threaded critical path into CUDA by launching it with <<<1, 1>>> and having that path launch second-level, data-parallel kernels. In practice, however, we found that after enabling rdc, the second-level kernels become much slower. I am curious why this happens, and what the main difficulty is behind this otherwise appealing story.
Thanks,
Kaiwen
Relocatable device code (rdc) prevents the compiler from making certain optimizations it might otherwise make.
It's not uncommon for code to run slower with rdc enabled.
Beyond that, it would be necessary to inspect a specific case.
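For reference, the pattern under discussion can be sketched as follows: a single-thread control kernel launched with <<<1, 1>>> that itself launches a data-parallel child kernel. Dynamic parallelism requires -rdc=true, which is exactly the configuration where the slowdown described above would be observed. Kernel names and sizes are illustrative:

```cuda
// Sketch of host-free control flow via dynamic parallelism.
// Build (compute capability 3.5+ required):
//   nvcc -arch=sm_35 -rdc=true dp_sketch.cu -lcudadevrt
#include <cuda_runtime.h>

__global__ void childKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;        // data-parallel work
}

__global__ void parentKernel(float *data, int n) {
    // Single-thread critical path running on the device; it launches the
    // next level of work without returning control to the host.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    childKernel<<<blocks, threads>>>(data, n);
    cudaDeviceSynchronize();           // device-side wait on the child grid
}

int main() {
    const int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    parentKernel<<<1, 1>>>(d, n);      // one thread drives the control path
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```

With -rdc=true, device functions can no longer be assumed fully visible to the compiler at each call site, which is one concrete reason inlining and related optimizations are inhibited.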