Cannot run example "even easier introduction to CUDA"

I’m trying to run this. I’m on Ubuntu 16.04 with two GTX 1080 Tis and a GT 1030, CUDA 9.1.

Initially, running with cuda-memcheck resulted in this error:

“Program hit cudaErrorInvalidConfiguration (error 9) due to “invalid configuration argument” on CUDA API call to cudaLaunch”

After reading around online, I thought setting COMPUTE_PROFILE=0 might help. Now I’m getting a different error (and unsetting the variable does not help):

“Internal Memcheck Error: Memcheck failed initialization as some other tools is currently attached. Please make sure that nvprof and Nsight Visual Studio Edition are not being run simultaneously”

What is going on? Is the example out of date?

Works fine for me (see below): CUDA 8, Win 7, Quadro P2000. Note that this program does not use any status checking, presumably to prevent the code from becoming cluttered. I would suggest adding that in. There may be an error long before the code gets to the kernel launch.
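Here is a minimal sketch of the status checking suggested above. It is modeled on the blog's unified-memory vector add, and it assumes the later grid-stride version of the kernel with 256-thread blocks. `CUDA_CHECK` and `maxAbsError` are hypothetical helper names, not part of the blog or the CUDA toolkit. Every runtime call is wrapped. The launch is followed by `cudaGetLastError()`, which catches launch-time errors such as an invalid configuration, and by `cudaDeviceSynchronize()`, which reports errors raised while the kernel ran.

```cuda
#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: abort with file/line info if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                                \
  do {                                                                  \
    cudaError_t err_ = (call);                                          \
    if (err_ != cudaSuccess) {                                          \
      std::fprintf(stderr, "CUDA error %d (%s) at %s:%d\n",             \
                   static_cast<int>(err_), cudaGetErrorString(err_),    \
                   __FILE__, __LINE__);                                 \
      std::exit(EXIT_FAILURE);                                          \
    }                                                                   \
  } while (0)

// Grid-stride kernel adding x into y (assumed to match the blog's final version).
__global__ void add(int n, float *x, float *y) {
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;
  for (int i = index; i < n; i += stride)
    y[i] = x[i] + y[i];
}

// Hypothetical host helper: largest |v[i] - expected| over n elements.
float maxAbsError(const float *v, int n, float expected) {
  float maxError = 0.0f;
  for (int i = 0; i < n; i++)
    maxError = std::fmax(maxError, std::fabs(v[i] - expected));
  return maxError;
}

int main() {
  const int N = 1 << 20;
  float *x = nullptr, *y = nullptr;
  CUDA_CHECK(cudaMallocManaged(&x, N * sizeof(float)));
  CUDA_CHECK(cudaMallocManaged(&y, N * sizeof(float)));
  for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

  const int blockSize = 256;
  const int numBlocks = (N + blockSize - 1) / blockSize;
  add<<<numBlocks, blockSize>>>(N, x, y);
  CUDA_CHECK(cudaGetLastError());       // errors in the launch itself
  CUDA_CHECK(cudaDeviceSynchronize());  // errors while the kernel executed

  std::printf("Max error: %g\n", maxAbsError(y, N, 3.0f));
  CUDA_CHECK(cudaFree(x));
  CUDA_CHECK(cudaFree(y));
  return 0;
}
```

With checks like these in place, a failure is reported at the call that caused it instead of surfacing later under a tool such as cuda-memcheck.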

Running this app under cuda-memcheck crashes the display driver on my system, presumably due to hitting TDR (Windows’s two-second GUI watchdog timer) although it is not clear to me why this would be the case.

c:\Users\Norbert\My Programs>nvcc -arch=sm_61 -o cudamanaged.exe
nvcc warning : nvcc support for Microsoft Visual Studio 2010 and earlier has been deprecated and is no longer being maintained
support for Microsoft Visual Studio 2010 has been deprecated!
   Creating library cudamanaged.lib and object cudamanaged.exp

c:\Users\Norbert\My Programs>cudamanaged
Max error: 0

c:\Users\Norbert\My Programs>nvprof cudamanaged.exe
==6500== NVPROF is profiling process 6500, command: cudamanaged.exe
Max error: 0
==6500== Profiling application: cudamanaged.exe
==6500== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
100.00%  251.00ms         1  251.00ms  251.00ms  251.00ms  add(int, float*, float*)

==6500== Unified Memory profiling result:
Device "Quadro P2000 (0)"
   Count  Avg Size  Min Size  Max Size  Total Size  Total Time  Name
    2048  4.0000KB  4.0000KB  4.0000KB  8.000000MB  21.04488ms  Host To Device
     384  32.000KB  32.000KB  32.000KB  12.00000MB  7.825223ms  Device To Host

==6500== API calls:
Time(%)      Time     Calls       Avg       Min       Max  Name
 41.68%  287.57ms         2  143.78ms  4.2846ms  283.28ms  cudaMallocManaged
 36.42%  251.23ms         1  251.23ms  251.23ms  251.23ms  cudaDeviceSynchronize
 13.47%  92.920ms         1  92.920ms  92.920ms  92.920ms  cuDevicePrimaryCtxRelease
  7.53%  51.971ms         1  51.971ms  51.971ms  51.971ms  cudaLaunch
  0.72%  4.9578ms         2  2.4789ms  1.7708ms  3.1870ms  cudaFree
  0.11%  773.23us        91  8.4970us       0ns  365.06us  cuDeviceGetAttribute
  0.06%  422.53us         1  422.53us  422.53us  422.53us  cuModuleUnload
  0.00%  15.540us         1  15.540us  15.540us  15.540us  cudaConfigureCall
  0.00%  10.849us         1  10.849us  10.849us  10.849us  cuDeviceTotalMem
  0.00%  4.9850us         3  1.6610us     294ns  4.1050us  cuDeviceGetCount
  0.00%  3.2260us         3  1.0750us     587ns  1.4660us  cudaSetupArgument
  0.00%  1.4660us         3     488ns     293ns     879ns  cuDeviceGet
  0.00%  1.1720us         1  1.1720us  1.1720us  1.1720us  cuDeviceGetName

You’re using CUDA 8, I’m using CUDA 9. You’re using Windows 7 (why are you using Win7?!?), I’m on Ubuntu. I’m using GTX 1080 Tis, you’re using a Quadro P2000. It’s hard to imagine a more different configuration. But thanks for the tip on error checking. How does one do that?

I am using what I have in front of me. Yes, my setup is quite different from yours, but the quick check shows that there is nothing fundamentally wrong with the code from the blog. That’s about as much work as I am willing to do for free.

Your favorite internet search engine will return pages of relevant information at the click of a button.


It’s possible that there is a bug in CUDA 9.x, but given that the code from the blog is a trivial example, I would expect any such bug to be found during regression testing and never make it into a CUDA or driver release. Is Ubuntu 16.04 on the list of officially supported operating systems for CUDA 9.1? Are you running with the stock kernel for Ubuntu 16.04?

Are you certain that you are running the code from the blog verbatim and have not changed the n variable?

  • Is n set to 256 in your code?
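These questions target the launch configuration. "Invalid configuration argument" generally means the `<<<blocks, threads>>>` values passed to a kernel are out of range for the device. The most common cause is a block with more threads than `maxThreadsPerBlock`, which is 1024 on recent GPUs. The sketch below triggers the error on purpose so it can be recognized. It assumes device 0 is usable. The kernel `noop` and the helper `numBlocksFor` are hypothetical names for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical empty kernel: only the launch configuration matters here.
__global__ void noop() {}

// Hypothetical helper: blocks needed to cover n elements with blockSize threads each.
int numBlocksFor(int n, int blockSize) { return (n + blockSize - 1) / blockSize; }

int main() {
  cudaDeviceProp prop;
  if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
    std::printf("no usable CUDA device\n");
    return 1;
  }
  std::printf("maxThreadsPerBlock = %d\n", prop.maxThreadsPerBlock);

  const int N = 1 << 20;
  // Valid: 256 threads per block, enough blocks to cover N elements.
  noop<<<numBlocksFor(N, 256), 256>>>();
  std::printf("<<<%d, 256>>>: %s\n", numBlocksFor(N, 256),
              cudaGetErrorString(cudaGetLastError()));

  // Invalid: a single block of N threads exceeds maxThreadsPerBlock, so the
  // launch is rejected with cudaErrorInvalidConfiguration.
  noop<<<1, N>>>();
  std::printf("<<<1, %d>>>: %s\n", N, cudaGetErrorString(cudaGetLastError()));

  cudaDeviceSynchronize();
  return 0;
}
```

A launch-configuration error is not sticky. `cudaGetLastError()` clears it, so the context remains usable afterwards.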

If n is set to 256 in your code, have you verified your CUDA install? Instructions for verification are in the CUDA Linux installation guide and amount to building and running a few sample projects such as vectorAdd.
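The bundled samples remain the reference check. As a quicker supplement (not a replacement), a sketch like the following reports the driver and runtime versions and lists every device the runtime can see. That is useful on a machine with several mixed GPUs. `countDevices` is a hypothetical helper name. All other calls are standard CUDA runtime API functions.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: number of visible CUDA devices, or -1 on a runtime error.
int countDevices() {
  int count = 0;
  cudaError_t err = cudaGetDeviceCount(&count);
  if (err != cudaSuccess) {
    std::fprintf(stderr, "cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
    return -1;
  }
  return count;
}

int main() {
  int driverVersion = 0, runtimeVersion = 0;
  cudaDriverGetVersion(&driverVersion);
  cudaRuntimeGetVersion(&runtimeVersion);
  std::printf("driver version %d, runtime version %d\n", driverVersion, runtimeVersion);

  int count = countDevices();
  if (count <= 0) return 1;
  for (int dev = 0; dev < count; ++dev) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) continue;
    std::printf("device %d: %s, compute capability %d.%d, max threads/block %d\n",
                dev, prop.name, prop.major, prop.minor, prop.maxThreadsPerBlock);
  }
  return 0;
}
```

A driver version lower than the runtime version, or a device missing from the list, points to an installation problem rather than a problem with the blog code.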

Don’t know what happened, but it works today. I’ve had CUDA installed for 6+ months and use it almost every day, but I never needed to actually write any CUDA code until now. Thanks anyway!