I am developing a CUDA program and have run into many problems with it. I list them starting from the most important.
Nsight / Start CUDA Debugging doesn’t work: the debugger doesn’t stop at breakpoints in CUDA kernels. I get messages like these in the Output window:
“CUDA context created : 23ecbd80090
CUDA module loaded: 23ed75fcf00 cudaDe.cu.obj
CUDA grid launch failed: CUcontext: 2468731158672 CUmodule: 2468924608256 Function: _Z15initCalculationP14TDistrictState
CUDART error: cudaLaunch returned cudaErrorLaunchFailure
CUDART error: cudaMemcpy returned cudaErrorLaunchFailure”
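The log above only shows that the launch failed and that the following cudaMemcpy inherited the same error. A minimal sketch of how the failure could be narrowed down in the program itself is to check the status right after the launch and again after a synchronization. The kernel fillKernel and the helpers report and launchChecked are hypothetical stand-ins, since the real initCalculation code is not shown here.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real kernel (initCalculation is not shown in the post).
__global__ void fillKernel(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)              // bounds check: an out-of-range write is a typical
        data[i] = i;        // cause of cudaErrorLaunchFailure
}

// Print a message for a failed runtime call and pass the status through.
static cudaError_t report(cudaError_t err, const char *what)
{
    if (err != cudaSuccess)
        std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
    return err;
}

// Launch the kernel on n elements and check every stage separately.
cudaError_t launchChecked(int n)
{
    int *dev = nullptr;
    cudaError_t err = report(cudaMalloc((void **)&dev, n * sizeof(int)), "cudaMalloc");
    if (err != cudaSuccess)
        return err;

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    fillKernel<<<blocks, threads>>>(dev, n);

    // Errors in the launch itself (e.g. a bad configuration) show up here.
    err = report(cudaGetLastError(), "kernel launch");
    // Faults inside the running kernel show up only after synchronization.
    if (err == cudaSuccess)
        err = report(cudaDeviceSynchronize(), "kernel execution");

    cudaFree(dev);
    return err;
}
```

Calling launchChecked(1024) returns cudaSuccess when everything works. Running the executable under cuda-memcheck, which ships with the CUDA Toolkit, may additionally point at the faulting memory access.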
NVIDIA Visual Profiler fails to collect advanced analysis information about my kernels. The errors are usually “unspecified”, “insufficient data” and so on (http://prntscr.com/dsxn11). The best I was able to get were some individual results when I made the executable exit after a few iterations. Stopping the run any other way, whether by the execution timeout in the Profiler settings or by pressing Cancel, prevents that data from being collected.
doesn’t help. Debug/release doesn’t make much difference.
If I start a second instance of the program, host memory usage increases a lot and the program starts consuming CPU. E.g. instead of 500 MB it becomes 10 GB for instance 1 and 18 GB for instance 2 (why 18, not 8 or 16?). “Consuming CPU” here means about 100% of one core. One instance sometimes crashes (unspecified kernel launch error). The program itself uses little RAM and CPU: below 1 GB on both the GPU and the host.
After several experiments a single instance started to occupy 18 GB, but I believe I saw the existing instance’s memory usage drop back after I closed the second instance.
Compilation of the .cu file is very slow in the debug configuration (several minutes). It slows down after several messages like
1> ptxas info : Function properties for _ZN74_INTERNAL_52_tmpxft_00002584_00000000_7_cudaDe_compute_61_cpp1_ii_a93c74ca5isnanEf
1> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
CUDA initialization in the .exe is very slow (several minutes). This is fixed if I include “compute_61;sm_61” in the VS project’s CUDA settings.
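The GTX 1080 has compute capability 6.1. When an executable carries no native code for the device architecture, the driver has to compile the embedded PTX at startup, which would be consistent with the slow initialization going away once sm_61 is added (on the nvcc command line the equivalent setting is -gencode arch=compute_61,code=sm_61). A rough sketch for measuring the startup cost in isolation follows; timeCudaInitSeconds is a hypothetical helper name, and it assumes nothing has touched the CUDA runtime earlier in the process.

```cuda
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

// Time the first CUDA runtime call in the process. cudaFree(0) does no real
// work, so the elapsed time is dominated by runtime and context initialization.
double timeCudaInitSeconds()
{
    auto start = std::chrono::steady_clock::now();
    cudaError_t err = cudaFree(0);      // forces initialization of the runtime
    auto stop = std::chrono::steady_clock::now();

    if (err != cudaSuccess)
        std::fprintf(stderr, "CUDA initialization failed: %s\n",
                     cudaGetErrorString(err));

    return std::chrono::duration<double>(stop - start).count();
}
```

Printing the returned value once at the very start of the program, with and without sm_61 in the build settings, should show whether the delay really sits in initialization.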
The compiler doesn’t detect changes in included .h and .cu files: building the solution doesn’t recompile the .cu file.
If I use GPU 1 for my calculations with long-running kernels (1-2 seconds) and increase the TDR timeout, e.g. to 10 seconds, TeamViewer sometimes stops responding. I fixed this by switching the calculations to GPU 2. All the other problems reproduce on it too.
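Switching the calculations to another card can be sketched with the runtime API as below. The helper selectDevice is hypothetical, and it assumes the caller knows which CUDA index the desired card has: CUDA device indices are 0-based and do not necessarily match the numbering Windows shows. The listing also prints whether each device reports a run-time limit on kernels, which is the watchdog behind the TDR timeout.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// List all CUDA devices, then make `index` (0-based) the current device.
// Returns false if the index is out of range or a runtime call fails.
bool selectDevice(int index)
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess)
        return false;

    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, i) != cudaSuccess)
            continue;
        std::printf("device %d: %s, compute capability %d.%d, "
                    "run-time limit on kernels: %s\n",
                    i, prop.name, prop.major, prop.minor,
                    prop.kernelExecTimeoutEnabled ? "yes" : "no");
    }

    if (index < 0 || index >= count)
        return false;

    // Later allocations and kernel launches from this host thread use this device.
    return cudaSetDevice(index) == cudaSuccess;
}
```

For example, selectDevice(1) would pick the second card in CUDA's ordering, and it has to be called before any allocation or kernel launch that should run there.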
This is a genetic programming task developed with Visual Studio 2013. I can send you a debug or release exe and a small dataset to run it. I run it on a test machine through TeamViewer: 6-core Intel Core i7-5930K, 32 GB RAM, 2 GeForce GTX 1080 cards with 8 GB RAM each, Windows 10 Pro 64-bit, DirectX 12, NVIDIA drivers 369.30, CUDA Toolkit 8.0.
I’m not very experienced with CUDA, maybe about 2 working months in total, but I am experienced in programming in general (about 15 years). The current version of the program is primitive and far from optimal; I am optimizing it now. It may also contain memory access errors. But I suppose the tools should work with it too, or at least report a more specific error.