CUDA-GDB hang on cudaMalloc(), single GPU

Machine and software:
Dell Precision T7500 with a GeForce GTX Titan X GPU, driver version 352.63, running Ubuntu 14.04, using Nsight Eclipse Edition v7.5 and CUDA compilation tools v7.5.17, in a single-GPU setup

Problem:
On the first call to cudaMalloc or cudaFree, Nsight hangs and the computer must be rebooted. After searching on Google and reading this: https://devtalk.nvidia.com/default/topic/546956/ and this: https://devblogs.nvidia.com/parallelforall/cuda-pro-tip-understand-fat-binaries-jit-caching/, the likely cause is that the JIT compiler introduces the delay at context creation, i.e. when the first call to cudaMalloc or cudaFree(0) is made. The delay appears to be indefinite; we have waited 30 minutes without it resolving. We have also tried to disable JIT compilation altogether by going to Project/Properties/Build/Settings/CUDA, checking “5.2” next to “Generate GPU code” and checking nothing next to “Generate PTX code”, but Nsight still hangs the computer. The compile command line is as follows (a minimal repro is sketched after the build log):

Building file: ../src/Class/Class.cu
Invoking: NVCC Compiler
/usr/local/cuda-7.5/bin/nvcc -G -g -O0 -gencode arch=compute_52,code=sm_52 -odir "src" -M -o "src/Class/Class.d" "../src/Class/Class.cu"
/usr/local/cuda-7.5/bin/nvcc -G -g -O0 --compile --relocatable-device-code=false -gencode arch=compute_52,code=sm_52 -x cu -o "src/Class/Class.o" "../src/Class/Class.cu"
Finished building: ../src/Class/Class.cu

Building target: Prog
Invoking: NVCC Linker
/usr/local/cuda-7.5/bin/nvcc --cudart static --relocatable-device-code=false -gencode arch=compute_52,code=sm_52 -link -o "Prog" ./src/Prog.o ./src/Class/Class.o -lGL -lGLU -lglut
Finished building target: Prog
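
To isolate the problem from the rest of the project: the hang can be reproduced by nothing more than forcing context creation, since that is where the first cudaMalloc or cudaFree(0) stalls. A minimal sketch (illustrative only, not the actual project code):

#include <cstdio>
#include <cuda_runtime.h>

int main(void)
{
	// The first CUDA runtime call triggers context creation (and any
	// JIT compilation); cudaFree(0) is the conventional no-op for that.
	printf("Forcing context creation...\n");
	cudaError_t err = cudaFree(0);  // hangs here under the debugger
	printf("Context created: %s\n", cudaGetErrorString(err));
	return 0;
}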

When the GPU is not in single-GPU mode (e.g. not running X), this problem disappears.

Were you able to find a solution to this? It's annoying not to be able to debug code that uses dynamic parallelism. I am logging in to a remote box which is not running X, and it still gets stuck. Here is the stack trace from my application:

#0  0x00002aaaaaee5c6d in sendmsg () from /lib64/libpthread.so.0
#1  0x00002aaaac013a48 in cudbgApiDetach () from /usr/lib64/nvidia/libcuda.so.1
#2  0x00002aaaac013e42 in cudbgApiDetach () from /usr/lib64/nvidia/libcuda.so.1
#3  0x00002aaaac00c63a in cudbgReportDriverInternalError () from /usr/lib64/nvidia/libcuda.so.1
#4  0x00002aaaac00d4ad in cudbgReportDriverInternalError () from /usr/lib64/nvidia/libcuda.so.1
#5  0x00002aaaac010297 in cudbgReportDriverInternalError () from /usr/lib64/nvidia/libcuda.so.1
#6  0x00002aaaac0103b9 in cudbgReportDriverInternalError () from /usr/lib64/nvidia/libcuda.so.1
#7  0x00002aaaac0aeca4 in cuEGLInit () from /usr/lib64/nvidia/libcuda.so.1
#8  0x00002aaaabffac9d in cuMemGetAttribute_v2 () from /usr/lib64/nvidia/libcuda.so.1
#9  0x00002aaaabffafb0 in cuMemGetAttribute_v2 () from /usr/lib64/nvidia/libcuda.so.1
#10 0x00000000004328ed in cudart::contextState::loadCubin(bool*, void**) ()
#11 0x0000000000427d50 in cudart::globalModule::loadIntoContext(cudart::contextState*) ()
#12 0x0000000000436026 in cudart::contextState::applyChanges() ()
#13 0x0000000000438ef1 in cudart::contextStateManager::getRuntimeContextState(cudart::contextState**, bool) ()
#14 0x000000000042becc in cudart::doLazyInitContextState() ()
#15 0x000000000040ec48 in cudart::cudaApiMalloc(void**, unsigned long) ()
#16 0x000000000044c908 in cudaMalloc ()
#17 0x00000000004036e8 in main () at ../src/radixsortmsdDP.cu:182

It corresponds to this code segment:

156	int main(void)
157	{
158		const int ARRAY_SIZE = 1 << 9;
159		const int BLOCK_NUM_THREAD = 8;
160		const int ITEMS_PER_THREAD = 8;
161		/*
162		 // 128, 128, 1024 takes long - it works.
163		 const int BLOCK_NUM_THREAD = 128;
164		 const int ITEMS_PER_THREAD = 512;
165		 const int ARRAY_SIZE = BLOCK_NUM_THREAD*ITEMS_PER_THREAD*128;
166		 */
167		const int ARRAY_BYTES = ARRAY_SIZE * sizeof(int);
168	//	const int GRID_SIZE = ARRAY_SIZE / (BLOCK_NUM_THREAD * ITEMS_PER_THREAD)+ 1;
169	//	const int ARRAY_SIZE_PADDED = GRID_SIZE * BLOCK_NUM_THREAD * ITEMS_PER_THREAD;
170	//	const int ARRAY_BYTES_PADDED = ARRAY_SIZE_PADDED * sizeof(int);
171		int *d_in;
172		int *d_out;
173		unsigned int *d_num_elems, *d_offsets, *d_bitnum_beg;
174	
175		// generate input array on host
176		int h_in[ARRAY_SIZE];
177		int h_out[ARRAY_SIZE];
178		//randomFill(h_in, ARRAY_SIZE);
179		reverseFill(h_in, ARRAY_SIZE);
180	
181		std::cout<<"Starting malloc d_in..."<<std::endl;
182		cudaMalloc((void ** )&d_in, ARRAY_BYTES);
183		std::cout<<"Starting malloc d_out..."<<std::endl;
184		CUDA_CHECK_RETURN(cudaMalloc((void ** )&d_out, ARRAY_BYTES));
185		std::cout<<"Starting malloc d_num_elems..."<<std::endl;
186		CUDA_CHECK_RETURN(cudaMalloc((void ** )&d_num_elems, sizeof(unsigned int)));
187		std::cout<<"Starting malloc d_offsets..."<<std::endl;
188		CUDA_CHECK_RETURN(cudaMalloc((void ** )&d_offsets, sizeof(unsigned int)));
189		std::cout<<"Starting malloc d_bitnum_beg..."<<std::endl;
190		CUDA_CHECK_RETURN(cudaMalloc((void ** )&d_bitnum_beg, sizeof(unsigned int)));
191		std::cout<<"Done mallocs..."<<std::endl;
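
For completeness, CUDA_CHECK_RETURN is the error-checking macro from the Nsight EE project template; reproduced here from memory, so treat it as a sketch:

#include <cstdio>
#include <cstdlib>

// Error-checking macro as generated by the Nsight EE project template
// (a sketch from memory). Note that it cannot catch a hang: it only
// reports calls that return an error code.
#define CUDA_CHECK_RETURN(value) {                                        \
	cudaError_t _m_cudaStat = value;                                      \
	if (_m_cudaStat != cudaSuccess) {                                     \
		fprintf(stderr, "Error %s at line %d in file %s\n",               \
				cudaGetErrorString(_m_cudaStat), __LINE__, __FILE__);     \
		exit(1);                                                          \
	}                                                                     \
}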

My compile options, as set by Nsight EE, are as follows:

--cudart static -G -g -O0 -std=c++11 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 --relocatable-device-code=true

Hi, tirpankar.n

Can you describe your problem more clearly?

  1. What do you mean by remote box? Is it a board or a desktop?
  2. Which toolkit version are you using? And which driver version?
  3. What about debugging with cuda-gdb directly on the remote side?
  4. Is this specific to your sample? Do the SDK samples work?
  1. What do you mean by remote box? Is it a board or a desktop?
    It is an Intel Xeon box in a cluster, so NOT an NVIDIA board.

  2. Which toolkit version are you using? And which driver version?
    We have an Lmod module system set up on the cluster nodes.
    I tried both the cuda/8.0 toolkit (release 8.0, V8.0.61) and cuda/9.1 (release 9.1, V9.1.85).
    For cuda/9.1: CUDA Driver Version / Runtime Version 9.1 / 9.1; Capability Major/Minor version number: 6.0.

  3. What about debugging with cuda-gdb directly on the remote side?
    Debugging directly on the remote side (logging in to the box using ssh WITHOUT -X) and running cuda-gdb causes the same failure.

  4. Is this specific to your sample? Do the SDK samples work?
    That is worth checking. So far the failure is specific to my code; I have not tried the SDK samples. I am going to try the cuda-9.1/samples/6_Advanced/cdpQuadtree sample and post my results here.

To state the problem accurately: when I generate a project with Nsight EE that uses dynamic parallelism, it hangs on a CUDA API call. In my case this happens to be cudaMalloc, as you can see in the stack trace generated by cuda-gdb in my post above. This happens when I use cuda-gdb to debug the application, i.e.:

cuda-gdb ./myapp
...
(cuda-gdb) run
Starting program: ~/myapp/Debug/myapp
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x2aaab2fb9700 (LWP 26648)]
[New Thread 0x2aaab31ba700 (LWP 26649)]

So my application is hung at this point. The stack trace that I shared in my previous post is from sending cuda-gdb a SIGINT (^C) at this point and printing the stack trace (info stack).

If I run my application directly, it does not hang at the first CUDA API call, i.e. if I run it as:
[me@mynode Debug]$ ./myapp

Not sure if this info is relevant, but it could be a problem with how cuda-gdb handles API calls, I believe. I also added the cudadevrt library to the link options and I still have the same issue. I tried compiling with “--cudart none” instead of “--cudart static”, but in that case it cannot find the relevant runtime API file to link against.
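
For context, the dynamic-parallelism part of the code boils down to a parent kernel launching a child kernel on the device, which is what requires --relocatable-device-code=true and the cudadevrt link dependency in the first place. A minimal sketch with hypothetical names (not my actual sort kernels):

#include <cuda_runtime.h>

// Device-side launches require compute capability >= 3.5, compilation
// with --relocatable-device-code=true, and linking against cudadevrt,
// e.g.: nvcc -arch=sm_60 -rdc=true sketch.cu -o sketch -lcudadevrt
__global__ void childKernel(int *data)
{
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	data[i] += 1;
}

__global__ void parentKernel(int *data)
{
	if (threadIdx.x == 0) {
		childKernel<<<4, 64>>>(data);  // launch a child grid from the device
		cudaDeviceSynchronize();       // wait for the child grid to finish
	}
}

int main(void)
{
	int *d_data;
	cudaMalloc(&d_data, 4 * 64 * sizeof(int));  // first API call: where the hang shows up
	cudaMemset(d_data, 0, 4 * 64 * sizeof(int));
	parentKernel<<<1, 32>>>(d_data);
	cudaDeviceSynchronize();
	cudaFree(d_data);
	return 0;
}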

Here is the debug build log:

12:58:39 **** Build of configuration Debug for project radixsortmsdDP ****
make all 
Building file: ../src/drecho.cpp
Invoking: NVCC Compiler
/usr/local/cuda-9.1/bin/nvcc -I/home/me/cuda-workspace/radixsortmsdDP/cub-1.7.5 -G -g -O0 -std=c++11 -gencode arch=compute_60,code=sm_60  -odir "src" -M -o "src/drecho.d" "../src/drecho.cpp"
/usr/local/cuda-9.1/bin/nvcc -I /home/me/cuda-workspace/radixsortmsdDP/cub-1.7.5 -G -g -O0 -std=c++11 --compile  -x c++ -o  "src/drecho.o" "../src/drecho.cpp"
Finished building: ../src/drecho.cpp
 
Building file: ../src/radixsortmsdDP.cu
Invoking: NVCC Compiler
/usr/local/cuda-9.1/bin/nvcc -I/home/me/cuda-workspace/radixsortmsdDP/cub-1.7.5 -G -g -O0 -std=c++11 -gencode arch=compute_60,code=sm_60  -odir "src" -M -o "src/radixsortmsdDP.d" "../src/radixsortmsdDP.cu"
/usr/local/cuda-9.1/bin/nvcc -I/home/me/cuda-workspace/radixsortmsdDP/cub-1.7.5 -G -g -O0 -std=c++11 --compile --relocatable-device-code=true -gencode arch=compute_60,code=sm_60  -x cu -o  "src/radixsortmsdDP.o" "../src/radixsortmsdDP.cu"
Finished building: ../src/radixsortmsdDP.cu
 
Building target: radixsortmsdDP
Invoking: NVCC Linker
/usr/local/cuda-9.1/bin/nvcc --cudart static --relocatable-device-code=true -gencode arch=compute_60,code=sm_60 -link -o  "radixsortmsdDP"  ./src/drecho.o ./src/radixsortmsdDP.o   -lcudadevrt
Finished building target: radixsortmsdDP
 

12:58:54 Build Finished (took 14s.696ms)

NOTE: I have scrubbed some paths to remove personal information!

OK, I understand your problem now.

Your application runs fine on its own, but it hangs when run under cuda-gdb. Right?

Since it hangs at cudaMalloc, can you run cuda-memcheck ./myapp to check whether anything is wrong with the memory usage?