Machine and software:
Dell precision T7500 with a Geforce GTX Titan X GPU, driver version 352.63, running Ubuntu 14.04, using Nsight Eclipse v7.5 and Cuda compilation tools v7.5.17, in a single GPU setup
Problem:
On the first call of cudaMalloc or cudaFree, Nsight hangs and the computer must be rebooted. After searching on Google, and reading this: https://devtalk.nvidia.com/default/topic/546956/ and this: https://devblogs.nvidia.com/parallelforall/cuda-pro-tip-understand-fat-binaries-jit-caching/, the likely cause is that the JIT compiler causes the delay at the context creation, e.g when the first call to cudaMalloc or cudaFree(0) is made. There seems to be an indefinite delay, we have tried to wait 30 minutes without it resolving. We have also tried to disable the JIT compilation altogether by going to Project/Properties/Build/Settings/CUDA and checking “5.2” next to “Generate GPU code”, and not checking anything next “Generate PTX code”, but Nsight stills is hanging the computer. The command line for the compiling is:
Were you able to find a solution to this? Its annoying to not be able to debug dynamic paralleled code. I am logging in to a remote box which is not running X and it is still stuck. Here is the stack trace from my application:
#0 0x00002aaaaaee5c6d in sendmsg () from /lib64/libpthread.so.0
#1 0x00002aaaac013a48 in cudbgApiDetach () from /usr/lib64/nvidia/libcuda.so.1
#2 0x00002aaaac013e42 in cudbgApiDetach () from /usr/lib64/nvidia/libcuda.so.1
#3 0x00002aaaac00c63a in cudbgReportDriverInternalError () from /usr/lib64/nvidia/libcuda.so.1
#4 0x00002aaaac00d4ad in cudbgReportDriverInternalError () from /usr/lib64/nvidia/libcuda.so.1
#5 0x00002aaaac010297 in cudbgReportDriverInternalError () from /usr/lib64/nvidia/libcuda.so.1
#6 0x00002aaaac0103b9 in cudbgReportDriverInternalError () from /usr/lib64/nvidia/libcuda.so.1
#7 0x00002aaaac0aeca4 in cuEGLInit () from /usr/lib64/nvidia/libcuda.so.1
#8 0x00002aaaabffac9d in cuMemGetAttribute_v2 () from /usr/lib64/nvidia/libcuda.so.1
#9 0x00002aaaabffafb0 in cuMemGetAttribute_v2 () from /usr/lib64/nvidia/libcuda.so.1
#10 0x00000000004328ed in cudart::contextState::loadCubin(bool*, void**) ()
#11 0x0000000000427d50 in cudart::globalModule::loadIntoContext(cudart::contextState*) ()
#12 0x0000000000436026 in cudart::contextState::applyChanges() ()
#13 0x0000000000438ef1 in cudart::contextStateManager::getRuntimeContextState(cudart::contextState**, bool) ()
#14 0x000000000042becc in cudart::doLazyInitContextState() ()
#15 0x000000000040ec48 in cudart::cudaApiMalloc(void**, unsigned long) ()
#16 0x000000000044c908 in cudaMalloc ()
#17 0x00000000004036e8 in main () at ../src/radixsortmsdDP.cu:182
It corresponds to this code segment:
156 int main(void)
157 {
158 const int ARRAY_SIZE = 1 << 9;
159 const int BLOCK_NUM_THREAD = 8;
160 const int ITEMS_PER_THREAD = 8;
161 /*
162 // 128, 128, 1024 takes long - it works.
163 const int BLOCK_NUM_THREAD = 128;
164 const int ITEMS_PER_THREAD = 512;
165 const int ARRAY_SIZE = BLOCK_NUM_THREAD*ITEMS_PER_THREAD*128;
166 */
167 const int ARRAY_BYTES = ARRAY_SIZE * sizeof(int);
168 // const int GRID_SIZE = ARRAY_SIZE / (BLOCK_NUM_THREAD * ITEMS_PER_THREAD)+ 1;
169 // const int ARRAY_SIZE_PADDED = GRID_SIZE * BLOCK_NUM_THREAD * ITEMS_PER_THREAD;
170 // const int ARRAY_BYTES_PADDED = ARRAY_SIZE_PADDED * sizeof(int);
171 int *d_in;
172 int *d_out;
173 unsigned int *d_num_elems, *d_offsets, *d_bitnum_beg;
174
175 // generate inp array on host
176 int h_in[ARRAY_SIZE];
177 int h_out[ARRAY_SIZE];
178 //randomFill(h_in, ARRAY_SIZE);
179 reverseFill(h_in, ARRAY_SIZE);
180
181 std::cout<<"Starting malloc d_in..."<<std::endl;
182 cudaMalloc((void ** )&d_in, ARRAY_BYTES);
183 std::cout<<"Starting malloc d_out..."<<std::endl;
184 CUDA_CHECK_RETURN(cudaMalloc((void ** )&d_out, ARRAY_BYTES));
185 std::cout<<"Starting malloc d_num_elems..."<<std::endl;
186 CUDA_CHECK_RETURN(cudaMalloc((void ** )&d_num_elems, sizeof(unsigned int)));
187 std::cout<<"Starting malloc d_offsets..."<<std::endl;
188 CUDA_CHECK_RETURN(cudaMalloc((void ** )&d_offsets, sizeof(unsigned int)));
189 std::cout<<"Starting malloc d_bitnum_beg..."<<std::endl;
190 CUDA_CHECK_RETURN(cudaMalloc((void ** )&d_bitnum_beg, sizeof(unsigned int)));
191 std::cout<<"Done mallocs..."<<std::endl;
My compile options are as follows as set by nsight EE:
What do you mean remote box? Is it a board or desktop?
It is an Intel Xeon box in a cluster. So NOT a NVIDIA board.
Which toolkit version are you use ? And which driver version?
We have a lmod system setup on the cluster nodes.
I tried using both toolkit cuda/8.0(release 8.0, V8.0.61) as well as cuda/9.1(release 9.1, V9.1.85).
Corresponding to cuda 9.1 - CUDA Driver Version / Runtime Version 9.1 / 9.1 Capability Major/Minor version number: 6.0
What about debug within cuda-gdb directly in remote side?
Debugging directly on the remote side(logging in using ssh to the box WITHOUT -X) and running cuda-gdb causes the same failure.
Was this specific to your sample ? SDK sample works ?
That is worth checking. The failure is specific to my code.
I have not tried the SDK samples. I am going to try out the samples cuda-9.1/samples/6_Advanced/cdpQuadtree and post my result here.
To put the problem accurately, the issue is that when I generate a project using nsight EE to use dynamic parallelism it basically hangs on a cuda API call. In my case this happens to be cudaMalloc as you can see in the stack trace generated by cuda-gdb in my post above. This happens when I use cuda-gdb to debug the application i.e.:
cuda-gdb ./myapp
...
(cuda-gdb) run
Starting program: ~/myapp/Debug/myapp
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x2aaab2fb9700 (LWP 26648)]
[New Thread 0x2aaab31ba700 (LWP 26649)]
So my application is hung up at this point. The stack trace that I have shared in my previous post is from when I send cuda-gdb a SIGINT (^C) at this point of time and print the stack-trace(info stack)
If I run my application directly, it does not hang at the first cuda API call. i.e. if I run it directly as:
[me@mynode Debug]./myapp
Not sure if this info is relevant:( So could be a problem with how cuda-gdb is handling API calls I believe. I also added the library cudadevrt to the link options and I still have the same issue. I tried compiling with “–cudart none” instead of “–cudart static” but in that case it cannot find the relevant runtime API link file. )