Any ideas about the following problem?
launch cuda kernel - so far, so good
cudaThreadSynchronize - no error returned
cudaMemcpy - hangs and never returns
I originally thought my kernel might be hanging, so I added the cudaThreadSynchronize call. Since it returns without error, the kernel execution should be complete at that point.
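For reference, the sequence looks roughly like this (kernel and buffer names are placeholders, not the real code; this is just a sketch of the pattern with explicit error checks after each step):

```cpp
// Hypothetical names: myKernel, d_data, h_data, nbytes are stand-ins.
myKernel<<<grid, block>>>(d_data);          // kernel launch - so far, so good

cudaError_t err = cudaGetLastError();       // catches launch-configuration errors
if (err != cudaSuccess)
    fprintf(stderr, "launch: %s\n", cudaGetErrorString(err));

err = cudaThreadSynchronize();              // returns cudaSuccess here
if (err != cudaSuccess)
    fprintf(stderr, "sync: %s\n", cudaGetErrorString(err));

// This is the call that hangs and never returns:
err = cudaMemcpy(h_data, d_data, nbytes, cudaMemcpyDeviceToHost);
```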
Here is the bottom of a stack trace. It looks like cudaMemcpy is waiting for something (perhaps using gettimeofday for some kind of polling).
#0 0xffffffffff600064 in ?? ()
#1 0x0000000040a14580 in ?? ()
#2 0x000000325648bf9d in gettimeofday () from /lib64/libc.so.6
#3 0x00002aaaab4c1be2 in ?? () from /usr/lib64/libcuda.so
#4 0x00002aaaaaf1a9a2 in ?? () from /usr/lib64/libcuda.so
#5 0x00002aaaaaf1cd77 in ?? () from /usr/lib64/libcuda.so
#6 0x00002aaaaaf108a9 in ?? () from /usr/lib64/libcuda.so
#7 0x00002aaaaaf0747c in ?? () from /usr/lib64/libcuda.so
#8 0x00002aaaaaef211b in ?? () from /usr/lib64/libcuda.so
#9 0x00002aaaaaf99d14 in ?? () from /usr/lib64/libcuda.so
#10 0x00002b0a3069838e in ?? () from /usr/local/cuda/lib64/libcudart.so.3
#11 0x00002b0a30689fcf in cudaMemcpy ()
#12 0x000000000046a990 in _warppyramidCUDA_updateThresholds (
warppCUDA=0x5558740, l=2) at warpCUDA.cu:4379
I don’t have a small reproducer. This code is called many thousands of times as part of a large system, and does not typically fail until running for many minutes.
So, any ideas? Could a GPU memory overwrite cause this, or does the location of the hang rule that out?
A few more details that could either confuse or enlighten things further. I am running on a GPU cluster with 7 GPUs per node and multiple nodes communicating via Open MPI, on CentOS. The GPUs are C1060s. The CUDA driver and runtime are version 3.1. There is some multithreading, but each GPU has its own process, and only one thread in each process talks to the GPU. We have investigated many possible problem sources, such as InfiniBand pinned memory, but the problem occurs even with InfiniBand totally disabled.
While investigating this problem, I did find a couple of kernels (not this one) with out-of-bounds read access, which I have already fixed. Out of curiosity, can an OOB read cause a kernel to hang if there are no associated out-of-bounds writes or infinite loops (i.e., is the OOB read dangerous by itself)? I have tried cuda-memcheck on most of my kernels in stand-alone versions, but not directly within the big system.