What would cause 1-second GPU lockups in a CUDA program? How can I debug them beyond nvprof?

See the gaps in this Nvprof screenshot: https://i.stack.imgur.com/VovPj.png

I’ve created and am running a real-time CUDA application using 2 GPUs. Both GPUs execute CUDA code, but GPU0 also does OpenGL and QML rendering (with some use of shaders). 99.99% of the time this works perfectly. Very occasionally, however, both GPUs just lock up, halting the processing/CUDA threads and the UI thread for almost exactly 1.0 second.

  • What might be responsible for halting the GPUs for 1 second at a time?
  • Does anyone know how I might debug this situation further? Nvprof lets you see the CUDA activity that might be halting/occupying a GPU, but gives no indication of rendering events.

Additional info:

  • OS: Ubuntu 14.04 with the Metacity window manager
  • CUDA: currently 8.0.61; same behavior on 8.0.44 and also on CUDA 6.5
  • GPU: GTX 980, also GTX 1070
  • Driver: 375.39, also 375.66, 367.xx

Other steps taken:

  • ran cuda-memcheck. memcheck, initcheck and racecheck are all clean.
  • turned the fan speed up to 100% to rule out thermal throttling
  • compiled a second process to periodically run short CUDA test kernels (see the probe sketch after this list). That process halted at the exact same times as the main application.
  • A third non-CUDA command-line process continued running throughout the halts.
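
For reference, here is a minimal sketch of that kind of probe process (simplified; the kernel, the 100 ms threshold, and the 50 ms polling interval here are illustrative, not the exact code I used):

#include <cstdio>
#include <chrono>
#include <thread>
#include <cuda_runtime.h>

// Trivial kernel: normally completes in well under a millisecond.
__global__ void probeKernel(int *out) { *out = 1; }

int main()
{
    int *d_out = nullptr;
    cudaMalloc(&d_out, sizeof(int));

    while (true)
    {
        auto start = std::chrono::steady_clock::now();
        probeKernel<<<1, 1>>>(d_out);
        cudaError_t err = cudaDeviceSynchronize();
        double ms = std::chrono::duration<double, std::milli>(
                        std::chrono::steady_clock::now() - start).count();
        // Flag any probe that takes suspiciously long (threshold is arbitrary).
        if (err != cudaSuccess || ms > 100.0)
            printf("probe stalled for %.1f ms (%s)\n", ms, cudaGetErrorString(err));
        std::this_thread::sleep_for(std::chrono::milliseconds(50));
    }
    return 0;
}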

In each of the “gaps” I see “cuda activity”. For two of the gaps, there is activity in the runtime API timeline, and for one of the gaps, there is activity in the compute timeline. In nvvp, I would hover the mouse over these items and find out what they are. This is of course not a complete roadmap, but I would start by trying to find out what those are.

Does the app use CUDA/OpenGL interop?

Was the code compiled with the device debug switch (-G)?

@txbob

First Gap:
Thread 3444287232: cudaMalloc, 924.998 ms
Thread 3690985216: cudaMemcpy (4 bytes, DtoH), 1001.215 ms

Second Gap:
Thread 3444287232: cudaMalloc, 829.126 ms
Thread 3690985216: cudaStreamSynchronize (4 bytes, DtoH), 1001.864 ms
GPU0 compute: grid=(47907,1,1), block=(512,1,1), 1000.798 ms (see snippet below)

Third Gap:
Thread 4253001472: cudaFree, 1000.462 ms (how can cudaFree possibly take 1000 ms?)

The compute is a simple reduction kernel that normally takes less than 10ms, even for much larger datasets than here. Code is below FWIW.

FindClosest2DPoint<512><<<47907, 512, 0, stream>>>( ... );
template <int NUMTHREADSINBLOCK>
__global__ void FindClosest2DPoint( const PointMapper* __restrict__ points,
                                    const float maxMidpointDistSq,
                                    unsigned int* tripleCounter,
                                    const int max_allowed_triples,
                                    const unsigned short* __restrict__ indsI,
                                    const unsigned short* __restrict__ indsJ,
                                    const unsigned int N,
                                    const unsigned int M,
                                    unsigned short* indsIOut,
                                    unsigned short* indsJOut,
                                    unsigned short* indsKOut,
                                    bool* continueFlag )
{
    EXIT_IF_FALSE(continueFlag[0]);
    const unsigned int index = blockIdx.x;
    if (  index >=  N )
    {
        return;
    }

    int bestInd = -1;
    float bestDistSq = maxMidpointDistSq;
    //get the indices
    unsigned short indJ = indsJ[index];
    unsigned short indI = indsI[index];
    const float2 midpoint = points->midpoint( indI, indJ );
    const float edgeLength = points->distsq(indI,indJ);
    for (int indK =threadIdx.x; indK < M; indK+=NUMTHREADSINBLOCK)
    {
        float2 midpErr = midpoint - points->get_point( indK );
        float normalizedDistSq =  (midpErr.x*midpErr.x + midpErr.y*midpErr.y)/edgeLength;
        if ( normalizedDistSq < bestDistSq )
        {
            bestDistSq = normalizedDistSq;
            bestInd = indK;
        }
    }

    __shared__ unsigned short sdata_ind[NUMTHREADSINBLOCK];
    __shared__ float sdata_distsq[NUMTHREADSINBLOCK];
    sdata_distsq[threadIdx.x] = bestDistSq;
    sdata_ind[threadIdx.x] = bestInd;

    __syncthreads();
    /// sdata now contains NUMTHREADSINBLOCK sub-minima
    /// The rest of this does a pyramidal reduction
#pragma unroll
    for (int numInPyr = NUMTHREADSINBLOCK/2;  numInPyr > 32; numInPyr/=2 )
    {
        if (threadIdx.x < numInPyr)
        {
            if ( sdata_distsq[threadIdx.x + numInPyr] < sdata_distsq[threadIdx.x] )
            {
                sdata_distsq[threadIdx.x] = sdata_distsq[threadIdx.x + numInPyr];
                sdata_ind[threadIdx.x] = sdata_ind[threadIdx.x + numInPyr];
            }
        }
        __syncthreads();
    }
    // Warp-centric reduction
    if (threadIdx.x < 32)
    {
        // now that we are using warp-synchronous programming (below)
        // we need to declare our shared memory volatile so that the compiler
        // doesn't reorder stores to it and induce incorrect behavior.
        {
            volatile float *smem_d = sdata_distsq;
            volatile unsigned short *smem_i = sdata_ind;
            #pragma unroll
            for (int numInPyr = 32;  numInPyr > 0; numInPyr/=2 )
            {
                if ( smem_d[threadIdx.x + numInPyr] < smem_d[threadIdx.x] )
                {
                    smem_d[threadIdx.x] = smem_d[threadIdx.x + numInPyr];
                    smem_i[threadIdx.x] = smem_i[threadIdx.x + numInPyr];
                }
            }
            if ( threadIdx.x == 0 )
            {
                if (smem_d[0] < maxMidpointDistSq)
                {
                    const unsigned int ind = atomicInc( tripleCounter, max_allowed_triples);
                    if (ind < max_allowed_triples)
                    {
                        indsIOut[ind] = indI;
                        indsJOut[ind] = indJ;
                        indsKOut[ind] = smem_i[0];
                    }
                }
            }
        }
    }
}

@txbob

Does the app use CUDA/OpenGL interop?

No, there’s no data shared between CUDA and GL.

Was the code compiled with the device debug switch (-G)?

No. Would that help?

No, I was just checking. You should never evaluate the performance of CUDA code compiled with -G (in my opinion, obviously, like every word of every one of my posts).

Suggestion: refactor your code to get all the cudaMalloc and cudaFree operations out of the processing loop. This is good programming practice anyway.
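
For example, a rough sketch of the pattern (names and sizes here are placeholders, since I don't have your code): allocate device buffers once up front at their maximum size, reuse them every iteration, and free them only at shutdown.

#include <cuda_runtime.h>

struct DeviceBuffers
{
    float *scratch = nullptr;
    unsigned int *tripleCounter = nullptr;
};

// Allocate once, before entering the real-time loop.
void initBuffers(DeviceBuffers &b, size_t maxElems)
{
    cudaMalloc(&b.scratch, maxElems * sizeof(float));
    cudaMalloc(&b.tripleCounter, sizeof(unsigned int));
}

void processFrame(DeviceBuffers &b, cudaStream_t stream)
{
    // Reuse the preallocated buffers; reset counters asynchronously
    // instead of freeing and reallocating every frame.
    cudaMemsetAsync(b.tripleCounter, 0, sizeof(unsigned int), stream);
    // ... launch kernels into 'stream' using b.scratch ...
}

// Free once, at shutdown.
void freeBuffers(DeviceBuffers &b)
{
    cudaFree(b.scratch);
    cudaFree(b.tripleCounter);
}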

I’m not suggesting that cudaMalloc and cudaFree should arbitrarily take 1s, but I don’t have your code to debug, either. These are synchronizing operations, meaning they halt the CPU thread until all previously issued CUDA activity is complete. Again, this doesn’t explain anything, and again, I don’t have your code to debug, but it is what I would strive for in a well-written app: Avoid cudaMalloc and cudaFree in time-sensitive processing loops, if for no other reason than that they are synchronizing and tend to break nice multi-stream behavior.

Perhaps something to try: since the GPU may be sharing duties with the X window system, as a test I would also attempt to run this code, or some refactored version (obviously the OGL stuff would have to be refactored), on a GPU that does not have an X-window system instantiated on it. In fact, since you have no CUDA/GL interop, if you can refactor your app to run the OGL/display tasks on one GPU and the CUDA tasks on a non-display GPU, that would be an interesting test. As it stands now, if the display system hiccups for some reason, that would certainly show up in CUDA processing, especially at sync points. The GPU must context-switch between CUDA and display tasks, and if a display task occasionally became glitchy and took 1 s instead of a few ms, this would be evident (I think) at sync points in the CUDA timeline: the CUDA sync point would have to wait for the GPU to context-switch back to CUDA mode, so a stream sync or a cudaMalloc/cudaFree would wait for the context switch.
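
As a quick sketch of how you might select the non-display GPU for the CUDA work (the device-property heuristic is just a suggestion; you could also hard-code the device index or set CUDA_VISIBLE_DEVICES):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);

    // Pick a GPU that is not driving the display. The kernel-timeout
    // property is a reasonable hint: it is typically set on display GPUs.
    for (int d = 0; d < deviceCount; ++d)
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        printf("device %d: %s, kernelExecTimeoutEnabled=%d\n",
               d, prop.name, prop.kernelExecTimeoutEnabled);
        if (!prop.kernelExecTimeoutEnabled)
        {
            cudaSetDevice(d);   // run all CUDA work on this device
            break;
        }
    }
    // ... CUDA processing loop ...
    return 0;
}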

One final suggestion: start trimming things down, and see when the anomalous behavior disappears. If you can shrink it down to a few hundred lines of code that still reproduces the issue, it may be sanitized enough at that point that you can post it for other people to look at. Just an idea.

And: in case you are not already, it's good practice to use thorough CUDA error checking any time you are having trouble with CUDA code.
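
Something along these lines (a generic macro, not specific to your app):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with file/line information on any CUDA runtime error.
#define CUDA_CHECK(call)                                                    \
    do {                                                                    \
        cudaError_t err__ = (call);                                         \
        if (err__ != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error %s at %s:%d\n",                     \
                    cudaGetErrorString(err__), __FILE__, __LINE__);         \
            exit(EXIT_FAILURE);                                             \
        }                                                                   \
    } while (0)

// Usage:
//   CUDA_CHECK(cudaMemcpyAsync(dst, src, n, cudaMemcpyDeviceToHost, stream));
//   myKernel<<<grid, block, 0, stream>>>(args);
//   CUDA_CHECK(cudaGetLastError());        // catch launch errors
//   CUDA_CHECK(cudaStreamSynchronize(stream));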