Error: Failed to suspend device for CUDA device 0

Hi,

I am trying to develop a simple CUDA application using Nsight on Ubuntu 16.04 LTS. I am working with a persistent kernel that is meant to process incoming coordinates. It is based on a double buffer as shown here: (c++ - Doubling buffering in CUDA so the CPU can operate on data produced by a persistent kernel - Stack Overflow). It works fine, but if I instantiate an extremely innocuous class at some point in the main method, the program crashes with a segmentation fault:

Thread 1 "SOFAS" received signal SIGSEGV, Segmentation fault.
0x0000000000404724 in main () at ../src/SOFAS.cu:155
155			printf("0 bufrdy=%d \n", *bufrdy);
[Thread 0x7fffee9f0700 (LWP 3003) exited]
[Thread 0x7fffef272700 (LWP 3002) exited]
[Thread 0x7fffefa73700 (LWP 3001) exited]
Error: Failed to suspend device for CUDA device 0, error=CUDBG_ERROR_UNKNOWN(0x1).

Here is the main method:

int main() {
	test t(10);

	int *hBuf1, *dBuf1, *hBuf2, *dBuf2;
	volatile int *bufrdy1Flag, *bufrdy2Flag;
	// buffer and "mailbox" setup
	//Allocate host memory for the buffers and the flags
	cudaHostAlloc(&hBuf1, DSIZE * sizeof(int), cudaHostAllocDefault);
	cudaHostAlloc(&hBuf2, DSIZE * sizeof(int), cudaHostAllocDefault);
	cudaHostAlloc(&bufrdy1Flag, sizeof(int), cudaHostAllocMapped);
	cudaHostAlloc(&bufrdy2Flag, sizeof(int), cudaHostAllocMapped);
	cudaCheckErrors("cudaHostAlloc fail");
	//Allocate device memory
	cudaMalloc(&dBuf1, DSIZE * sizeof(int));
	cudaMalloc(&dBuf2, DSIZE * sizeof(int));
	cudaCheckErrors("cudaMalloc fail");
	//Create the CUDA streams
	cudaStream_t streamk, streamc;
	cudaStreamCreate(&streamk);
	cudaStreamCreate(&streamc);
	cudaCheckErrors("cudaStreamCreate fail");
	*bufrdy1Flag = 0;
	*bufrdy2Flag = 0;
	//Fill device memory buffers with 255's
	cudaMemset(dBuf1, 0xFF, DSIZE * sizeof(int));
	cudaMemset(dBuf2, 0xFF, DSIZE * sizeof(int));
	cudaCheckErrors("cudaMemset fail");
	// inefficient crutch for choosing number of blocks
	int nblock = 0;
	cudaDeviceGetAttribute(&nblock, cudaDevAttrMultiProcessorCount, 0);
	cudaCheckErrors("get multiprocessor count fail");
	printf("kernel launching with <<<%d,%d>>> \n",nblock, nTPB);
	testkernel<<<nblock, nTPB, 0, streamk>>>(dBuf1, dBuf2, bufrdy1Flag,
			bufrdy2Flag, DSIZE, ITERS);
	cudaCheckErrors("kernel launch fail");
	volatile int *bufrdy;
	int *hbuf, *dbuf;

	for (int i = 0; i < ITERS; i++) {
		if (i%2) {  // ping pong on the host side
			bufrdy = bufrdy2Flag;
			hbuf = hBuf2;
			dbuf = dBuf2;
		} else {
			bufrdy = bufrdy1Flag;
			hbuf = hBuf1;
			dbuf = dBuf1;
		}
		// int qq = 0; // add for failsafe - otherwise a machine failure can hang
		while ((*bufrdy) != 2); // use this for a failsafe:  if (++qq > 1000000) {printf("bufrdy = %d\n", *bufrdy); return 0;} // wait for buffer to be full;
		cudaMemcpyAsync(hbuf, dbuf, DSIZE * sizeof(int), cudaMemcpyDeviceToHost, streamc);
		cudaStreamSynchronize(streamc);
		cudaCheckErrors("cudaMemcpyAsync fail");
		*bufrdy = 0; // release buffer back to device
		if (!validate(hbuf, DSIZE, i)) {
			printf("validation failure at iter %d\n", i);
			exit(1);
		}
	}
	printf("Completed %d iterations successfully\n", ITERS);
}

The weird thing is that if I comment out that first line, test t(10), the whole thing works perfectly. The test class is trivial; I include it here for completeness:
test.h

#ifndef __TEST_H_
#define __TEST_H_
#include <array>

class test {
public:
	test(int a);
private:
	// Sampling the radius estimates - velocities are sampled off an exponential distribution 2^-(x-t)-5
	const double expBase = 2;
	const double expVerticalShift = 5;
	/** The maximum velocity estimate that can be sampled */
	const double maxSampleValue = 8000;
	/** The minimum velocity estimate that can be sampled */
	const double minSampleValue = 0.1;
	const double minSampleX = 0;
	const double maxAccelerationSample = 500000;
	const double timeShift=0;
	const double maxSampleX=0;
	//const double sampleFac=0;
	//static const int numberOfRadiusEstimates = 50;
	//std::array<double, numberOfRadiusEstimates> radiusLUT;

	// Sampling the angle estimates
	//static const int numberOfAngleEstimates = 8;
	//std::array<std::array<double, 3>, numberOfAngleEstimates> angleLUT;

	//Sampling the acceleration estimates
	//static const int numberOfAccelerationEstimates = 1;
	//std::array<double, numberOfAccelerationEstimates> accelerationLUT;

	void init(int a);
};

#endif

test.cpp

#include "test.h"
#include <stdio.h>

test::test(int a) {
	printf("generating class...\n");
	init(a);
}

void test::init(int a) {
	int *arr = new int[a];
	for(int i=0; i<a; i++) {
		arr[i]=a*i;
	}
	for(int i=0; i<a; i++) {
		printf("arr(%d)=%d \n",i,arr[i]);
	}
}

Now for the really weird part: you’ll notice that in test.h I’ve commented out a bunch of variables that I never actually use. Well, if I uncomment any further than that, i.e. if I uncomment line 20 (the sampleFac line), it fails with the error message “Error: Failed to suspend device for CUDA device 0”. As it is now, it works fine.

I really can’t seem to figure this out. Many thanks for your help!
Timo

I assume you meant “If I comment…” rather than “If I uncomment…”; that first line is already uncommented.

I think the problem is likely in something you haven’t shown. I attempted to assemble a working program out of what you had shown plus what was in the SO thread you linked. I had to comment out

*taskComplete = 0;

since that isn’t defined in your main function or in anything you’ve shown, or in the article you linked.

Other than that, the code works fine for me:

$ cat t7.cu
#include <stdio.h>

#define ITERS 1000
#define DSIZE 65536
#define nTPB 256

#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)


__device__ volatile int blkcnt1 = 0;
__device__ volatile int blkcnt2 = 0;
__device__ volatile int itercnt = 0;

__device__ void my_compute_function(int *buf, int idx, int data){
  buf[idx] = data;  // put your work code here
}

__global__ void testkernel(int *buffer1, int *buffer2, volatile int *buffer1_ready, volatile int *buffer2_ready,  const int buffersize, const int iterations){
  // assumption of persistent block-limited kernel launch
  int idx = threadIdx.x+blockDim.x*blockIdx.x;
  int iter_count = 0;
  while (iter_count < iterations ){ // persistent until iterations complete
    int *buf = (iter_count & 1)? buffer2:buffer1; // ping pong between buffers
    volatile int *bufrdy = (iter_count & 1)?(buffer2_ready):(buffer1_ready);
    volatile int *blkcnt = (iter_count & 1)?(&blkcnt2):(&blkcnt1);
    int my_idx = idx;
    while (iter_count - itercnt > 1); // don't overrun buffers on device
    while (*bufrdy == 2);  // wait for buffer to be consumed
    while (my_idx < buffersize){ // perform the "work"
      my_compute_function(buf, my_idx, iter_count);
      my_idx += gridDim.x*blockDim.x; // grid-striding loop
      }
    __syncthreads(); // wait for my block to finish
    __threadfence(); // make sure global buffer writes are "visible"
    if (!threadIdx.x) atomicAdd((int *)blkcnt, 1); // mark my block done
    if (!idx){ // am I the master block/thread?
      while (*blkcnt < gridDim.x);  // wait for all blocks to finish
      *blkcnt = 0;
      *bufrdy = 2;  // indicate that buffer is ready
      __threadfence_system(); // push it out to mapped memory
      itercnt++;
      }
    iter_count++;
    }
}

int validate(const int *data, const int dsize, const int val){

  for (int i = 0; i < dsize; i++) if (data[i] != val) {printf("mismatch at %d, was: %d, should be: %d\n", i, data[i], val); return 0;}
  return 1;
}

#include <array>

class test {
public:
        test(int a);
private:
        // Sampling the radius estimates - velocities are sampled off an exponential distribution 2^-(x-t)-5
        const double expBase = 2;
        const double expVerticalShift = 5;
        /** The maximum velocity estimate that can be sampled */
        const double maxSampleValue = 8000;
        /** The minimum velocity estimate that can be sampled */
        const double minSampleValue = 0.1;
        const double minSampleX = 0;
        const double maxAccelerationSample = 500000;
        const double timeShift=0;
        const double maxSampleX=0;
        //const double sampleFac=0;
        //static const int numberOfRadiusEstimates = 50;
        //std::array<double, numberOfRadiusEstimates> radiusLUT;

        // Sampling the angle estimates
        //static const int numberOfAngleEstimates = 8;
        //std::array<std::array<double, 3>, numberOfAngleEstimates> angleLUT;

        //Sampling the acceleration estimates
        //static const int numberOfAccelerationEstimates = 1;
        //std::array<double, numberOfAccelerationEstimates> accelerationLUT;

        void init(int a);
};

test::test(int a) {
        printf("generating class...\n");
        init(a);
}

void test::init(int a) {
        int *arr = new int[a];
        for(int i=0; i<a; i++) {
                arr[i]=a*i;
        }
        for(int i=0; i<a; i++) {
                printf("arr(%d)=%d \n",i,arr[i]);
        }
}

int main() {
        test t(10);

        int *hBuf1, *dBuf1, *hBuf2, *dBuf2;
        volatile int *bufrdy1Flag, *bufrdy2Flag;
        // buffer and "mailbox" setup
        //Allocate host memory for the buffers and the flags
        cudaHostAlloc(&hBuf1, DSIZE * sizeof(int), cudaHostAllocDefault);
        cudaHostAlloc(&hBuf2, DSIZE * sizeof(int), cudaHostAllocDefault);
        cudaHostAlloc(&bufrdy1Flag, sizeof(int), cudaHostAllocMapped);
        cudaHostAlloc(&bufrdy2Flag, sizeof(int), cudaHostAllocMapped);
        cudaCheckErrors("cudaHostAlloc fail");
        //Allocate device memory
        cudaMalloc(&dBuf1, DSIZE * sizeof(int));
        cudaMalloc(&dBuf2, DSIZE * sizeof(int));
        cudaCheckErrors("cudaMalloc fail");
        //Create the CUDA streams
        cudaStream_t streamk, streamc;
        cudaStreamCreate(&streamk);
        cudaStreamCreate(&streamc);
        cudaCheckErrors("cudaStreamCreate fail");
        *bufrdy1Flag = 0;
        *bufrdy2Flag = 0;
        //*taskComplete = 0;
        //Fill device memory buffers with 255's
        cudaMemset(dBuf1, 0xFF, DSIZE * sizeof(int));
        cudaMemset(dBuf2, 0xFF, DSIZE * sizeof(int));
        cudaCheckErrors("cudaMemset fail");
        // inefficient crutch for choosing number of blocks
        int nblock = 0;
        cudaDeviceGetAttribute(&nblock, cudaDevAttrMultiProcessorCount, 0);
        cudaCheckErrors("get multiprocessor count fail");
        printf("kernel launching with <<<%d,%d>>> \n",nblock, nTPB);
        testkernel<<<nblock, nTPB, 0, streamk>>>(dBuf1, dBuf2, bufrdy1Flag,
                        bufrdy2Flag, DSIZE, ITERS);
        cudaCheckErrors("kernel launch fail");
        volatile int *bufrdy;
        int *hbuf, *dbuf;

        for (int i = 0; i < ITERS; i++) {
                if (i%2) {  // ping pong on the host side
                        bufrdy = bufrdy2Flag;
                        hbuf = hBuf2;
                        dbuf = dBuf2;
                } else {
                        bufrdy = bufrdy1Flag;
                        hbuf = hBuf1;
                        dbuf = dBuf1;
                }
                // int qq = 0; // add for failsafe - otherwise a machine failure can hang
                while ((*bufrdy) != 2); // use this for a failsafe:  if (++qq > 1000000) {printf("bufrdy = %d\n", *bufrdy); return 0;} // wait for buffer to be full;
                cudaMemcpyAsync(hbuf, dbuf, DSIZE * sizeof(int), cudaMemcpyDeviceToHost, streamc);
                cudaStreamSynchronize(streamc);
                cudaCheckErrors("cudaMemcpyAsync fail");
                *bufrdy = 0; // release buffer back to device
                if (!validate(hbuf, DSIZE, i)) {
                        printf("validation failure at iter %d\n", i);
                        exit(1);
                }
        }
        printf("Completed %d iterations successfully\n", ITERS);
}
$ nvcc -std=c++11 -o t7 t7.cu
$ cuda-memcheck ./t7
========= CUDA-MEMCHECK
generating class...
arr(0)=0
arr(1)=10
arr(2)=20
arr(3)=30
arr(4)=40
arr(5)=50
arr(6)=60
arr(7)=70
arr(8)=80
arr(9)=90
kernel launching with <<<56,256>>>
Completed 1000 iterations successfully
========= ERROR SUMMARY: 0 errors
$

You should make sure that you are not running on a GPU that has the timeout enabled. Run the deviceQuery app and see if your GPU is reported as having kernel time limit.
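If you prefer to check this programmatically rather than via deviceQuery, the runtime exposes the same flag through cudaDeviceGetAttribute. A minimal sketch (requires a CUDA-capable machine to compile and run):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Ask the runtime whether the watchdog (kernel run time limit)
    // is enabled on device 0.
    int timeout_enabled = 0;
    cudaError_t err = cudaDeviceGetAttribute(&timeout_enabled,
                                             cudaDevAttrKernelExecTimeout, 0);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaDeviceGetAttribute: %s\n",
                     cudaGetErrorString(err));
        return 1;
    }
    std::printf("Run time limit on kernels: %s\n",
                timeout_enabled ? "Yes" : "No");
    return 0;
}
```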

In the future, my suggestion if you want help is to make it as easy as possible for others to help you. Provide a complete code, in a single listing, and don’t make someone have to assemble pieces from several different listings and links. Condense your reproducer code to be as small as possible, in a single file, and post it in its entirety so that someone like me can copy, paste, compile, and run, and see the issue, without having to add anything or change anything.

In my experience, the symptoms you are reporting are often associated with stack corruption. But I’m not able to spot any obvious problems and can’t reproduce it with the code you have shown.

So… it turns out it does have a kernel time limit. At least that’s how I interpret the following:

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GTX 1070 with Max-Q Design"
  CUDA Driver Version / Runtime Version          9.0 / 9.0
  CUDA Capability Major/Minor version number:    6.1
  Total amount of global memory:                 8192 MBytes (8589934592 bytes)
  (16) Multiprocessors, (128) CUDA Cores/MP:     2048 CUDA Cores
  GPU Max Clock rate:                            1266 MHz (1.27 GHz)
  Memory Clock rate:                             4004 Mhz
  Memory Bus Width:                              256-bit
  L2 Cache Size:                                 2097152 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 2 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  CUDA Device Driver Mode (TCC or WDDM):         WDDM (Windows Display Driver Model)
  Device supports Unified Addressing (UVA):      Yes
  Supports Cooperative Kernel Launch:            No
  Supports MultiDevice Co-op Kernel Launch:      No
  Device PCI Domain ID / Bus ID / location ID:   0 / 1 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 9.0, CUDA Runtime Version = 9.0, NumDevs = 1
Result = PASS

Is there any workaround for this?
Also, thanks for your suggestions on how to make my posts more useful for potential saviors :P Next time I’ll be sure to follow the guidelines you set out. Many thanks!

Is this GPU part of a laptop?

That’s right.

Laptop GPUs may have various hardware designs that allow them to work in concert with the Intel iGPU in these laptops. Getting things working correctly can be especially challenging on Linux, as these laptops mostly ship with, and are designed for use with, Windows.

In any event I wouldn’t be able to give you detailed directions for setup, and it may or may not be possible to get things working correctly without the runtime limit on the dGPU, in Linux, depending on your specific laptop design.

Thanks a lot for your help @txbob. I’m gonna keep trying to make it work for the next few days and I’ll post up what I figured out then.

I just want to know: did you solve this problem? I have the same problem on my machine, which runs Debian buster/sid. I have tried various Linux kernels and CUDA 9.0/9.2/10.0; my CUDA code runs, but it just cannot be debugged using cuda-gdb. It always gives me the error:
[New Thread …(LWP 4908)]
[New Thread …(LWP 4908)]
Error: Failed to suspend device for CUDA device 0, error = CUDBG_ERROR_COMMUNICATION_FAILURE