Computer crash due to atomicAdd. Why?

Here is a very small piece of code. It does nothing but crash the computer; only the reset button helps. Sometimes I get an “unspecified launch failure” instead. If I uncomment line #15 (s[0] = 0;), it neither crashes nor gives the “unspecified launch failure”. Can anybody explain why?

CUDA 8 (same results with 7.5). GPU: GTX 1070 (8 GB). CPU: i7-6700K 4 GHz, 32 GB RAM.

Built in Release configuration, x64.

#include <cuda_runtime_api.h>
#include <cuda_runtime.h>
#include <device_functions.h>
#include <device_launch_parameters.h>
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

void crash();

__global__
void crashkernel(int * nothing)
{
	__shared__ int s[1];
	//s[0] = 0;
	__syncthreads();
	atomicAdd(s + 0, 1);
}

int main()
{
	crash();
	system("pause");
	return 0;
}

void crash()
{
	int w = 1024;
	int h = 1024;
	for (int i = 0; i < 10000; i++)
	{
		dim3 grid(w / 32, h / 32);
		dim3 block = (1024);
	crashkernel<<<grid, block>>>(0);
		cudaError_t cudaStatus = cudaGetLastError();
		if (cudaStatus != cudaSuccess)
		{
			fprintf(stderr, "addKernel launch failed: %s\n", cudaGetErrorString(cudaStatus));
			system("pause");
		}
		cudaDeviceSynchronize();
		fprintf(stderr, "%d\r\n", i);
	}
}
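
One note on the error checking above: the status returned by cudaDeviceSynchronize() is discarded, and an “unspecified launch failure” is typically reported by the synchronize call (or a later runtime call) rather than by cudaGetLastError() immediately after an asynchronous launch. A minimal sketch of the extra check, keeping the same structure as the loop above:

cudaError_t syncStatus = cudaDeviceSynchronize();
if (syncStatus != cudaSuccess)
{
	// kernel execution errors surface here, after the kernel has actually run
	fprintf(stderr, "crashkernel execution failed: %s\n", cudaGetErrorString(syncStatus));
	system("pause");
}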

It’s evident you are on Windows. My guess would be a WDDM TDR timeout. You may wish to Google that.

NVIDIA Nsight options:

WDDM TDR delay 2
WDDM TDR enabled true

My current guess is a hardware or driver problem. I will check the code on another configuration (Quadro P4000, dual Xeon).

I tried running the code and I can reproduce a GPU kernel hang on a Pascal Titan X, driver 375.66, CUDA 8.0.61, Ubuntu 14.04.

So I would agree there is something strange here. I don’t have any further info yet.

As near as I can tell this is being caused by code generation in ptxas, and appears to be fixed in CUDA 9, or a newer driver, depending on how your SASS is being generated.

I ran a test with CUDA 7.5 and a GTX 960, and was able to reproduce the issue with the following SASS, generated by:

$ /usr/local/cuda-7.5/bin/nvcc -arch=sm_52 -o t349 t349.cu
$ cuobjdump -sass ./t349
                ...........................
        code for sm_52
                Function : _Z11crashkernelv
        .headerflags    @"EF_CUDA_SM52 EF_CUDA_PTX_SM(EF_CUDA_SM52)"
                                                                                  /* 0x001fc000ffe007f6 */
        /*0008*/                   MOV R1, c[0x0][0x20];                          /* 0x4c98078000870001 */
        /*0010*/                   BAR.SYNC 0x0;                                  /* 0xf0a81b8000070000 */
        /*0018*/         {         PSETP.AND.AND P0, PT, PT, PT, PT;              /* 0x50900380e0077007 */
        /*0028*/                   S2R R2, SR_LANEID;        }                    /* 0x00044800fe40070d */
                                                                                  /* 0xf0c8000000070002 */
        /*0030*/                   VOTE.ALL R3, PT, P0;                           /* 0x50d8e00000070003 */
        /*0038*/                   FLO.U32 R0, R3;                                /* 0x5c30000000370000 */
                                                                                  /* 0x0087c401fda0075d */
        /*0048*/                   POPC R3, R3;                                   /* 0x5c08000000370003 */
        /*0050*/                   ISETP.EQ.U32.AND P0, PT, R0, R2, PT;           /* 0x5b64038000270007 */
        /*0058*/               @P0 ATOMS.ADD RZ, [RZ], R3;                        /* 0xec0000000030ffff */
                                                                                  /* 0x001f8000ffe007ff */
        /*0068*/                   EXIT;                                          /* 0xe30000000007000f */
        /*0070*/                   BRA 0x70;                                      /* 0xe2400fffff87000f */
        /*0078*/                   NOP;                                           /* 0x50b0000000070f00 */
                ...........................

In this scenario, the kernel would eventually hang, usually after a few thousand iterations.
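
For what it’s worth, the VOTE / FLO / POPC sequence followed by the predicated ATOMS.ADD looks like a warp-aggregated atomic: ptxas appears to elect a single leader lane per warp and issue one shared-memory atomic carrying the whole warp’s count. A rough source-level sketch of that idiom (my reconstruction using the pre-CUDA 9 warp intrinsics, not the actual compiler output):

__global__ void aggregatedAddSketch()
{
	__shared__ int s[1];
	__syncthreads();

	// each warp elects a leader and issues one atomic for the whole warp
	unsigned int active = __ballot(1);   // ballot of participating lanes (VOTE)
	int leader = 31 - __clz(active);     // highest participating lane as leader (FLO)
	int count  = __popc(active);         // number of participating lanes (POPC)
	int lane   = threadIdx.x & 31;       // lane id (S2R SR_LANEID)
	if (lane == leader)                  // ISETP / @P0
		atomicAdd(&s[0], count);         // single aggregated atomic (ATOMS.ADD)
}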

Next I compiled with CUDA 9 EA:

$ /usr/local/cuda-9.0/bin/nvcc -arch=sm_52 -o t349 t349.cu
$ cuobjdump -sass ./t349
                ...........................
        code for sm_52
                Function : _Z11crashkernelv
        .headerflags    @"EF_CUDA_SM52 EF_CUDA_PTX_SM(EF_CUDA_SM52)"
                                                                                  /* 0x001c4800fe0007f6 */
        /*0008*/                   MOV R1, c[0x0][0x20];                          /* 0x4c98078000870001 */
        /*0010*/         {         VOTE.ALL R0, PT, PT;                           /* 0x50d8e38000070000 */
        /*0018*/                   S2R R4, SR_LANEID;        }                    /* 0xf0c8000000070004 */
                                                                                  /* 0x001ffc00e6200711 */
        /*0028*/                   FLO.U32 R3, R0;                                /* 0x5c30000000070003 */
        /*0030*/                   POPC R2, R0;                                   /* 0x5c08000000070002 */
        /*0038*/                   BAR.SYNC 0x0;                                  /* 0xf0a81b8000070000 */
                                                                                  /* 0x001ffc021e200fed */
        /*0048*/                   ISETP.EQ.U32.AND P0, PT, R3, R4, PT;           /* 0x5b64038000470307 */
        /*0050*/               @P0 ATOMS.ADD RZ, [RZ], R2;                        /* 0xec0000000020ffff */
        /*0058*/                   EXIT;                                          /* 0xe30000000007000f */
                                                                                  /* 0x001f8000fc0007ff */
        /*0068*/                   BRA 0x60;                                      /* 0xe2400fffff07000f */
        /*0070*/                   NOP;                                           /* 0x50b0000000070f00 */
        /*0078*/                   NOP;                                           /* 0x50b0000000070f00 */
                ...........................

In this scenario, after many test runs, the code would always complete 10000 iterations.

A few other notes:

If I compile with CUDA 7.5, but force a JIT operation:

$ /usr/local/cuda-7.5/bin/nvcc -arch=sm_30 -o t349 t349.cu

Then this always completes successfully, since I have the 384.27 driver on this machine (the CUDA 9 EA driver) and the SASS is generated at runtime by that driver’s JIT compiler.

Likewise, if I compile in any scenario with -G, the code always completes successfully.
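
(For reference, that is a build of the form:

$ /usr/local/cuda-7.5/bin/nvcc -G -arch=sm_52 -o t349 t349.cu

i.e. the same command as above with device debug code generation enabled, which disables device code optimization.)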

So you can file a bug if you want, but the usual first step is to check whether it is fixed in a newer driver/CUDA toolkit. In this case it appears to be, and the CUDA 9 release should occur over the next several months.

You’re welcome to look for a newer driver with the fix, if you want to, but taking advantage of it will require that you compile to PTX that forces a JIT compile on your GPU.
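
For example, a command along these lines (hypothetical; the file name is a placeholder, and sm_61/compute_61 is assumed for your GTX 1070) embeds only PTX, so the driver must JIT-compile the SASS at load time:

$ nvcc -gencode arch=compute_61,code=compute_61 -o crashtest crashtest.cu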

And if you have access to CUDA 9 EA, I would suggest testing with that.

All of the tests in this post were on Fedora 20, GTX 960, driver 384.27.

Finally, I have only done testing on Linux, so it’s possible that your Windows behavior/observations may be different.

Thank you.

Could you explain why the kernel hangs in the first scenario? And why does it crash the whole system?

I can’t explain either at the moment. The first one, I suspect, is due to some instruction sequencing issue. The second may be a result of whatever hang condition this creates not recovering well under the TDR mechanism.

Just speculation; I don’t even know whether what I’m discussing here has any connection to your Windows observations.