Unspecified Launch Failures on CTRL_C: is the driver corrupting contexts?

Hi All,

I am running the simple code below (dotProduct is a misnomer) on the default device of a GTX 295 (not connected to a display).

I run 5 instances of this code in the background (./dotP & ./dotP & ./dotP & ./dotP & ./dotP &).

When I bring one of these processes to the foreground and press CTRL_C, the system hangs for a few seconds and 2 instances die. One of them reports an ULF (unspecified launch failure).

It looks like the driver is corrupting the context of another running instance.

I can positively reproduce this case in my setup here. Sometimes it does not happen on the first CTRL_C, but it eventually happens before all the CTRL_Cs are exhausted (bringing the processes to the foreground (fg) one by one and pressing CTRL_C).

We fear that this driver behavior could exist even in the normal termination path (without pressing CTRL_C).

But we don't have any solid evidence at the moment.
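If it helps anyone experiment with the termination path, here is a minimal sketch of what we have in mind (our own outline, not the repro code below): trap SIGINT, finish the current iteration, and tear the context down explicitly with cudaThreadExit() before exiting, to see whether a clean shutdown still disturbs the other instances.

#include <stdio.h>
#include <signal.h>
#include <cuda_runtime.h>

/* Sketch only: clean-shutdown experiment for the CTRL_C case.           */
/* The handler just sets a flag (CUDA calls are not signal-safe); the    */
/* main loop notices it and destroys the context with cudaThreadExit().  */
static volatile sig_atomic_t gotSigint = 0;

static void onSigint(int sig)
{
    gotSigint = 1;
}

int main(void)
{
    signal(SIGINT, onSigint);

    while (!gotSigint)
    {
        /* ... launch the dotP kernels here, as in the code below ... */
        cudaThreadSynchronize();
    }

    printf("Caught SIGINT, destroying the CUDA context cleanly\n");
    cudaThreadExit();   /* CUDA 3.2-era API for an explicit context teardown */
    return 0;
}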

I request NVIDIA to look into this.

Many THANKS!

System info:

============

Ubuntu Lucid 10.04 x86_64

CUDA 3.2, Driver version: 260.19.26

nvcc -O2 -o dotP dotP.cu

GTX 295

#include <stdio.h>
#include <stdlib.h>

#define MAX_N    (4*1024*1024)
#define NUM_RUNS (10000)

#define ERR_CHECK(cuda_fn) \
    { \
        cudaError_t err = cuda_fn; \
        if (err != cudaSuccess) \
        { \
            printf("CUDA Error: Line %d : %s\n", __LINE__, cudaGetErrorString(err)); \
            exit(-1); \
        } \
    }

// "dotProduct" is a misnomer: the kernel just does arithmetic busywork
// to keep the GPU occupied.
__global__ void dotProductGPU(float *a, float *b, float *c, int N)
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    float a1, b1;
    float sum;

    for (int i = idx; i < N; i += blockDim.x*gridDim.x)
    {
        sum = 0;
        a1 = a[idx];
        b1 = b[idx];
        for (int j = 0; j < 100; j++)
        {
            sum += sqrtf(a1*a1 + b1*b1 - a1*b1 - b1*a1);
        }
        c[idx] = sum;
    }
    return;
}

int doDotProduct(void)
{
    float *aGPU, *bGPU, *cGPU;

    printf("Doing Dot Product\n");

    ERR_CHECK(cudaMalloc(&aGPU, MAX_N*sizeof(float)));
    ERR_CHECK(cudaMalloc(&bGPU, MAX_N*sizeof(float)));
    ERR_CHECK(cudaMalloc(&cGPU, MAX_N*sizeof(float)));

    //cudaMemset(aGPU, 0, MAX_N*sizeof(float));
    //cudaMemset(bGPU, 0, MAX_N*sizeof(float));

    for (int i = 0; i < NUM_RUNS; i++)
    {
        dotProductGPU<<<1000, 96>>>(aGPU, bGPU, cGPU, MAX_N);
    }
    ERR_CHECK(cudaThreadSynchronize());

    ERR_CHECK(cudaFree(aGPU));
    ERR_CHECK(cudaFree(bGPU));
    ERR_CHECK(cudaFree(cGPU));

    return 0;
}

int main(void)
{
    doDotProduct();
    return 0;
}
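One small variant worth noting (a sketch, not what we originally ran): checking cudaGetLastError() after each launch, so a failure is reported at the offending iteration instead of only at the final synchronize.

    // Variant of the launch loop above (sketch): report errors per launch.
    for (int i = 0; i < NUM_RUNS; i++)
    {
        dotProductGPU<<<1000, 96>>>(aGPU, bGPU, cGPU, MAX_N);
        ERR_CHECK(cudaGetLastError());        // launch-configuration errors
        ERR_CHECK(cudaThreadSynchronize());   // errors from the kernel itself (e.g. the ULF)
    }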

Is anyone able to reproduce the problem above?

NVIDIA,
Can you reproduce this issue? I am eager to know the other scenarios that can expose this bug… (if it is indeed a bug in the driver)

What you are seeing sounds suspiciously like this, although I had it in my head that it was fixed in the 3.2 driver cycle.

Maybe… But our code does not depend on MATLAB or exclusive mode…
Let's see if NVIDIA can reproduce the issue…

Elsewhere, SPWorley (in the Cudesnick thread on system reboots with CUDA on Linux, which was resolved to be a PSU issue) had mentioned that the CUDA drivers had a problem with multiple GPU applications running simultaneously. But he said that NVIDIA resolved those issues long ago.

I think this problem pertains to that family.

In any case, I observed even wilder results today. I was running many instances of SDK samples (radixSort, simpleCUBLAS, convolution and other custom code) alongside this "dotP" test. None of the dotP tests failed, although the system went temporarily unresponsive a couple of times… but it caused seg-faults and ULFs in radixSort, simpleCUBLAS and the other custom code.

I have raised a bug report with NVIDIA. Let us see what happens.

Anyone?

It’s this bug I posted about last year. It’s still outstanding.

It’s pretty annoying in practice, since it means you can’t do development on a machine when you have important background compute happening on other GPUs.
If you ^C out of your test kernel, your other processes will die, losing your work.

Steve,

Thanks for this! I ran your code on both GPUs of a single GTX 295 (single process) and ran "nvidia-smi -a" at the same time.

Here is the result of "nvidia-smi". We were also running a lot of other GPU apps in the background…

But we have never seen such baffling nvidia-smi output before. Look at the GPU usage.

“nvidia-smi -a” appeared hung until the 2 processes exited.

system: NVIDIA CUDA 3.2, 260.19.26, x86_64, Ubuntu 10.04 Lucid, GTX 295

Steve,

I am listing all the observations I have made regarding your test case (thanks for this one):

I was able to reproduce your problem on my machine at the first attempt.

I changed the code to analyze the “error” code and print the error string.

The machine config is the same (Ubuntu 10.04 Lucid, 64-bit, CUDA 3.2, 260.19.26).

We were able to reproduce the ULF using a "cudaMalloc" approach (each thread allocates its own device memory) instead of portable pinned memory. This time no other GPU instances were running and the system was fresh after a reboot. So I think this problem has nothing to do with pinned memory.
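For clarity, this is roughly the difference between the two allocation paths we compared (my own sketch, reusing the ERR_CHECK macro and MAX_N from the code in the first post; SPWorley's actual code may differ):

    // (a) Portable pinned host memory, shared across host threads/contexts:
    float *hostBuf = NULL;
    ERR_CHECK(cudaHostAlloc((void**)&hostBuf, MAX_N * sizeof(float),
                            cudaHostAllocPortable));

    // (b) The plain per-thread device allocation we switched to, with no
    //     pinned memory involved:
    float *devBuf = NULL;
    ERR_CHECK(cudaMalloc((void**)&devBuf, MAX_N * sizeof(float)));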

We also changed the code to print the "kernelExecTimeoutEnabled" value from the CUDA device properties, and it is disabled (0) for both GPUs. The ULF happens on either GPU of the GTX 295.
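For reference, the property query was roughly this (a sketch using the standard runtime API and the ERR_CHECK macro from the first post; the real code prints a few more fields):

    // Sketch of the device-property check.
    int devCount = 0;
    ERR_CHECK(cudaGetDeviceCount(&devCount));
    for (int d = 0; d < devCount; d++)
    {
        cudaDeviceProp prop;
        ERR_CHECK(cudaGetDeviceProperties(&prop, d));
        printf("Device %d (%s): kernelExecTimeoutEnabled = %d\n",
               d, prop.name, prop.kernelExecTimeoutEnabled);
    }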

Check the failing log here:

  1. I tailored the code further not to use pthreads and to run on only one device. This code is a bit more stable than the threaded counterpart… but it finally failed with the ULF below.

(tried after a fresh reboot of the machine)

We also tried adding "cudaThreadExit()" at the end of localThreadFunction(), but that did not help.

The ULF still happens (tried after a fresh reboot of the machine).
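To be precise about what we changed (only an outline, since localThreadFunction() belongs to SPWorley's test code, which is not reproduced here):

// Outline only: placement of the cudaThreadExit() call we added.
// The body of localThreadFunction() is elided.
void *localThreadFunction(void *arg)
{
    int dev = *(int *)arg;
    cudaSetDevice(dev);

    /* ... allocate, launch kernels, cudaThreadSynchronize() ... */

    cudaThreadExit();   // explicit context teardown before the thread returns
    return NULL;
}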

I was able to reproduce the problem on a single-GPU system, a Tesla C1060 (after a fresh reboot).

We used the non-threaded version of the SPWorley test, running multiple instances on device 0 (the only GPU in the system). Even this test failed.

Strange!

All,

I somehow suspect that Steve's repro case will pass on multi-GPU Tesla-based systems.
If any of you have access to one, could you please try it and let us know?

We are trying to get an Amazon GPU instance, but that will take a few days…
Would love to hear an interim update from any of you… Thanks in advance!

Thanks,
Best regards,
Sarnath

The SPWorley test fails even on a single-GPU (Tesla C1060) system. Please see 2 posts above (I have edited that post).

I submitted a bug report at https://nvdeveloper.nvidia.com/
It has been a week now and there has been no response from NVIDIA…
Is this normal? Or have I submitted it to some irrelevant site?