Unspecified Launch Failures on CTRL_C: is the driver corrupting contexts?

Hi All,

I am running the simple code below (dotProduct is a misnomer) on the default device of a GTX 295 (not connected to a display).

I run 5 instances of this code in the background (./dotP & ./dotP & ./dotP & ./dotP & ./dotP &).

When I bring one of these processes to the foreground and press CTRL_C, the system hangs for a few seconds and 2 instances die. One of them reports an ULF (unspecified launch failure).

It looks like the driver is corrupting the context of another running instance.

I can positively reproduce this case in my setup here. Sometimes it does not happen on the first CTRL_C, but it eventually happens before all the CTRL_Cs are exhausted (bringing the processes to the foreground (fg) one by one and pressing CTRL_C).

We fear that this driver behavior could exist even in the normal termination path (without pressing CTRL_C).

But we don't have any solid evidence at the moment.
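If it helps anyone experiment with the termination path, here is a minimal sketch of what we have in mind (our own outline, not the repro code below): trap SIGINT, finish the current iteration, and tear the context down explicitly with cudaThreadExit() before exiting, to see whether a clean shutdown still disturbs the other instances.

#include <stdio.h>
#include <signal.h>
#include <cuda_runtime.h>

/* Sketch only: clean-shutdown experiment for the CTRL_C case.           */
/* The handler just sets a flag (CUDA calls are not signal-safe); the    */
/* main loop notices it and destroys the context with cudaThreadExit().  */
static volatile sig_atomic_t gotSigint = 0;

static void onSigint(int sig)
{
    gotSigint = 1;
}

int main(void)
{
    signal(SIGINT, onSigint);

    while (!gotSigint)
    {
        /* ... launch the dotP kernels here, as in the code below ... */
        cudaThreadSynchronize();
    }

    printf("Caught SIGINT, destroying the CUDA context cleanly\n");
    cudaThreadExit();   /* CUDA 3.2-era API for an explicit context teardown */
    return 0;
}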

I request NVIDIA to look into this.

Many THANKS!

System info:

============

Ubuntu Lucid 10.04 x86_64

CUDA 3.2, Driver version: 260.19.26

nvcc -O2 -o dotP dotP.cu

GTX 295

#include <stdio.h>
#include <stdlib.h>

#define MAX_N    (4*1024*1024)
#define NUM_RUNS (10000)

#define ERR_CHECK(cuda_fn) \
    { \
        cudaError_t err = cuda_fn; \
        if (err != cudaSuccess) \
        { \
            printf("CUDA Error: Line %d : %s\n", __LINE__, cudaGetErrorString(err)); \
            exit(-1); \
        } \
    }

// "dotProduct" is a misnomer: the kernel just does arithmetic busywork
// to keep the GPU occupied.
__global__ void dotProductGPU(float *a, float *b, float *c, int N)
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    float a1, b1;
    float sum;

    for (int i = idx; i < N; i += blockDim.x*gridDim.x)
    {
        sum = 0;
        a1 = a[idx];
        b1 = b[idx];
        for (int j = 0; j < 100; j++)
        {
            sum += sqrtf(a1*a1 + b1*b1 - a1*b1 - b1*a1);
        }
        c[idx] = sum;
    }
    return;
}

int doDotProduct(void)
{
    float *aGPU, *bGPU, *cGPU;

    printf("Doing Dot Product\n");

    ERR_CHECK(cudaMalloc(&aGPU, MAX_N*sizeof(float)));
    ERR_CHECK(cudaMalloc(&bGPU, MAX_N*sizeof(float)));
    ERR_CHECK(cudaMalloc(&cGPU, MAX_N*sizeof(float)));

    //cudaMemset(aGPU, 0, MAX_N*sizeof(float));
    //cudaMemset(bGPU, 0, MAX_N*sizeof(float));

    for (int i = 0; i < NUM_RUNS; i++)
    {
        dotProductGPU<<<1000, 96>>>(aGPU, bGPU, cGPU, MAX_N);
    }
    ERR_CHECK(cudaThreadSynchronize());

    ERR_CHECK(cudaFree(aGPU));
    ERR_CHECK(cudaFree(bGPU));
    ERR_CHECK(cudaFree(cGPU));

    return 0;
}

int main(void)
{
    doDotProduct();
    return 0;
}
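One small variant worth noting (a sketch, not what we originally ran): checking cudaGetLastError() after each launch, so a failure is reported at the offending iteration instead of only at the final synchronize.

    // Variant of the launch loop above (sketch): report errors per launch.
    for (int i = 0; i < NUM_RUNS; i++)
    {
        dotProductGPU<<<1000, 96>>>(aGPU, bGPU, cGPU, MAX_N);
        ERR_CHECK(cudaGetLastError());        // launch-configuration errors
        ERR_CHECK(cudaThreadSynchronize());   // errors from the kernel itself (e.g. the ULF)
    }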

Is anyone able to reproduce the problem above?

NVIDIA,
Can you reproduce this issue? I am eager to know the other scenarios that can expose this bug… (if it is indeed a bug in the driver)

What you are seeing sounds suspiciously like this, although I had it in my head that it was fixed in the 3.2 driver cycle.

Maybe… But our code does not depend on MATLAB or exclusive mode…
Let's see if NVIDIA can reproduce the issue…

Elsewhere, SPWorley (in the Cudesnick thread on system reboots with CUDA on Linux, which was resolved to be a PSU issue) had mentioned that the CUDA drivers had a problem with multiple GPU applications running simultaneously. But he said that NVIDIA resolved those issues long ago.

I think this problem pertains to that family.

In any case, I observed even wilder results today. I was running many instances of SDK samples (radixSort, simpleCUBLAS, convolution and other custom code) alongside this "dotP" test. None of the dotP tests failed, although the system went temporarily unresponsive a couple of times… but it caused seg-faults and ULFs in radixSort, simpleCUBLAS and the other custom code.

I have raised a bug report with NVIDIA. Let us see what happens.

Anyone?

It’s this bug I posted about last year. It’s still outstanding.

It’s pretty annoying in practice, since it means you can’t do development on a machine when you have important background compute happening on other GPUs.
If you ^C out of your test kernel, your other processes will die, losing your work.

Steve,

Thanks for this! I ran your code on both GPUs of a single GTX 295 (single process) and ran "nvidia-smi -a" at the same time.

Here is the result of "nvidia-smi". We were also running a lot of other GPU apps in the background…

But we have never seen such baffling nvidia-smi output before. Look at the GPU usage.

“nvidia-smi -a” appeared hung until the 2 processes exited.

system: NVIDIA CUDA 3.2, 260.19.26, x86_64, Ubuntu 10.04 Lucid, GTX 295

Steve,

I am listing all the observations I have made regarding your test case (thanks for this one):

I was able to reproduce your problem on my machine at the first attempt.

I changed the code to analyze the “error” code and print the error string.

The machine config is the same (Ubuntu 10.04 Lucid, 64-bit, CUDA 3.2, 260.19.26).

We were able to reproduce the ULF using a "cudaMalloc" approach (each thread allocates its own device memory) instead of portable pinned memory. This time no other GPU instances were running and the system was fresh after a reboot. So I think this problem has nothing to do with pinned memory.
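For clarity, this is roughly the difference between the two allocation paths we compared (my own sketch, reusing the ERR_CHECK macro and MAX_N from the code in the first post; SPWorley's actual code may differ):

    // (a) Portable pinned host memory, shared across host threads/contexts:
    float *hostBuf = NULL;
    ERR_CHECK(cudaHostAlloc((void**)&hostBuf, MAX_N * sizeof(float),
                            cudaHostAllocPortable));

    // (b) The plain per-thread device allocation we switched to, with no
    //     pinned memory involved:
    float *devBuf = NULL;
    ERR_CHECK(cudaMalloc((void**)&devBuf, MAX_N * sizeof(float)));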

We also changed the code to print the "kernelExecTimeoutEnabled" value from the CUDA device properties, and it is disabled (0) for both GPUs. The ULF happens on either GPU of the GTX 295.
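For reference, the property query was roughly this (a sketch using the standard runtime API and the ERR_CHECK macro from the first post; the real code prints a few more fields):

    // Sketch of the device-property check.
    int devCount = 0;
    ERR_CHECK(cudaGetDeviceCount(&devCount));
    for (int d = 0; d < devCount; d++)
    {
        cudaDeviceProp prop;
        ERR_CHECK(cudaGetDeviceProperties(&prop, d));
        printf("Device %d (%s): kernelExecTimeoutEnabled = %d\n",
               d, prop.name, prop.kernelExecTimeoutEnabled);
    }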

Check the failing log here:

  1. I tailored the code further not to use pthreads and to run on only one device. This code is a bit more stable than the threaded counterpart… but it finally failed with the ULF below.

(tried after a fresh reboot of the machine)

We also tried adding "cudaThreadExit()" at the end of localThreadFunction(), but that did not help.

The ULF still happens (tried after a fresh reboot of the machine).
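To be precise about what we changed (only an outline, since localThreadFunction() belongs to SPWorley's test code, which is not reproduced here):

// Outline only: placement of the cudaThreadExit() call we added.
// The body of localThreadFunction() is elided.
void *localThreadFunction(void *arg)
{
    int dev = *(int *)arg;
    cudaSetDevice(dev);

    /* ... allocate, launch kernels, cudaThreadSynchronize() ... */

    cudaThreadExit();   // explicit context teardown before the thread returns
    return NULL;
}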

I was able to reproduce the problem on a single-GPU system, a Tesla C1060 (after a fresh reboot).

We used the non-threaded version of the SPWorley test, running multiple instances on device 0 (the only GPU in the system). Even this test failed.

Strange!

All,

I somehow suspect that Steve's repro case will pass on multi-GPU Tesla-based systems.
If any of you have access to one, could you please try it and let us know?

We are trying to get an Amazon GPU instance, but that will take a few days…
Would love to hear an interim update from any of you… Thanks in advance!

Thanks,
Best regards,
Sarnath

The SPWorley test fails even on a single-GPU (Tesla C1060) system. Please see 2 posts above (I have edited that post).

I submitted a bug report at https://nvdeveloper.nvidia.com/
It has been a week now and there has been no response from NVIDIA…
Is this normal? Or have I submitted it to some irrelevant site?