cudaThreadExit not working Bug Report

Hi,

I’m trying to follow the conventional error-recovery path (cudaThreadExit()) and cannot get it to work in certain scenarios. A minimal test case is included below.

What I’ve found is that after a kernel crash, cudaThreadExit() succeeds, but the next operation on the same device can fail. This happens if some memory was allocated on the device prior to the crash, but not if no memory was allocated beforehand.

The code was run on Ubuntu 8.04, CUDA 2.2, NVIDIA GTX 280. Build and run: nvcc crash.cu -o crash && ./crash

The logic of the code is as follows:

Step 1: allocate some memory on the device, prior to crashing.

Step 2: crash CUDA using an officially bad kernel.

Step 3: recover after the CUDA crash.

Step 4: allocate some memory on the device after the post-crash recovery.

Step 4 of this code fails, but it succeeds if either step 1 or step 2 is removed.

My goal is to get step 3 of the code to work somehow. I would very much appreciate any advice.

Thanks!

[codebox]

// This file reproduces inability to recover from a kernel crash using
// cudaThreadExit(...) on CUDA v.2.2, when device memory is allocated prior to
// the crash. The recovery is successful if the device memory is NOT allocated
// prior to the crash (to reproduce, comment out the section from Step 1 and up
// to Step 2 of the following code). The code succeeds if the kernel is not
// invoked (to reproduce, comment out the section from Step 2 and up to Step 3 of
// the following code).
// Save this file as crash.cu. To compile on a Linux system, CUDA v.2.2, invoke:
// nvcc crash.cu -o crash && ./crash

#include <stdio.h>
#include <cuda.h>

// This kernel crashes CUDA.
__global__ void crash(int* hats) {
  // According to tmurray on the NVIDIA CUDA board,
  // http://forums.nvidia.com/index.php?showtop...entry597519,
  // the following should crash really hard, provided the pointer is really bad.
  *hats = 12;
} // of crash(...)

int main() {
  cudaError_t status;

  // Step 1: allocate some memory on the device, prior to crashing.
  float* d_pre;
  status = cudaMalloc((void**) &d_pre, 10 * sizeof(float));
  if (cudaSuccess != status) {
    printf("Failed to allocate memory on the device, prior to the crash.\n");
    exit(1);
  }

  // Step 2: crash CUDA using an officially bad kernel.
  static const dim3 GRID(1);
  static const dim3 THREADS(1);
  // This kernel call is expected to crash. The device pointer argument passed
  // to the kernel is a really bad pointer, according to tmurray (see reference
  // above).
  crash<<<GRID, THREADS>>>((int*)0xffffffff);
  // This is expected to fire.
  if (cudaSuccess == cudaThreadSynchronize()) {
    printf("Failed to crash CUDA.\n");
    exit(1);
  }
  printf("Crashed CUDA, this is OK.\n");

  // Step 3: recover after the CUDA crash.
  // The official way to recover after the crash.
  status = cudaThreadExit();
  if (cudaSuccess != status) {
    printf("Failed in cudaThreadExit(..).\n");
    exit(1);
  }

  // Step 4: allocate some memory on the device after the post-crash recovery.
  float* d_post;
  status = cudaMalloc((void**) &d_post, 20 * sizeof(float));
  if (cudaSuccess != status) {
    printf("Failed to allocate memory on the device after the crash.\n");
    exit(1);
  }

  printf("Completing successfully.\n");
  return 0;
} // of main(...)

[/codebox]

Dear NVIDIA engineers, would you please comment on the above post?

Specifically, I would like to know how I can re-initialize and continue using the device from the same executable instance in which the CUDA kernel crash occurred.

When I search for cudaThreadExit on Google, the very first link that comes up is:

http://forums.nvidia.com/index.php?showtopic=97490

in which Sarnath reports a similar (if not exactly the same) problem on this message board. That post was essentially left unanswered, perhaps because Sarnath’s test case was not as simple as mine.

Given that the CUDA environment is not 100% stable (e.g. I can reliably crash my program every time I restart the X server on this computer, even though that server does not control the device my code runs on), it would be very nice to have a working re-initialization strategy. Since re-initialization of the device succeeds when performed from a RESTARTED executable (no reboot required between the CUDA crash and the restart), implementing the cleanup policy within the same executable instance where the crash occurred seems possible.

Thanks!

My experience is similar.

I did some testing, and here is what I found:

1. Bad kernel + allocated memory

If you can free all the allocated memory, then calling cudaThreadExit() will work.

2. Texture usage

If you have used textures in your code, then cudaThreadExit() will not work, regardless of whether you have memory allocated or not.

3. Constant memory usage

Constant memory usage does not affect the thread recovery.
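Finding 1 above might be sketched as a small change to the original repro: free every prior device allocation before calling cudaThreadExit(). This is untested speculation based on the observation, not a confirmed fix:

```cuda
// Sketch (assumption per finding 1): free ALL device allocations before
// cudaThreadExit(), then let the next runtime call create a fresh context.
// Save as recover.cu; build with: nvcc recover.cu -o recover
#include <stdio.h>
#include <cuda.h>

// Deliberately bad store, as in the original repro.
__global__ void crash(int* hats) { *hats = 12; }

int main() {
  float* d_pre;
  if (cudaSuccess != cudaMalloc((void**) &d_pre, 10 * sizeof(float))) return 1;

  crash<<<1, 1>>>((int*)0xffffffff);   // crash the context
  cudaThreadSynchronize();             // surface the error

  cudaFree(d_pre);                     // free every prior allocation FIRST
  if (cudaSuccess != cudaThreadExit()) {
    printf("cudaThreadExit() still failed.\n");
    return 1;
  }

  // The next runtime call implicitly creates a fresh context.
  float* d_post;
  if (cudaSuccess != cudaMalloc((void**) &d_post, 20 * sizeof(float))) {
    printf("Post-recovery cudaMalloc failed.\n");
    return 1;
  }
  printf("Recovered successfully.\n");
  cudaFree(d_post);
  return 0;
}
```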

In my multi-GPU code I used texture memory, and there was no way to recover within the thread. So I had to exit the thread, create a new one, and attach it to the same GPU. This solved my problem.

-gshi

I have a similar problem …