Hi,
I’m trying to follow the conventional error recovery path (cudaThreadExit()), but cannot get it to work in certain scenarios. A minimal test case is included below.
What I’ve found is that after a kernel crash, cudaThreadExit() succeeds, but the next operation on the same device can fail. This happens only if some memory was allocated on the device before the crash; it does not happen if no memory was allocated beforehand.
The code was run on Ubuntu 8.04 with CUDA 2.2 on an NVIDIA GTX 280. Build instructions: nvcc crash.cu -o crash && ./crash
The logic of the code is as follows:
Step 1: allocate some memory on the device, prior to crashing.
Step 2: crash CUDA using an officially bad kernel.
Step 3: recover after the CUDA crash.
Step 4: allocate some memory on the device after the post-crash recovery.
Step 4 fails, but it succeeds if either step 1 or step 2 is removed from the code.
My goal is to get step 3 of the code to work somehow. I would very much appreciate any advice.
Thanks!
[codebox]
// This file reproduces inability to recover from a kernel crash using
// cudaThreadExit(…) on CUDA v.2.2, when the device memory is allocated prior to
// the crash. The recovery is successful, if the device memory is NOT allocated
// prior to the crash (to reproduce, comment out the section from Step 1 and up
// to Step 2 of the following code). The code succeeds, if the kernel is not
// invoked (to reproduce, comment out the section from Step 2 and up to Step 3 of
// the following code).
// Save this file as crash.cu. To compile on a Linux system, CUDA v.2.2, invoke:
// nvcc crash.cu -o crash && ./crash
#include <stdio.h>
#include <stdlib.h>  // for exit()
#include <cuda.h>
// This kernel crashes CUDA.
__global__ void crash(int* hats) {
  // According to tmurray on Nvidia CUDA board
  // http://forums.nvidia.com/index.php?showtop...entry597519,
  // the following should crash really hard, provided the pointer is really bad.
  *hats = 12;
}  // of crash(…)
int main() {
  cudaError_t status;
  // Step 1: allocate some memory on the device, prior to crashing.
  float* d_pre;
  status = cudaMalloc((void**) &d_pre, 10 * sizeof(float));
  if (cudaSuccess != status) {
    printf("Failed to allocate memory on the device, prior to the crash.\n");
    exit(1);
  }
  // Step 2: crash CUDA using an officially bad kernel.
  static const dim3 GRID(1);
  static const dim3 THREADS(1);
  // This kernel call is expected to crash. The device pointer argument passed
  // to the kernel is a really bad pointer, according to tmurray (see reference
  // above).
  crash<<<GRID, THREADS>>>((int*)0xffffffff);
  // The synchronize call is expected to report an error.
  if (cudaSuccess == cudaThreadSynchronize()) {
    printf("Failed to crash CUDA.\n");
    exit(1);
  }
  printf("Crashed CUDA, this is OK.\n");
  // Step 3: recover after the CUDA crash.
  // The official way to recover after the crash.
  status = cudaThreadExit();
  if (cudaSuccess != status) {
    printf("Failed in cudaThreadExit(..).\n");
    exit(1);
  }
  // Step 4: allocate some memory on the device after the post-crash recovery.
  float* d_post;
  status = cudaMalloc((void**) &d_post, 20 * sizeof(float));
  if (cudaSuccess != status) {
    printf("Failed to allocate memory on the device after the crash.\n");
    exit(1);
  }
  printf("Completing successfully.\n");
  return 0;
}  // of main(…)
[/codebox]
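For diagnosis, it may help to print which error step 4 actually returns, not only that it failed. Below is a minimal sketch that uses the runtime call cudaGetErrorString(). The helper name check() is hypothetical and not part of the test case above.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Hypothetical helper (an assumption, not part of the original test case):
// prints the CUDA error string for a failed runtime call and exits.
static void check(cudaError_t status, const char* what) {
  if (cudaSuccess != status) {
    printf("%s failed: %s\n", what, cudaGetErrorString(status));
    exit(1);
  }
}

int main() {
  float* d_buf;
  // Each runtime call is wrapped so that a failure names the specific error.
  check(cudaMalloc((void**) &d_buf, 20 * sizeof(float)), "cudaMalloc");
  check(cudaFree(d_buf), "cudaFree");
  printf("All calls succeeded.\n");
  return 0;
}
```

Wrapping the step 4 cudaMalloc() in the same helper would show the specific error string instead of a generic failure message.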