"unspecified driver error"

Hi,

Is there any way to get more information about an “unspecified driver error” message returned by cudaGetErrorString? I get this message during a certain device-to-host memcpy. Sometimes the code works as expected, sometimes the error message is produced, and other times the host immediately reboots. I’ve verified in the emulator that my kernel isn’t writing outside of allocated memory or doing anything obviously wrong, so I need more information about the error condition.

Thanks-
Abe

Can you provide a simple app to reproduce the problem?

Thanks,
Mark

I am having similar problems with the x264 codec I am experimenting with. A cudaMemcpy() call with cudaMemcpyDeviceToHost fails with the same error message when running on the GPU; in emulation mode it works fine.

I don’t really want to upload the whole tar file because of its size, but I would be happy to send it to anyone investigating the problem. The function looks like the code below; the error occurs on the first of the four cudaMemcpy() calls at the end of the fragment: “Cuda error in file ‘common/cuda/mc.cu’ in line 104 : unspecified driver error.”

In my case, the cudaMemcpyDeviceToHost fails consistently. Could very well be something I am doing wrong (newbie) but I haven’t figured out what yet.

extern "C" void x264_frame_init_lowres_nVidia( x264_t *h, x264_frame_t *frame ) {

   int i = 0;

    int i_padv = PADV << h->param.b_interlaced;

    float exec_time;

    unsigned int timer = 0;

   /* size of  frame data (+64 for extra data for me) - code borrowed from frame.c */

   CUT_SAFE_CALL( cutCreateTimer( &timer));

    CUT_CHECK_DEVICE();

    CUT_SAFE_CALL( cutStartTimer(timer));

   /* Allocate a copy of the frame to be initialized on the

     device. Only allocate plane[0] since the lowres init algorithm

     only uses plane[0]

    */

   int i_divh = 1;	// I420 only

    int i_divw = 1;

   /* allocate and copy plane[0] to GPU */

   uint8_t  *d_buffer = NULL;

    unsigned int d_bufsize = frame->i_stride[0] * (frame->i_lines[0] + 2*i_padv/i_divh);

   CUDA_SAFE_CALL( cudaMalloc( (void**) &d_buffer, d_bufsize ));

   uint8_t *d_plane0 = ((uint8_t*)d_buffer) + 

                               frame->i_stride[0] * i_padv / i_divh + PADH / i_divw;

   CUDA_SAFE_CALL( cudaMemcpy( d_buffer, 

    frame->buffer[0], 

    d_bufsize, 

    cudaMemcpyHostToDevice) );

   /*	Allocate lowres buffer of the right size */

   uint8_t* d_lowres_plane[4];

    uint8_t* d_lowres_buffer[4];

    const int i_lowres_buffer_size = frame->i_stride_lowres * 

                                     (frame->i_lines[0]/2 + 2*i_padv);

   for (i=0; i<4; i++) {

     CUDA_SAFE_CALL( cudaMalloc( (void **) &d_lowres_buffer[i], i_lowres_buffer_size) );

      d_lowres_plane[i] = ((uint8_t *) (d_lowres_buffer[i])) + frame->i_stride_lowres * i_padv + PADH;

    }

   /* 

       Set up the actual call to the GPU to propagate plane[0] to lowres[]

      Total number of threads required = (frame->istride_lowres-64) * 

                                          frame->i_lines_lowres

    */

   unsigned int required_threads = (frame->i_stride_lowres - 64)*frame->i_lines_lowres;

   const unsigned int threads_per_block = 256;

    unsigned int blocks_required = ((required_threads + threads_per_block -1) & (-threads_per_block))/threads_per_block;

   dim3 threads( threads_per_block, 1);

    dim3 grid(blocks_required, 1);

    x264_frame_init_lowres_kernel<<< grid, threads>>>( d_plane0, 

             d_lowres_plane, 

             frame->i_stride[0], 

             frame->i_stride_lowres,

             frame->i_lines_lowres,

             required_threads);

   /* Copy back lowres[] */

   for (i=0; i < 4; i++) {

     CUDA_SAFE_CALL( cudaMemcpy( frame->buffer_lowres[i], 

      d_lowres_buffer[i], 

      i_lowres_buffer_size, 

      cudaMemcpyDeviceToHost) );

      CUDA_SAFE_CALL( cudaFree( d_lowres_buffer[i] ));

    }

    CUDA_SAFE_CALL( cudaFree(d_buffer) );

   CUT_SAFE_CALL( cutStopTimer( timer ));

    exec_time = cutGetTimerValue( timer );

    CUT_SAFE_CALL( cutDeleteTimer(timer) );

   printf("x264_frame_init_lowres_nVidia: %f ms\n", exec_time);

}

I’ll attempt to further isolate the problem and file a bug. Most of my simple examples work.

Abe

Spencer, something looks wrong in your code: you pass d_lowres_plane, which is a host pointer, to x264_frame_init_lowres_kernel(), presumably dereferencing it in device code, which is forbidden (see the last sentence of Section 4.2.2.4).

I think you want d_lowres_plane to be a device pointer and fill the memory it points to using cudaMemcpy().

Cyril,

Thanks. I am surprised that the nvcc compiler did not generate an error. Part of my confusion is that when I originally declared the variable __device__ inside the function, nvcc gave me compile errors, so I removed the qualifier and the compiler was happy.

Unfortunately, after I moved the __device__ variables outside of the function, nvcc is happy but I still get the same runtime error as before. :(

If you make d_lowres_buffer a __device__ variable, then cudaMalloc( (void **) &d_lowres_buffer[i], i_lowres_buffer_size) will fail, because cudaMalloc() can’t write to device memory.
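To illustrate the distinction with a hedged sketch (not code from the thread): cudaMalloc() has to write the new device address into the pointer variable you pass it, so that variable must live in host memory. A __device__ symbol can only be filled in afterwards, e.g. with cudaMemcpyToSymbol():

__device__ uint8_t* d_buf;          /* pointer variable living in device memory */

/* cudaMalloc( (void**)&d_buf, size );  WRONG: cudaMalloc() cannot store
                                        the new address into device memory */

uint8_t* h_ptr = NULL;              /* ordinary host variable */
cudaMalloc( (void**)&h_ptr, size ); /* OK: the address is written to host memory */
cudaMemcpyToSymbol( d_buf, &h_ptr, sizeof(h_ptr) ); /* then copy it to the symbol */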
Same problem and solution as this topic.

We will try and improve nvcc to detect such pointer issues.

Cyril

How would I do pointer calculations on device variables? For example, consider this code fragment

   const int i_lowres_buffer_size = frame->i_stride_lowres *
                                    (frame->i_lines[0]/2 + 2*i_padv);
   uint8_t* d_lowres_buffer[4];
   uint8_t* d_lowres_plane[4];

   CUDA_SAFE_CALL( cudaMalloc( &d_lowres_buffer, sizeof(uint8_t *)*4 ));

   for (i=0; i<4; i++) {
      CUDA_SAFE_CALL( cudaMalloc( (void **) &d_lowres_buffer[i], i_lowres_buffer_size) );
      d_lowres_plane[i] = ((uint8_t *) d_lowres_buffer[i]) + frame->i_stride_lowres * i_padv + PADH;
   }

I’ve changed the cudaMalloc() calls as you suggested, but I don’t think I am allowed to do the pointer arithmetic I need for d_lowres_plane[i] on the host. Do I need to do all of this in the kernel function?

You can do device pointer arithmetic on the host and copy the result to the device.
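For example, a minimal sketch of that pattern (an illustration, not code from the thread): the pointers returned by cudaMalloc() are just addresses, so the host can offset them and then copy the resulting table into device memory for the kernel to read:

uint8_t*  h_lowres_plane[4];      /* pointer table built on the host */
uint8_t** d_lowres_plane = NULL;  /* device-side copy the kernel will use */

CUDA_SAFE_CALL( cudaMalloc( (void**) &d_lowres_plane, 4 * sizeof(uint8_t*) ));
for (i = 0; i < 4; i++) {
    CUDA_SAFE_CALL( cudaMalloc( (void**) &d_lowres_buffer[i], i_lowres_buffer_size ));
    /* plain arithmetic on the device address, performed on the host */
    h_lowres_plane[i] = d_lowres_buffer[i] + frame->i_stride_lowres * i_padv + PADH;
}
/* copy the table of device pointers into device memory */
CUDA_SAFE_CALL( cudaMemcpy( d_lowres_plane, h_lowres_plane, 4 * sizeof(uint8_t*),
                            cudaMemcpyHostToDevice ));
/* the kernel now receives a uint8_t** it may safely dereference */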

I think my system instability can be attributed to memory exceptions also. (At least I eliminated the symptom by adding some bounds checking to array access in the kernel.) What is supposed to happen if the kernel writes to something way off the end of an array? Surely the host shouldn’t reboot.

-Thanks
Abe

Hi, I’ve finally been able to reproduce the behavior in a simple app. The call cudaGLUnmapBufferObject returns code 10201. Oddly, the first couple of times the kernel is invoked everything works as expected. The kernel reads pixels from one PBO and adds them to pixels from another PBO, which is then copied to the screen. (I think my GL code is very similar to the post-processing example.) The kernel contains one loop which executes eight times, so I don’t think it’s a case of a Windows watchdog timeout.

I’m still experiencing host instability. If I run the simple program once, I get the error message (which triggers an assertion failure). If I restart the program, most of the time the host crashes. I seem to get one free run after rebooting :-), but then the host crashes on subsequent runs. The host seems to crash more frequently if the program is run outside of a debugger. Occasionally I see some sort of hex-filled blue screen before the reboot, but I can’t read it quickly enough.

Abe

I added a lot of range-checking code to my kernel to make sure I am writing within my output buffer and not reading off the end of the input buffers. Running on the emulator, everything is fine: no reading or writing from out-of-bounds locations.

But when I run it on the device, the same error occurs! The CUDA return code from cudaMemcpy() is 10201, like abestephens is getting. My GPU does not lock up, but the device-to-host memory copy has never once succeeded after a kernel call.

I tried commenting out my kernel call, and everything works beautifully, which means it must be something related to the kernel call. But there is basically no way of figuring out what the error in the kernel is (if there is any!). assert() is not supported, and stdio is not supported, which is fair enough since there are no I/O devices attached, BUT there is no standardized mechanism to set any sort of application return code either. What a pain!!! :angry:

The kernel algorithm appears fine since the output I get in emulation mode matches my reference implementation output exactly (md5sum matches).

Now I must sit down and write a per-thread execution tracing/error-reporting mechanism in the hope of catching the error (or maybe I will try RapidMind instead).

I’ve attributed this behavior to a glDrawPixels command copying from the PBO after it is unmapped/unregistered from CUDA. I switched to rendering a full-screen textured quad, as the post-processing example shows, and everything worked.
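(For reference, a sketch of the per-frame ordering that keeps CUDA and GL from touching the buffer at the same time, using the cudaGL interop API; pbo and my_kernel are placeholder names:)

GLuint pbo;                                    /* GL pixel buffer object, created once */
cudaGLRegisterBufferObject( pbo );             /* register once at setup */

/* each frame: */
uint8_t* d_pbo = NULL;
cudaGLMapBufferObject( (void**)&d_pbo, pbo );  /* CUDA owns the buffer */
my_kernel<<< grid, threads >>>( d_pbo );
cudaGLUnmapBufferObject( pbo );                /* hand it back to GL */
/* only after unmapping may GL read the buffer, e.g. as a texture source */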

Abe

Did you check the error code from the kernel invocation (rather than memcpy)? What is it?
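For example, something along these lines (my_kernel, grid, and threads are placeholders; cudaGetLastError() picks up launch failures, and a synchronize is needed before errors that happen during execution become visible):

my_kernel<<< grid, threads >>>( /* args */ );
cudaError_t err = cudaGetLastError();     /* launch/configuration errors */
if (err == cudaSuccess)
    err = cudaThreadSynchronize();        /* errors during execution */
if (err != cudaSuccess)
    printf("kernel failed: %s\n", cudaGetErrorString(err));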

I’m told that the only way to get error code 10201 after a launch (rather than a memcpy) is if the launch took too long, the watchdog timer kicked in, and CUDA bailed out. How long does your kernel execution take?

Correct execution in emulation is NOT a strong indicator of correct code for the GPU. You could have a __syncthreads() in the wrong place, or you could be trying to dereference a host pointer on the device (or vice versa), neither of which is prevented in all cases by the emulator.

If you can provide a repro example, we should be able to help.

Mark

Hi Mark,

After lots of trial and error and head scratching, I got my kernel to work last night. :magic: It ultimately came down to device vs. host pointers; thanks to Cyril for his suggestions. I will make the code available once I have a chance to do some performance tuning and write some more kernels. As the x264 codec is GPL’ed code, there are no IP issues on my part; I just want to have more done before I show something.

My biggest frustration has been the difficulty of debugging. I did figure out that the kernel was failing with a launch failure (RC=3) because it aborted. After that it was a matter of searching for the failure point using return statements, and the cause turned out to be dereferencing a host pointer.

Is it possible to document all these return codes? 3? 10201?

Is there a way to have the CUDA driver return some context information on failure (file + offset) in the future? Having to sprinkle return statements all over to figure out where things failed is positively baroque. I’d rather use a logic analyzer. :-)

As for emulator inaccuracies, what the emulator showed me was that my basic algorithm mapping was correct, which was crucial in letting me look at other potential problems. What I would really like to use is a cycle-accurate simulator. The Intel network processor SDK has one; it runs really slowly, but it works.

Spencer

Hi Spencer,

Yes, we’re aware that debugging is currently painful, and we’ll be working on better debugging tools in the future. Thanks for your suggestions.

Does cudaGetErrorString() not work for you?

You can write a macro to do this, much like we have done in the cutil library (CUT_CHECK_ERROR, CUT_SAFE_CALL).
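For illustration, a minimal sketch of such a macro (the name is hypothetical; cutil’s CUDA_SAFE_CALL does essentially this, and it requires <stdio.h> and <stdlib.h>):

#define MY_SAFE_CALL(call) do {                                    \
    cudaError_t err = (call);                                      \
    if (err != cudaSuccess) {                                      \
        fprintf(stderr, "CUDA error at %s:%d: %s\n",               \
                __FILE__, __LINE__, cudaGetErrorString(err));      \
        exit(EXIT_FAILURE);                                        \
    }                                                              \
} while (0)

/* usage: */
MY_SAFE_CALL( cudaMemcpy(dst, src, n, cudaMemcpyDeviceToHost) );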

A cycle-accurate simulator would possibly require as much work as building a hardware debugger, which we think would be more worthwhile.

Mark

Hi Mark,

That would really be great. For non-scientific programming like what I am trying to do, the algorithms are not always well defined.

cudaGetErrorString() doesn’t always return meaningful errors. For example, RC=3 from a kernel launch just says (from memory) something like “launch failed”, which I would interpret to mean the kernel failed to run because of resource constraints, not that it aborted because of a bug in the kernel code. And what does 10201 mean? Some explanation of the error codes, their causes, and what we are supposed to do about them would help.

A hardware debugger hooked into GDB would be great, if your team can do it.

Spencer

Hi,

I am getting an unspecified driver error when I try to initialize the Direct3D interop. The call that fails for me is

CUDA_SAFE_CALL(cudaD3D9begin(pd3dDevice));

To make sure I had not done something wrong, I tried running the simpleD3D and fluidsD3D samples, but they throw the same error too.

Interestingly, if I restart my machine and start with the samples, they run fine. But once I run my code (which essentially just attempts to initialize the D3D device and bring up a blank D3D window), it triggers this error for any D3D interop application run after that.

Any help appreciated. Thanks.

tanmay

P.S.: simpleD3D line 133 is where the error comes from in that project.