CUDA Error: Invalid Device Function Debugging CUDA errors

Hi. I am working on a file converter program that uses CUDA to accelerate the process by assigning one thread per file. Currently, the program works fine in device emulation mode. However, when I compile without emulation mode, it runs without crashing, but the image generated by the conversion is wrong. I checked the output in VS2008, and found:

I’m immediately confused. Don’t uncaught C++ exceptions terminate the program? Setting that aside, I check the return value of every CUDA function call to make sure there is no error, and none of those checks reported a failure. I set the debugger to break on C++ exceptions and found that the first exception occurred at cudaMalloc():

// The unfiltered file contents of each hnd file
void* d_hnds;
uint* d_hndsOffsets;
uint* hndsSizes;

if (cudaMalloc((void**)&d_hndsOffsets, sizeof(uint) * numFiles))
{
	printf("cudaMalloc() failed\n");
	return;
}

However, even though a C++ exception was raised, the return value of cudaMalloc() was 0 (success). I wanted to get the cudaError from the exception, so I tried to catch it, but I couldn’t. I moved on to the second exception and found that it occurred at a kernel call:

hndsToRawsKernel<<<gridSize, blockSize>>>(d_hnds, d_hndsOffsets, (uint32*)d_raws, rawFileSize);

I used CUT_CHECK_ERROR() before and after the call. Before the call, it reported no error. After the call it printed out:

I searched for what this error code meant, and I didn’t find any helpful information. Any help is greatly appreciated.

The two most common reasons (that I’ve seen, anyway) for getting errors when you move from emulation to actually running code on the device are: (1) accidentally passing a host pointer to a device method, or vice versa; and (2) race conditions in your kernel.

It usually helps if you post a bit more code…but perhaps you meant to cudaMalloc the d_hnds pointer, instead of d_hndsOffsets? If not, attach some more code to your post and maybe someone can help you find the problem.
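For what it’s worth, here is a minimal sketch of how launch errors and execution errors can be checked separately around a kernel call. CHECK_CUDA and fillKernel are hypothetical names made up for illustration (fillKernel only stands in for hndsToRawsKernel), and the sketch assumes a newer toolkit that provides cudaDeviceSynchronize(); on CUDA 2.3 the equivalent call was cudaThreadSynchronize(). A launch failure such as "invalid device function" would be expected to show up at the cudaGetLastError() check.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro (not from the original post): print a readable
// message and exit if a CUDA runtime call does not return cudaSuccess.
#define CHECK_CUDA(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s:%d: %s failed: %s\n", __FILE__,       \
                    __LINE__, #call, cudaGetErrorString(err_));       \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Placeholder kernel standing in for hndsToRawsKernel: each thread
// writes its own global index into the output array.
__global__ void fillKernel(unsigned int* out, unsigned int n)
{
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = i;
}

int main()
{
    const unsigned int n = 1024;
    unsigned int* d_out = NULL;
    CHECK_CUDA(cudaMalloc((void**)&d_out, n * sizeof(unsigned int)));

    fillKernel<<<(n + 255) / 256, 256>>>(d_out, n);
    // Errors detected at launch time (bad configuration, no usable
    // kernel image for this device, ...) are reported here.
    CHECK_CUDA(cudaGetLastError());
    // Errors that occur while the kernel runs are reported once the
    // host waits for the device to finish.
    CHECK_CUDA(cudaDeviceSynchronize());

    unsigned int h_out[n];
    CHECK_CUDA(cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost));
    printf("h_out[%u] = %u\n", n - 1, h_out[n - 1]);

    CHECK_CUDA(cudaFree(d_out));
    return 0;
}
```

Checking the two stages separately at least tells you whether the kernel never started or started and then failed, which narrows down where to look.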

I did mean d_hndsOffsets. The program works like this: I read in all of the hnd files to convert (to raw) and store their contents in one big block of device memory. I pass the pointer to this memory to the kernel, along with a block of device memory holding the offsets of the individual hnd files (the hnd files are not all the same size). Previously I had been using a pointer-to-pointer scheme to reference each hnd file, but I thought it could be the cause of my problems, so I removed it.

So far, I’ve re-created the project file using the wizard at http://sourceforge.net/projects/cudavswizard/ (CUDA_VS_Wizard_W32.2.0.zip) and it still doesn’t work. I’ve also made sure that I don’t dereference device pointers or do pointer arithmetic on device pointers in host code. I no longer get any C++ exceptions at all, and the errors have disappeared (including the “Invalid Device Function”), but the resulting image is still incorrect (while emulation mode still produces the correct result). I will attach the entire source file here to see if anyone can help me. Any help is greatly appreciated.
hnd_to_raw_cuda.cu.txt (8.57 KB)
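The layout described above (one contiguous block plus an array of offsets) can be sketched on the host roughly as follows. This is only an illustration and is not taken from the attached file: PackedFiles, packFiles and fileSize are hypothetical names, and the extra final offset entry (the total size) is an assumption added so that each file’s size can be derived from two neighbouring offsets.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

typedef unsigned int uint;

// Hypothetical container (not from the original post): all file contents
// stored back to back, plus the starting offset of each file.
struct PackedFiles {
    std::vector<unsigned char> bytes;  // concatenated file contents
    std::vector<uint> offsets;         // numFiles + 1 entries; last = total size
};

// Concatenate variable-sized files into one block and record where each
// file starts.
PackedFiles packFiles(const std::vector<std::vector<unsigned char> >& files)
{
    PackedFiles packed;
    packed.offsets.reserve(files.size() + 1);
    for (std::size_t i = 0; i < files.size(); ++i) {
        packed.offsets.push_back(static_cast<uint>(packed.bytes.size()));
        packed.bytes.insert(packed.bytes.end(), files[i].begin(), files[i].end());
    }
    // Sentinel entry: one past the end of the last file.
    packed.offsets.push_back(static_cast<uint>(packed.bytes.size()));
    return packed;
}

// Size in bytes of file i, derived from two neighbouring offsets.
inline uint fileSize(const PackedFiles& p, std::size_t i)
{
    assert(i + 1 < p.offsets.size());
    return p.offsets[i + 1] - p.offsets[i];
}
```

In a scheme like this, bytes and offsets would be the two host arrays copied with cudaMemcpy() into d_hnds and d_hndsOffsets, and the thread handling file i would read from d_hnds + d_hndsOffsets[i].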

Here is some information on my setup which I did not include before:

OS: Windows XP SP2
CPU: Intel Core 2 Duo E6550 @ 2.33GHz
GPU: GeForce 9500 GT
RAM: 3.25 GB
CUDA Version: 2.3
Compiler for Host code: Microsoft Visual C++ 2008 Express Edition