Kernel is not being launched. SDK kernels get launched. Mine doesn't.

Thank you for taking the time to read this post.

Here is my problem:

I ported a C++ project into CUDA.

I mimicked the CUDA SDK sample simpleTexture3D and wrote a kernel launch wrapper that is called from the .cpp code.

I set a breakpoint at the kernel launch, and it is hit.

When I step in, the debugger ignores the kernel function and skips to the next line.

CUT_CHECK_ERROR("Kernel execution failed"); then just stops the program, so there must be a kernel launch error.

I tried the same step in simpleTexture3D. The breakpoint is hit and step-in brings me to the kernel function (although one more press gives me no source code.)

Here is how I set up the CUDA code:

[codebox]
// Callbacks.cpp
extern "C" void launchKernel(…);

void render() {
    launchKernel(…);
}

// kernel.cu
__global__ void kernel(…) { … }

extern "C" void launchKernel(…) {
    kernel<<<…>>>(…);
}
[/codebox]

This seemed to be all I needed to launch the kernel. I’m using CUDA Build Rule v3.0.14

and I’m greatly in need of help.

Thank you.

Okay… my previous blockSize was (32,32,1).

I changed it to (16,16,1) and got a different result: a message box saying there is no source code available for the current location.

Does this mean my kernel is running?
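One plausible explanation, assuming the GPU is an older device limited to 512 threads per block: a (32,32,1) block requests 1024 threads, so the launch is rejected before the kernel ever starts, while (16,16,1) requests only 256 and is accepted. The sketch below is one way to check this. It queries the device limit and tries both block sizes on a hypothetical empty kernel. The names emptyKernel and tryLaunch and the choice of device 0 are illustrative assumptions, not part of the original project. It uses cudaDeviceSynchronize(); on toolkits as old as 3.0 the equivalent call was cudaThreadSynchronize().

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical empty kernel, used only to test whether a launch
// configuration is accepted by the device.
__global__ void emptyKernel() {}

// Launch emptyKernel with one block of the given size and print the status.
static void tryLaunch(dim3 block) {
    emptyKernel<<<dim3(1, 1, 1), block>>>();

    // cudaGetLastError() reports launch-time failures such as an
    // invalid configuration, and clears the error afterwards.
    cudaError_t err = cudaGetLastError();

    // If the launch was accepted, wait for the kernel so that
    // errors raised during execution are reported as well.
    if (err == cudaSuccess) {
        err = cudaDeviceSynchronize();
    }

    printf("block (%u,%u,%u) = %u threads: %s\n",
           block.x, block.y, block.z,
           block.x * block.y * block.z,
           cudaGetErrorString(err));
}

int main() {
    cudaDeviceProp prop;
    // Device 0 is an assumption; pick the device the application uses.
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        printf("cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("maxThreadsPerBlock = %d\n", prop.maxThreadsPerBlock);

    tryLaunch(dim3(32, 32, 1));  // 1024 threads per block
    tryLaunch(dim3(16, 16, 1));  // 256 threads per block
    return 0;
}
```

If the reported maxThreadsPerBlock is below 1024, the first launch should report an error and the second should not. The "no source code available" message on its own only says the debugger stepped into code it has no source for, so checking the returned error is the more reliable test of whether the kernel ran.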

Also, I’m passing in float* d_output, and in the kernel I set d_output[i] = 10; //an arbitrary number.

After the kernel launch wrapper I called cout << d_output[i] << endl; and it prints 0.

Does this mean the kernel hasn’t been called, since it’s all zero?

This time, the application doesn’t terminate with CUT_CHECK_ERROR("Kernel execution failed")

Does this mean anything?

You should do proper memory allocation and copying between host and device. You cannot cout << device memory.

Oh, that was just to see what’s in there, which I realize is not an elegant way of doing it.

It’s not a question of elegance, actually. If you dereference a device pointer on the host (or vice versa), you will either get random garbage or a segfault. They are completely independent memory spaces. You have to cudaMemcpy() between host and device pointers to move this kind of data back and forth.
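As a minimal sketch of that round trip, assuming a one-dimensional output buffer: the kernel writes into device memory, and the host copies the result into its own array with cudaMemcpy() before printing it. The kernel name fill, the host buffer h_output, the size n, and the launch configuration are all hypothetical. Only d_output and the value 10 come from the post above.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: every thread writes an arbitrary value to one element.
__global__ void fill(float* d_output, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        d_output[i] = 10.0f;
    }
}

int main() {
    const int n = 256;                      // assumed element count
    const size_t bytes = n * sizeof(float);

    // Allocate the output buffer in device memory.
    float* d_output = NULL;
    cudaError_t err = cudaMalloc((void**)&d_output, bytes);
    if (err != cudaSuccess) {
        printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Launch enough 128-thread blocks to cover all n elements.
    fill<<<(n + 127) / 128, 128>>>(d_output, n);
    err = cudaGetLastError();               // launch-time errors

    // Copy the result into host memory. This cudaMemcpy waits for the
    // preceding kernel to finish before copying.
    std::vector<float> h_output(n, 0.0f);
    if (err == cudaSuccess) {
        err = cudaMemcpy(h_output.data(), d_output, bytes,
                         cudaMemcpyDeviceToHost);
    }
    cudaFree(d_output);

    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Only the host copy is safe to read from CPU code.
    printf("h_output[0] = %f\n", h_output[0]);
    return 0;
}
```

Reading h_output rather than d_output on the host keeps the two memory spaces separate. If the printed value is still 0 with this pattern, the returned error code is the first thing to inspect.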