CUDA kernel not running on Windows XP

I am running CUDA on Windows XP with a GeForce 9800.
I use Visual C++ 2005.
My kernel is launched inside a loop.
If I give it a search window parameter of 5, it works.
However, if I give it a search window parameter of 50, the result memory contains no data (the output image is black).
If I move the kernel launch out of the loop and run just one iteration, still with 50 as the parameter, it works.
So I thought I was doing too much work and CUDA was simply aborting it.
I thought it might be related to the 5-second limit, but I am not sure it reaches 5 seconds.
However, the loop runs on the CPU (isn’t the limit supposed to apply to kernels, not to code running on the CPU?).
If it is the 5-second limit, do I need to select which card CUDA runs on, and have one card with no screens attached to it?

Thank you.
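
On the question of choosing a card: a minimal sketch of how device selection could look, assuming a toolkit recent enough that cudaDeviceProp has the kernelExecTimeoutEnabled field. As far as I know the run-time limit applies to each kernel launch separately, not to the CPU loop as a whole. The helper firstDeviceWithoutWatchdog and the 16-device cap are made up for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: index of the first device whose run-time limit flag
// is 0, or -1 if every device has the limit enabled.
static int firstDeviceWithoutWatchdog(const int *timeoutEnabled, int count)
{
    for (int dev = 0; dev < count; ++dev)
        if (!timeoutEnabled[dev])
            return dev;
    return -1;
}

int main()
{
    const int maxDevices = 16;  // arbitrary cap for this sketch
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        fprintf(stderr, "no CUDA device found\n");
        return 1;
    }
    if (count > maxDevices)
        count = maxDevices;

    int timeoutEnabled[maxDevices];
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, dev) != cudaSuccess) {
            timeoutEnabled[dev] = 1;  // unknown: treat as limited
            continue;
        }
        timeoutEnabled[dev] = prop.kernelExecTimeoutEnabled;
        printf("device %d: %s, run-time limit on kernels: %s\n",
               dev, prop.name, timeoutEnabled[dev] ? "yes" : "no");
    }

    int chosen = firstDeviceWithoutWatchdog(timeoutEnabled, count);
    if (chosen < 0) {
        printf("every device has the limit enabled, using device 0\n");
        chosen = 0;
    }
    // Select the device before allocating memory or launching kernels.
    if (cudaSetDevice(chosen) != cudaSuccess) {
        fprintf(stderr, "cudaSetDevice(%d) failed\n", chosen);
        return 1;
    }
    printf("using device %d\n", chosen);
    return 0;
}
```

A card that drives a display normally reports the limit as enabled, so on a single-card machine this will just fall back to device 0.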

Maybe I am missing something, but I believe there are some limitations on the kernel which I am not aware of.
The only limitations I know of are giving the kernel a thread block of no more than 512 threads, and avoiding memory leaks.
However, it seems the kernel doesn’t run when it does certain calculations, and I don’t know why.
Perhaps there is a limit on how much memory the kernel can read from?
For instance, doing:
Result[i] = a[i];
or
Result[i] = b[i];
Will work.
But doing
Result[i] = a[i]+b[i];
Will not work.
(It’s a simplified example.)
So I have no idea why my kernels don’t run, what the cause is, or how I can debug it.

Any help will be appreciated.

Standard debugging practice is to check for errors after every kernel launch.
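
A minimal sketch of what that check could look like using only the runtime API. The helper cudaOk and the kernel addKernel are hypothetical stand-ins, and cudaDeviceSynchronize is the current name for what older toolkits called cudaThreadSynchronize.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: report a failed CUDA call and return false.
static bool cudaOk(cudaError_t err, const char *what)
{
    if (err == cudaSuccess)
        return true;
    fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
    return false;
}

// Hypothetical kernel standing in for the real one.
__global__ void addKernel(const float *a, const float *b, float *result, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        result[i] = a[i] + b[i];
}

int main()
{
    const int n = 1024, threads = 128;
    const size_t bytes = n * sizeof(float);
    float *a = 0, *b = 0, *result = 0;
    if (!cudaOk(cudaMalloc((void **)&a, bytes), "cudaMalloc a") ||
        !cudaOk(cudaMalloc((void **)&b, bytes), "cudaMalloc b") ||
        !cudaOk(cudaMalloc((void **)&result, bytes), "cudaMalloc result"))
        return 1;
    cudaMemset(a, 0, bytes);
    cudaMemset(b, 0, bytes);

    bool ok = true;
    for (int iter = 0; ok && iter < 4; ++iter) {
        addKernel<<<(n + threads - 1) / threads, threads>>>(a, b, result, n);
        // Launch errors (bad configuration, too many resources) show up here.
        ok = cudaOk(cudaGetLastError(), "kernel launch");
        // Errors during execution only show up once the kernel has finished.
        ok = ok && cudaOk(cudaDeviceSynchronize(), "kernel execution");
    }

    cudaFree(a);
    cudaFree(b);
    cudaFree(result);
    return ok ? 0 : 1;
}
```

Checking inside the loop matters: once a launch has failed, later calls tend to fail as well, so the first reported error is the one to look at.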

Given your simplified example, I’m assuming that you are requesting too many resources for the launch: the second version uses more registers than the first, and you are probably exhausting the available registers if you request 512 threads per block. num_thread_per_block * register_usage must not exceed 8192 (16384 on G200).
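
To make the arithmetic concrete, here is a rough sketch that compares a block's register demand against the limit the device reports in cudaDeviceProp::regsPerBlock. The helper fitsRegisterFile is made up for illustration and ignores the allocation granularity the hardware applies, and the 20 registers per thread is only an example figure. The real per-kernel count can be printed at compile time with nvcc --ptxas-options=-v.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: true if one block's registers fit in the register file.
// Simplified: the hardware rounds allocations up, so treat this as optimistic.
static bool fitsRegisterFile(int threadsPerBlock, int regsPerThread, int regsPerBlock)
{
    return threadsPerBlock * regsPerThread <= regsPerBlock;
}

int main()
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        fprintf(stderr, "could not query device 0\n");
        return 1;
    }

    const int regsPerThread = 20;  // example figure, not measured here
    const int blockSizes[] = {25, 256, 512};
    for (int k = 0; k < 3; ++k) {
        int threads = blockSizes[k];
        printf("%3d threads x %d registers = %5d of %d: %s\n",
               threads, regsPerThread, threads * regsPerThread,
               prop.regsPerBlock,
               fitsRegisterFile(threads, regsPerThread, prop.regsPerBlock)
                   ? "fits" : "too many");
    }
    return 0;
}
```

With an 8192-register limit, 512 threads at 20 registers each would need 10240 registers and the launch would fail, while 256 threads (5120 registers) would fit.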

I have added CUT_CHECK_ERROR("Kernel execution failed"); after each kernel launch, and it doesn’t report anything, even though I get all-zero values in the result memory.
I also checked how many registers my kernel uses. It uses 20 registers, so I tested it with a block size of 25, and it still gives me zero values.
The thing is, the kernel gives results if I call it once (with a grid of several blocks and 25 threads per block).
But if I call the same kernel several times inside a for loop, it does not give results.
I don’t know why it doesn’t work.

Edit: The problem is even more severe than I thought.
I run some code, with several kernels, in release mode.
I view the output and there are results.
Then I run the same code again, without recompiling or changing anything.
I just run it, and I get a black image as the result. So the calculations failed.
Why is it so inconsistent?

I have made some progress, I guess, but still encounter “weird” problems.

It is possible that my previous problem was due to the fact that I didn’t free all of the CUDA memory I allocated.

Now I have a different, yet similar problem.

I run a CUDA kernel inside a double loop.

If I make the loops like this:

for (int i=0; i<1; i++)
    for (int j=0; j<1; j++)
        { do kernel }

or like this:

for (int i=1; i<2; i++)
    for (int j=0; j<1; j++)
        { do kernel }

then it works.

If I set the loops to be like this:

for (int i=0; i<2; i++)
    for (int j=0; j<1; j++)
        { do kernel }

then I get the CUDA error “unspecified launch failure”.

Why would it work when I run the kernel for i==0 alone or for i==1 alone, but not when I run it for both, one after the other?

Check for any out of bounds memory accesses. Your seemingly random problems could be explained by them and they are the most common cause of “unspecified launch failure”.

If you are running on Linux, you can compile in emulation mode and run your app through valgrind to find where the out-of-bounds memory accesses are occurring.
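
As an illustration of the kind of check that prevents such accesses, here is a hedged sketch of a 1D search-window kernel. It is a made-up stand-in, not the kernel from this thread: windowSum, inBounds and the sizes are all hypothetical. Both the thread's own index and every neighbour index are tested before memory is touched, which matters more as the window radius grows.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: true if index j is a valid position in an array of n.
__host__ __device__ inline bool inBounds(int j, int n)
{
    return j >= 0 && j < n;
}

// Hypothetical 1D stand-in for a search-window kernel: each output element
// sums the inputs within `radius` of it, skipping neighbours outside [0, n).
__global__ void windowSum(const float *in, float *out, int n, int radius)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (!inBounds(i, n))
        return;                      // the grid may cover more than n threads
    float sum = 0.0f;
    for (int d = -radius; d <= radius; ++d) {
        int j = i + d;
        if (inBounds(j, n))          // the window may extend past either end
            sum += in[j];
    }
    out[i] = sum;
}

int main()
{
    const int n = 1000, radius = 50, threads = 128;
    const size_t bytes = n * sizeof(float);
    float host[n];
    for (int i = 0; i < n; ++i)
        host[i] = 1.0f;

    float *in = 0, *out = 0;
    if (cudaMalloc((void **)&in, bytes) != cudaSuccess ||
        cudaMalloc((void **)&out, bytes) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemcpy(in, host, bytes, cudaMemcpyHostToDevice);

    windowSum<<<(n + threads - 1) / threads, threads>>>(in, out, n, radius);
    cudaError_t err = cudaDeviceSynchronize();
    printf("kernel status: %s\n", cudaGetErrorString(err));

    if (err == cudaSuccess &&
        cudaMemcpy(host, out, bytes, cudaMemcpyDeviceToHost) == cudaSuccess)
        printf("out[0] = %.0f, out[%d] = %.0f\n", host[0], n / 2, host[n / 2]);

    cudaFree(in);
    cudaFree(out);
    return err == cudaSuccess ? 0 : 1;
}
```

Removing either inBounds test makes the kernel read or write outside its buffers, which may appear to work for a small radius and fail for a large one.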