Strange kernel crashes on (apparently) trivial operations...

hello everybody,

I’m encountering strange problems in a CUDA program I’m writing, and I hope somebody can tell me about similar experiences:

I process images using 3 filters, the output of a pixel being a mix of the 3 filter outputs, so I use 3 device functions to get my responses:

int x = filter1Response(…);
int y = filter2Response(…);
int z = filter3Response(…);

I tested each response separately and they seem OK, but then, since I’m using 1 thread per pixel of my image, I try to assign the output directly:

odata[blockIdx.x * blockDim.x + threadIdx.x] = x * y + z * z;

and the program doesn’t work at all: no visible error, but each cudaMemcpy seems to read back from memory the result of
the previous successful execution.
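To give the full picture, the kernel is roughly shaped like this (a simplified sketch; the kernel name is made up and the filter arguments are omitted):

```cuda
__global__ void mixFilters(int *odata /*, filter inputs… */)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // 1 thread per pixel

    // each device function computes one filter response for this pixel
    int x = filter1Response(/* … */);
    int y = filter2Response(/* … */);
    int z = filter3Response(/* … */);

    // this single combined write is where things go wrong
    odata[i] = x * y + z * z;
}
```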

but if I try something like this:

odata[blockIdx.x * blockDim.x + threadIdx.x] = x;
odata[blockIdx.x * blockDim.x + threadIdx.x] = odata[blockIdx.x * blockDim.x + threadIdx.x] * y;

I can see the right output for x*y… If I try to use local variables to store the intermediate values, I hit the same problem, but I can’t rely only on odata to store the values for this kind of operation…

I’m really without a clue here; if somebody can give me advice, I would be grateful…

Is odata also an integer type?

Also, strange errors are part of the CUDA fun :) Get used to it.

I just solved one myself today.

Yes, same type… I’ve only been working with CUDA for 2 months, and I hope the framework is a bit more reliable than this… Even if I changed everything, there is no guarantee that it would work…

Are there, by any chance, some factors that can disrupt the execution of a kernel? (previous runs, desktop activity…)

Yes, there are a lot of things which can disrupt proper execution of a kernel. What exactly do you mean by “some factor”? Previous runs might affect it if you don’t free your GPU arrays before you exit the program. Memory corruption happens for me when I do that… I end up restarting the computer.
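For example, I now make sure every allocation is released before the program exits (a minimal sketch; the pointer names and size are made up):

```cuda
int *d_idata = 0, *d_odata = 0;
cudaMalloc((void **)&d_idata, n * sizeof(int));
cudaMalloc((void **)&d_odata, n * sizeof(int));

/* … kernel launches and cudaMemcpy calls … */

// free everything before exiting, otherwise the next run
// may find the device memory in a stale or corrupt state
cudaFree(d_idata);
cudaFree(d_odata);
```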

Also, can you post your code here where the kernel messes up? Generally I have learned from the veterans on this forum that it’s a bug in the code rather than some hardware fault which causes kernels to fail.
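One more thing worth trying: a kernel launch doesn’t report errors by itself, you have to query them explicitly, otherwise a failed launch looks exactly like what you describe (the old data just survives in memory). A minimal sketch, assuming a kernel called myKernel and launch parameters grid and block:

```cuda
myKernel<<<grid, block>>>(idata, odata);

// launch-time errors (bad configuration, too many resources requested, …)
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
    printf("launch failed: %s\n", cudaGetErrorString(err));

// errors raised while the kernel was actually running
err = cudaThreadSynchronize();
if (err != cudaSuccess)
    printf("kernel failed: %s\n", cudaGetErrorString(err));
```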

I wouldn’t say that CUDA is not reliable, but yes, it seems like that when you first work on it. Once you get the hang of it, though, it opens up a range of possibilities for your computational needs (sounds cheesy ;) ).

I know people using CUDA for professional development and it works great for them, though they must be using the Tesla compute system, I suppose.

Yes, I’m sure too that NVIDIA personnel can write a compiler :)

I may have found what went wrong: this kernel uses quite a number of local variables (~30). I moved some of them to shared memory (as I intended to do later anyway, for performance gains) and found that my algorithm was working!

My theory is that the per-thread local memory is tiny and I depleted it quickly, so there wasn’t any memory left for the temporary variables of the arithmetic operations, therefore big failure.

Maybe somebody with more experience with the device architectures can confirm or refute this…

In all likelihood, the kernel version which “fails” wasn’t actually running at all. There is a finite limit on the total register count for a running block: 8192 registers for compute 1.0 and 1.1 devices, and 16384 for compute 1.2 and 1.3 devices. Your kernel’s per-thread register count (obtainable by passing --ptxas-options="-v -mem" to nvcc during compilation) times the number of threads per block must stay below this limit. By moving some of the intermediate data to shared memory or global memory, you reduce the register count and the kernel runs. The simplest solution is to reduce the number of threads per block.

All this is very clearly described in Chapter 5 and the first appendix of the programming guide which comes with the CUDA toolkit.