Strange kernel crashes on (apparently) trivial operations...

hello everybody,

I’m encountering strange problems in a CUDA program I’m writing, and I hope somebody can tell me about similar experiences:

I process images using 3 filters, the output of a pixel being a mix of the 3 filter outputs, so I use 3 device functions to get my responses:

int x = filter1Response(…);
int y = filter2Response(…);
int z = filter3Response(…);

I tested each response separately and they seem OK. Next, since I’m using 1 thread per pixel of my image, I try to assign the output directly:

odata[blockIdx.x * blockDim.x + threadIdx.x] = x * y + z * z;

and the program doesn’t work at all: no visible error, but each cudaMemcpy seems to fetch from memory the result of the previous successful execution.

but if I try something like this:

odata[blockIdx.x * blockDim.x + threadIdx.x] = x;
odata[blockIdx.x * blockDim.x + threadIdx.x] = odata[blockIdx.x * blockDim.x + threadIdx.x] * y;

I can see the right output for x * y… If I try to use local variables to store the intermediate values, I hit the same problem, but I can’t rely only on odata to hold the values for this kind of operation…
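For context, the overall structure of the kernel is roughly this (the filter bodies below are just placeholders, the real ones are much heavier):

__device__ int filter1Response(const int *img, int idx) { return img[idx]; }
__device__ int filter2Response(const int *img, int idx) { return img[idx] >> 1; }
__device__ int filter3Response(const int *img, int idx) { return img[idx] & 0xFF; }

__global__ void mixFilters(const int *img, int *odata)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per pixel

    int x = filter1Response(img, idx);
    int y = filter2Response(img, idx);
    int z = filter3Response(img, idx);

    odata[idx] = x * y + z * z;                        // this is the line that breaks everything
}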

I’m really without a clue here; if somebody can give me advice, I would be grateful…

Is odata also an integer type?

Also, strange errors are part of the CUDA fun :) get used to it.

I just solved one myself today.

Yes, same type… I’ve only been working with CUDA for 2 months, and I hope the framework is a bit more reliable than this… Even if I changed everything, there is no guarantee that it would work…

Are there, by any chance, factors that can disrupt the execution of a kernel? (previous runs, desktop activity…)

Yes, there are a lot of things which can disrupt proper execution of a kernel. What exactly do you mean by “some factor”? Previous runs might affect things if you don’t free your GPU arrays before you exit the program. Memory corruption happens for me when I do that… I end up restarting the computer.

Also, can you post the code here where the kernel messes up? Generally I have learned from the veterans on this forum that it’s a bug in the code rather than some hardware fault which causes kernels to fail.
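One quick thing worth doing is checking the launch status explicitly right after the kernel call; a failed launch is otherwise silent and the old contents of your output buffer just stay there. A minimal sketch (dummy kernel, just to show the pattern):

#include <cstdio>

__global__ void dummyKernel(int *out) { out[threadIdx.x] = threadIdx.x; }

int main()
{
    int *d_out;
    cudaMalloc((void **)&d_out, 256 * sizeof(int));

    dummyKernel<<<1, 256>>>(d_out);

    cudaError_t err = cudaGetLastError();        // reports launch failures (bad configuration, too many resources requested, ...)
    if (err != cudaSuccess)
        printf("launch failed: %s\n", cudaGetErrorString(err));

    err = cudaThreadSynchronize();               // reports errors that occur while the kernel runs
    if (err != cudaSuccess)
        printf("kernel execution failed: %s\n", cudaGetErrorString(err));

    cudaFree(d_out);
    return 0;
}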

I wouldn’t say that CUDA is not reliable, but yes, it seems like that when you work on it initially. But once you get the hang of it, it opens up a range of possibilities for your computational needs (sounds cheesy ;) ).

I know people using CUDA for professional development and it works great for them, though they must be using the Tesla compute systems, I suppose.

Yes, I’m sure too that NVIDIA personnel can write a compiler :)

I may have found what went wrong: this kernel uses a fair number of local variables (~30). I moved some of them to shared memory (as I intended to do later anyway, for performance gains) and found that my algorithm was working!

My theory is that the local per-processor memory is tiny and I depleted it quickly, so there wasn’t any memory left for the temporary variables of the arithmetic operations, hence the big failure.
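For reference, this is roughly the kind of change I made (simplified, hypothetical names; it assumes a fixed block size of 256 threads):

__global__ void filterKernel(const int *in, int *odata)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per pixel

    // was: int tmp = in[idx] * 2;                     // one more per-thread local variable
    __shared__ int tmp[256];                           // one slot per thread, assumes blockDim.x == 256
    tmp[threadIdx.x] = in[idx] * 2;

    odata[idx] = tmp[threadIdx.x] + 1;
}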

Maybe somebody with more experience of the device architecture can confirm or refute…

In all likelihood, the kernel version which “fails” wasn’t actually running at all. There are finite limits on the total register count for a running block: 8192 for compute 1.0 and 1.1 devices, and 16384 for compute 1.2 and 1.3 devices. Your kernel's register count (obtainable by passing --ptxas-options="-v -mem" to nvcc during compilation) times the total threads per block must be lower than this limit. By moving some of the intermediate data to shared memory or global memory, you are reducing the register count and the kernel runs. The simplest solution is to reduce the number of threads per block.
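To make the arithmetic concrete (the numbers here are purely illustrative): if ptxas reports 36 registers per thread and you launch 256 threads per block, that is 36 × 256 = 9216 registers, more than the 8192 available per block on a compute 1.0/1.1 device, so the launch fails; 192 threads per block (6912 registers) would fit. You can also query the per-block register limit at runtime, something like:

#include <cstdio>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                     // device 0

    int regsPerThread   = 36;                              // illustrative value, taken from --ptxas-options=-v output
    int threadsPerBlock = 256;
    int regsNeeded      = regsPerThread * threadsPerBlock;

    printf("registers available per block : %d\n", prop.regsPerBlock);
    printf("registers needed by the launch: %d\n", regsNeeded);
    if (regsNeeded > prop.regsPerBlock)
        printf("this launch would fail - lower the thread count or the register usage\n");

    return 0;
}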

All this is very clearly described in Chapter 5 and the first appendix of the programming guide that comes with the CUDA toolkit.