[newbie] float product crashes: strange float product that refuses to be computed

Hi all,

I am a newcomer to the CUDA programming community. I am currently testing, with my modest GeForce GT 120, whether CUDA can help speed up my math calculations. I am writing my first program of my own, but I observe mysterious behavior for a quite simple operation…

A few technical details that may be of some importance:

I’m using Mac OS X 10.6.4 with two GeForce GT 120 cards.

I am coding in C++ in the file "rotMat3.cu", then compiling it to cpp:

nvcc -I"/Applications/MATLAB_R2009b.app/extern/include" --machine 32 --gpu-architecture sm_11 --cuda "rotMat3.cu" --output-file "rotMat3.cpp"

After this I use the mex compiler (essentially g++) to get mex files that can be used by MATLAB:

mex -I/usr/local/cuda/include -L/usr/local/cuda/lib -lcudart rotMat3.cpp

I am asking for help because debugging through mex files and CUDA looks quite tricky…

I am doing some pretty simple vector calculations (in the spirit of the standard vectorAdd example), with everything stored in global memory for now.

Here are the few lines in my kernel function that are mysterious:

[codebox]//vector initialization

int i;

for (i=0; i<(order+1)*(order+1); i++){*(outputMat+i) = 0;} //outputMat is float*


//fill vector

iterOutput = outputMat + n1 + m1*(order+1);

float polynom = powf(cosAng, k11+k22)*powf(sinAng, k12+k21)*powf(-1, k12);

float factorials = (normalizer*factK1*factK2)/nFact;

(*iterOutput) = (*iterOutput) + polynom*factorials;//problem here[/codebox]

In practice this code does not fill my vector outputMat at all: it does nothing. I suspect that it crashes silently and that the threads are stopped, but I get no message.

The strange thing is that if I replace the last line by:

[codebox](*iterOutput) = (*iterOutput) + polynom;//works[/codebox]

or

[codebox](*iterOutput) = (*iterOutput) + factorials;//works[/codebox]

output is correctly filled with polynom floats and factorials floats, respectively. I thus suspect that the product polynom*factorials is the source of the crash. But why?

Maybe it’s an FMAD issue, but I barely understand anything about it. I tried to replace the product with the _fmul intrinsic function, without any success.

Does anyone have any idea how to fix my issue, and where it comes from?



How long does your kernel take to complete?

Hi, the kernel is very short (<1 ms), because I’m trying to process small vectors with low-complexity parameters… for now.

OK, so it’s not the watchdog. Are you checking error codes? Failed kernels do not print error messages.

How do you deduce the output isn’t written - could it be that the sum is just all zeros?

I deduce that the output is not written because the data I copy from device to host looks uninitialized: it generally holds values from previous runs. The initialization loop:

for (i=0; i<(order+1)*(order+1); i++)
    *(outputMat+i) = 0;

had no effect on outputMat. I have also added another vector, with a simple thread-index assignment at the beginning of the kernel:

[codebox]int i = blockIdx.x*blockDim.x + threadIdx.x;

if (i<numAngles)
	d_C[i] = (float) i;[/codebox]

and it is not filled when I copy d_C back to the host. I deduce that either the threads are not launched, or they are launched but crash at runtime. Am I right?

The point is that I don’t know how to check if threads are launched, and recover the potential error messages from them. I’m just launching them like this:

rotationMatrix<<<blocksPerGrid, threadsPerBlock>>>(d_angles, d_output_mat, order, N, d_C);

in the same way as in the CUDA programming guide.

If the threads are launched, I guess this would indicate a runtime error caused by this product. Otherwise it could be… a compilation issue?

Neither nvcc nor mex returns any error message (I can copy/paste the log if you want). Or maybe I am using the wrong compiler options?

Thanks for your help.


The kernel launch itself indeed does not return error codes, as it is asynchronous. The error will be returned by the next CUDA function call, which usually will be the cudaMemcpy() copying results back to the host. You should check the return codes of all CUDA calls though.
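As a minimal sketch of what "check the return codes of all CUDA calls" can look like, the block below wraps each runtime call in a small helper that prints the message from cudaGetErrorString() instead of a fixed string. The names cudaOk, CUDA_OK and roundTrip are hypothetical and do not come from the code in this thread; in a mex file the message could be passed to mexErrMsgTxt() instead of stderr.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not from the thread): reports a failed CUDA runtime
// call with its source location and returns true when the call succeeded.
static bool cudaOk(cudaError_t err, const char* what, const char* file, int line)
{
    if (err == cudaSuccess) return true;
    std::fprintf(stderr, "%s:%d: %s failed: %s\n",
                 file, line, what, cudaGetErrorString(err));
    return false;
}
#define CUDA_OK(call) cudaOk((call), #call, __FILE__, __LINE__)

// Illustrative use: allocate, clear, copy back and free a device buffer,
// checking every step instead of only the final cudaMemcpy().
bool roundTrip(float* h_C, size_t numFloats)
{
    float* d_C = NULL;
    size_t bytes = numFloats * sizeof(float);
    if (!CUDA_OK(cudaMalloc((void**)&d_C, bytes))) return false;
    bool ok = CUDA_OK(cudaMemset(d_C, 0, bytes))
           && CUDA_OK(cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost));
    ok = CUDA_OK(cudaFree(d_C)) && ok;   // always free, even after a failure
    return ok;
}
```

Printing the runtime's own error string avoids guessing the cause: a message such as "Out of Nvidia device memory" would be misleading for any error other than a failed allocation.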

Ok. I am already checking cuda errors after the kernel launch like this:

cudaError = cudaMemcpy(h_C, d_C, sizeN, cudaMemcpyDeviceToHost);
if (cudaError != cudaSuccess)
    mexErrMsgTxt("Out of Nvidia device memory D.");
mexPrintf("Success D");

and I get "Success"…

That’s why it looks to me as if the threads are not launched: I never get errors, and the data I copy back is not modified.

Are we sure that the call:

rotationMatrix<<<blocksPerGrid, threadsPerBlock>>>(d_angles, d_output_mat, order, N, d_C);

will wait for the threads to finish their job?


PS: In my Cleanup() function I am using the function cudaThreadExit(), which looks a bit messy. Could it be related?

That call is asynchronous and never waits for thread completion. However, the next cudaMemcpy() will block until the threads are finished before copying any data.
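Because the launch is asynchronous, one way to see whether it was accepted at all is to query cudaGetLastError() right after the `<<<...>>>` statement, and then synchronize to pick up errors raised while the kernel runs. The sketch below assumes a stand-in kernel named fillIndex (modelled on the d_C test earlier in the thread) and a hypothetical helper launchAndCheck; neither is the poster's actual code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Stand-in kernel: each thread writes its global index.
__global__ void fillIndex(float* d_C, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_C[i] = (float)i;
}

// Hypothetical helper: launches the kernel and reports both kinds of failure.
cudaError_t launchAndCheck(float* d_C, int n, int threadsPerBlock)
{
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    fillIndex<<<blocksPerGrid, threadsPerBlock>>>(d_C, n);

    // 1) Errors detected at launch time (invalid configuration, too many
    //    resources requested) are available immediately.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        std::fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));
        return err;
    }

    // 2) Errors raised while the kernel executes show up once it has finished.
    //    On toolkits older than CUDA 4.0 the equivalent call was
    //    cudaThreadSynchronize().
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess)
        std::fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
    return err;
}
```

A launch that is rejected outright never runs any thread, which would match the symptom described above: the output buffer keeps whatever values it held before.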

Hi all,

My problem is finally fixed!
I finally caught an error message after calling a thread synchronization: it was cudaErrorLaunchOutOfResources.
So, after compiling my code with --ptxas-options=-v, I discovered that adding the product to my code increased the number of registers required by the kernel. Multiplying this slight increase by the number of threads put me over the maximum number of registers for the multiprocessor.
So I just reduced the number of threads…

thank you all for your help.
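The register arithmetic behind this fix can be sketched as follows. The thread does not state the actual register counts, so the numbers in the usage note are purely illustrative; the kernel scale and the helpers maxThreadsForRegisters and reportRegisterBudget are hypothetical names. The figure of 8192 registers is what the CUDA programming guide lists per multiprocessor for compute capability 1.0 and 1.1 devices.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Largest block size (rounded down to a multiple of the warp size) whose
// threads fit in the register file. Pure arithmetic: real hardware allocates
// registers in coarser units, so treat the result as an upper bound.
int maxThreadsForRegisters(int regsPerThread, int regsPerBlock, int warpSize)
{
    if (regsPerThread <= 0 || warpSize <= 0) return 0;
    int threads = regsPerBlock / regsPerThread;
    return (threads / warpSize) * warpSize;
}

// Stand-in kernel, only used to have something to query.
__global__ void scale(float* data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Prints the register figures for the stand-in kernel on device 0.
// Returns false if either query fails (for example, when no GPU is present).
bool reportRegisterBudget()
{
    cudaDeviceProp prop;
    cudaFuncAttributes attr;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return false;
    if (cudaFuncGetAttributes(&attr, scale) != cudaSuccess) return false;
    std::printf("registers per thread: %d\n", attr.numRegs);
    std::printf("registers per block : %d\n", prop.regsPerBlock);
    std::printf("estimated max threads per block: %d\n",
                maxThreadsForRegisters(attr.numRegs, prop.regsPerBlock, prop.warpSize));
    std::printf("runtime-reported max threads per block: %d\n",
                attr.maxThreadsPerBlock);
    return true;
}
```

For example, with an assumed 20 registers per thread and 8192 registers available, maxThreadsForRegisters(20, 8192, 32) gives 384: a block of 512 threads would need 10240 registers and be refused at launch, while 384 threads fit. The maxThreadsPerBlock field returned by cudaFuncGetAttributes() is the runtime's own limit for that kernel and is the safer number to rely on.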