Timing Error

When I time my kernel execution from my main .cu file, about 30% of the time I get a result of around 4000 ms compared to the normal 8-12 ms. I don’t know what it could be, as it is being run with the same input. I’m using CUT_SAFE_CALL(cutCreateTimer(timer)) where timer is an unsigned int. If you need more information, just post and I shall reply shortly.

Edit: Also, when I get results around 4000 ms, both the executable running the program and the mouse freeze.

Edit2: Could it be the way I have set up my loop to call a device function from my global function? I am new to CUDA and feel like I am not using the threads correctly.

for(i = threadIdx.x; i < nColumns; i++){
    for(j = threadIdx.y; j < nColumns; j++){
        x = j+i*nColumns;
        Covariance(cpuA, x, nColumns, result, vector, vector2);
    }
}

Ok, extract minimal code which reproduces the problem and post it here.

Usually you do that by

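dim3 threads(16,16);

dim3 grid(256,256);

device_kernel<<<grid, threads>>>();

which will do these cycles. In this case it will do 4096x4096 runs (256 blocks of 16 threads in each dimension). Note that a block is limited to 512 or 1024 threads in total, depending on the GPU, so 64x64 threads in a single block would fail to launch.

Of course you may put your own numbers instead of 16 and 256.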

Inside Covariance, the coordinates are:

const int ix = blockDim.x * blockIdx.x + threadIdx.x;

const int iy = blockDim.y * blockIdx.y + threadIdx.y;

You have threads & blocks, not just threads.
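
Putting that together, here is a minimal sketch of a kernel where each thread handles one (ix, iy) pair instead of looping over all of them. The names element_value, fill_matrix and run_fill_matrix are hypothetical stand-ins for the real Covariance code, and the 16x16 block size is only an assumption. The bounds check matters because the grid is rounded up and may contain more threads than matrix elements.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical stand-in for the per-element work done in Covariance.
__device__ float element_value(int ix, int iy)
{
    return (float)(ix + iy);
}

// One thread per (ix, iy) pair; this replaces the nested for loops.
__global__ void fill_matrix(float *result, int nColumns)
{
    const int ix = blockDim.x * blockIdx.x + threadIdx.x;
    const int iy = blockDim.y * blockIdx.y + threadIdx.y;

    // Threads that fall outside the matrix do nothing.
    if (ix >= nColumns || iy >= nColumns) return;

    const int x = iy + ix * nColumns;   // same index as x = j+i*nColumns
    result[x] = element_value(ix, iy);
}

// Host helper: launches the kernel and copies the matrix back.
// Returns an empty vector if any CUDA call fails.
std::vector<float> run_fill_matrix(int nColumns)
{
    std::vector<float> host((size_t)nColumns * nColumns, -1.0f);
    const size_t bytes = host.size() * sizeof(float);

    float *dev = NULL;
    if (cudaMalloc((void **)&dev, bytes) != cudaSuccess)
        return std::vector<float>();

    dim3 threads(16, 16);
    dim3 grid((nColumns + threads.x - 1) / threads.x,
              (nColumns + threads.y - 1) / threads.y);
    fill_matrix<<<grid, threads>>>(dev, nColumns);

    // cudaMemcpy waits for the kernel to finish before copying.
    cudaError_t err = cudaMemcpy(host.data(), dev, bytes,
                                 cudaMemcpyDeviceToHost);
    cudaFree(dev);
    if (err != cudaSuccess) return std::vector<float>();
    return host;
}
```

Calling run_fill_matrix(100) from main() would launch a 7x7 grid of 16x16 blocks, so each of the 100x100 elements gets its own thread and the extra threads at the edges exit early.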

Ok, thanks. I seem to have forgotten that, as I just started learning CUDA a week or so ago. With device_kernel<<<grid,threads>>>(); you say it will do 4096 runs. How does that work if I want to access a specific element in an array based on which run it is? I think this is the part that confuses me.

Edit: I think I understand the <<<grid,threads>>> after looking at it some more. I’ll post again if I still have problems with this.

Also, I’ve noticed that if I run the program in short succession (run, then run again right away) the first run is around 4000 ms, then the next couple are in the tens of ms. I’m wondering if this is a memory problem. Also, I’ve frozen up the computer a couple of times, which tells me that there is a memory problem somewhere, but it might not necessarily have to do with the timing problem.
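
A note on measuring this: a kernel launch returns to the host before the kernel has finished, and the first CUDA call in a process also pays a one-time setup cost, so a host-side timer can pick up more than the kernel itself. Below is a minimal sketch of timing with CUDA events after a warm-up launch. The names busy_kernel and time_kernel_ms are made up for illustration, and this only shows how to isolate the kernel time; it is not necessarily the cause of the 4000 ms readings.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel that just gives the GPU something to do.
__global__ void busy_kernel(float *data, int n)
{
    const int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

// Returns the elapsed time in ms of one timed launch, or -1 on error.
float time_kernel_ms(int n)
{
    float *dev = NULL;
    if (cudaMalloc((void **)&dev, n * sizeof(float)) != cudaSuccess)
        return -1.0f;
    cudaMemset(dev, 0, n * sizeof(float));

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // Warm-up launch so one-time startup cost is not measured.
    busy_kernel<<<blocks, threads>>>(dev, n);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    busy_kernel<<<blocks, threads>>>(dev, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);   // wait until the kernel has finished

    float ms = -1.0f;
    if (cudaGetLastError() == cudaSuccess)
        cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dev);
    return ms;
}
```

If the timed launch and the warm-up launch give very different numbers, the difference is startup cost rather than kernel time.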

In your post you say to call device_kernel<<<…>>>(…);
However, do I still use this format if I am calling the device function from inside my global in the same kernel?

No, you don’t. As far as syntax goes, calling a device function from another device function (or from your global function) is just like calling a function in C.
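
As a minimal sketch of that (square, square_all and run_square_all are hypothetical names, not anything from the code in this thread): only the launch from the host uses <<<...>>>, while the call inside the kernel is an ordinary function call.

```cuda
#include <cuda_runtime.h>

// Device function: callable only from code running on the GPU.
__device__ float square(float v)
{
    return v * v;
}

// Global function (kernel): launched from the host with <<<...>>>.
__global__ void square_all(float *data, int n)
{
    const int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n)
        data[i] = square(data[i]);   // plain C-style call, no <<<...>>>
}

// Host helper: squares n floats in place. Returns false on a CUDA error.
bool run_square_all(float *host, int n)
{
    float *dev = NULL;
    const size_t bytes = n * sizeof(float);
    if (cudaMalloc((void **)&dev, bytes) != cudaSuccess) return false;
    if (cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice) != cudaSuccess) {
        cudaFree(dev);
        return false;
    }

    const int threads = 128;
    const int blocks = (n + threads - 1) / threads;
    square_all<<<blocks, threads>>>(dev, n);

    // cudaMemcpy waits for the kernel to finish before copying back.
    cudaError_t err = cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dev);
    return err == cudaSuccess;
}
```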

As far as your first post goes, it seems you have not grasped the parallel architecture provided with CUDA. No single post is going to help you understand it; you have to take the time to read the CUDA programming guide and to read and understand the basic SDK examples.

I suggest you read through the “transpose” example in the SDK and concentrate on understanding how the “naive” kernel works. This should help you understand how to parallelize your problems. Even if this example doesn’t give great results, you’ll have plenty of time to optimise your applications after you understand the basics of how to write them.

I understand the conceptual part of the parallel architecture; it’s more a question of syntax. I wasn’t sure how to utilize the numerous threads as opposed to just running something on the device. BarsMonster’s post clarified quite a bit for me. Staring at an example does little for me (not that I haven’t looked at numerous of them). At some point, the only way to learn is to try things out and ask for help when confused. I am at this point.

The suggestion from Bars seems to have fixed my problem. I still think I’m not cleaning up all of my memory quite correctly, but that can be found out later (learning C/CUDA in 1.5 weeks after programming in Java for years is interesting). Thank you very much :)