Timing Error

senorbum · June 16, 2008, 4:35pm

So when I time my kernel execution from my main .cu file, about 30% of the time I get a result of around 4000 ms compared to the normal 8-12 ms. I don’t know what it could be, as it is being run with the same input each time. I’m using CUT_SAFE_CALL(cutCreateTimer(timer)) where timer is an unsigned int. If you need more information, just post and I shall reply shortly.

Edit: Also, when I get the results around 4000 ms, the program freezes and so does the mouse.

Edit2: Could it be the way I have set up my loop to call a device function from my global function? I am new to CUDA and feel like I am not using the threads correctly.

for (i = threadIdx.x; i < nColumns; i++) {
    for (j = threadIdx.y; j < nColumns; j++) {
        x = j + i * nColumns;
        Covariance(cpuA, x, nColumns, result, vector, vector2);
    }
}

BarsMonster · June 16, 2008, 4:54pm

Ok, extract minimal code which reproduces the problem and post it here.

BarsMonster · June 16, 2008, 4:59pm

Usually you do that by

dim3 threads(16,16);

dim3 grid(256,256);

device_kernel<<<grid, threads>>>();

which replaces those loops. In this case it will do 4096x4096 runs (256 blocks of 16 threads in each dimension).

Of course you may put your own numbers instead, as long as threads.x * threads.y stays within the hardware limit on threads per block.

Inside Covariance the coordinates are:

const int ix = blockDim.x * blockIdx.x + threadIdx.x;

const int iy = blockDim.y * blockIdx.y + threadIdx.y;

You have threads & blocks, not just threads.
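
To make the mapping from "run" to array element concrete, here is a minimal sketch of the idea above. The kernel name fillIndices, the helper flatIndex, and the size nColumns = 100 are assumptions made up for illustration; only the index formulas come from the thread. Each thread computes its own (ix, iy) pair and handles one element, so the two nested loops disappear.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Flat index, same formula as x = j + i * nColumns in the original loop.
// Marked __host__ __device__ only so it can also be checked on the CPU.
__host__ __device__ inline int flatIndex(int i, int j, int nColumns)
{
    return j + i * nColumns;
}

// Hypothetical kernel: one thread per (ix, iy) pair instead of two loops.
__global__ void fillIndices(int *out, int nColumns)
{
    const int ix = blockDim.x * blockIdx.x + threadIdx.x;
    const int iy = blockDim.y * blockIdx.y + threadIdx.y;

    // The grid is rounded up, so some threads fall outside the matrix.
    if (ix >= nColumns || iy >= nColumns) return;

    const int x = flatIndex(ix, iy, nColumns);
    out[x] = x;  // placeholder for the real per-element work
}

int main()
{
    const int nColumns = 100;               // illustrative size, not from the thread
    const int count = nColumns * nColumns;
    int *dOut = NULL;
    if (cudaMalloc((void **)&dOut, count * sizeof(int)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }

    dim3 threads(16, 16);                   // 256 threads per block
    dim3 grid((nColumns + threads.x - 1) / threads.x,
              (nColumns + threads.y - 1) / threads.y);
    fillIndices<<<grid, threads>>>(dOut, nColumns);

    int last = -1;
    // cudaMemcpy waits for the kernel to finish before copying.
    cudaMemcpy(&last, dOut + (count - 1), sizeof(int), cudaMemcpyDeviceToHost);
    printf("last element holds %d (expected %d)\n", last, count - 1);

    cudaFree(dOut);
    return 0;
}
```

The bounds check matters because the grid size is rounded up to a whole number of blocks; without it, the extra threads would write past the end of the array.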

senorbum · June 16, 2008, 6:12pm

Ok, thanks. I seem to have forgotten that, as I just started learning CUDA a week or so ago. With device_kernel<<<grid,threads>>>(); you say it will do 4096x4096 runs. How does that work if I want to access a specific element in an array based on which run it is? I think this is the part that confuses me.

Edit: I think I understand the <<<grid,threads>>> after looking at it some more. I’ll post again if I still have problems with this.

Also, I’ve noticed that if I run the program in short succession (run, then run again right away), the first run takes around 4000 ms, then the next couple are in the tens of ms. I’m wondering if this is a memory problem. Also, I’ve frozen up the computer a couple of times, which tells me that there is a memory problem somewhere, but it might not necessarily have to do with the timing problem.
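
A slow first measurement followed by fast ones is often one-time start-up cost rather than a memory bug, and since kernel launches return before the kernel has finished, a timer also has to wait for the GPU before it is read. Below is a minimal sketch of timing that accounts for both. It uses CUDA events instead of the cutil timer from the first post, and the kernel work and the helpers blocksFor and timeLaunchMs are hypothetical names made up for the example. It would not explain the freezes.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical placeholder kernel standing in for the real work.
__global__ void work(float *data, int n)
{
    const int idx = blockDim.x * blockIdx.x + threadIdx.x;
    if (idx < n) data[idx] = 2.0f * data[idx] + 1.0f;
}

// Number of blocks needed to cover n elements (rounds up).
int blocksFor(int n, int threadsPerBlock)
{
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Times one launch of `work` in milliseconds using CUDA events.
float timeLaunchMs(float *dData, int n)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    const int threads = 256;
    cudaEventRecord(start, 0);
    work<<<blocksFor(n, threads), threads>>>(dData, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);  // wait until the kernel has really finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const int n = 1 << 20;  // illustrative size
    float *dData = NULL;
    if (cudaMalloc((void **)&dData, n * sizeof(float)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(dData, 0, n * sizeof(float));

    // The first launch absorbs one-time start-up costs; report it separately.
    const float warmupMs = timeLaunchMs(dData, n);
    const float steadyMs = timeLaunchMs(dData, n);
    printf("warm-up: %.3f ms, steady: %.3f ms\n", warmupMs, steadyMs);

    cudaFree(dData);
    return 0;
}
```

If the steady-state number is stable while only the first launch is slow, the kernel itself is probably fine and the outlier is start-up overhead.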

senorbum · June 16, 2008, 6:16pm

In your post you say to call device_kernel<<<…>>>(…);
However, do I still use this format if I am calling the device function from inside my global in the same kernel?

Ailleur · June 16, 2008, 6:19pm

No, you don’t. Calling a device function from another device function (or from a global function) is just like calling a function in C as far as syntax goes.

As far as your first post goes, it seems you have not grasped the parallel architecture provided with CUDA. No single post is going to help you understand it; you have to take the time to read the CUDA programming guide and read and understand the basic SDK examples.

I suggest you read through the “transpose” example in the SDK and concentrate on understanding how the “naive” kernel works. This should help you understand how to parallelize your problems. Even if this example doesn’t give great results, you’ll have plenty of time to optimise your applications after you understand the basics of how to write them.
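
As a small illustration of that syntax, the sketch below calls a helper from inside a kernel with an ordinary function call; only the launch from host code uses <<< >>>. The names squareDiff and squaredDiffs are hypothetical stand-ins (squareDiff plays the role of something like Covariance), and the helper is marked __host__ __device__ rather than plain __device__ only so that it can also be checked on the CPU.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper, callable from kernels (and from host code for testing).
__host__ __device__ float squareDiff(float a, float b)
{
    const float d = a - b;
    return d * d;
}

// Kernel: each thread processes one element and calls the helper directly.
__global__ void squaredDiffs(const float *a, const float *b, float *out, int n)
{
    const int idx = blockDim.x * blockIdx.x + threadIdx.x;
    if (idx < n) {
        out[idx] = squareDiff(a[idx], b[idx]);  // plain C-style call, no <<< >>>
    }
}

int main()
{
    const int n = 4;
    const float hA[n] = {1.0f, 2.0f, 3.0f, 4.0f};
    const float hB[n] = {0.0f, 4.0f, 3.0f, 1.0f};
    float hOut[n] = {0.0f, 0.0f, 0.0f, 0.0f};

    float *dA = NULL, *dB = NULL, *dOut = NULL;
    if (cudaMalloc((void **)&dA, sizeof(hA)) != cudaSuccess ||
        cudaMalloc((void **)&dB, sizeof(hB)) != cudaSuccess ||
        cudaMalloc((void **)&dOut, sizeof(hOut)) != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemcpy(dA, hA, sizeof(hA), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, sizeof(hB), cudaMemcpyHostToDevice);

    // Only the launch from host code uses the <<<grid, threads>>> syntax.
    squaredDiffs<<<1, 32>>>(dA, dB, dOut, n);
    cudaMemcpy(hOut, dOut, sizeof(hOut), cudaMemcpyDeviceToHost);

    for (int i = 0; i < n; ++i) printf("out[%d] = %f\n", i, hOut[i]);

    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dOut);
    return 0;
}
```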

senorbum · June 16, 2008, 6:46pm

I understand the conceptual part of the parallel architecture; it’s more a question of syntax. I wasn’t sure how to utilize the numerous threads as opposed to just running something on the device. BarsMonster’s post clarified quite a bit for me. Staring at an example does little for me (not that I haven’t looked at a number of them). At some point, the only way to learn is to try things out and ask for help when confused. I am at this point.

senorbum · June 16, 2008, 6:59pm

The suggestion from Bars seems to have fixed my problem. I still think I’m not cleaning up all of my memory quite correctly, but that can be sorted out later (learning C/CUDA in 1.5 weeks after programming in Java for years is interesting). Thank you very much :)
