Timing Question timing of a function not clear


i have here some strange timings and now i not sure why this happens.

ok my programm is something like that

__global__ void doSomething(float * a, float *b, float *c){

      copy c to shared (c has some parameters in it)


      make some calculations for indexing

      read 4 times from b

      write the calcresult to a


void doSome(float *a, float*b){


       calc float *c


       copy a,b to gpu

       dim3 block(16, 16);

       dim3 grid(512, 256);

       doSomething<<grid, block>>(a,b,c);


      delete b;

      delete c;


so now. if i run this programm without deleting anything it takes me 0.4 ms but if i add one of the deletes it takes me 280ms. could someone tell me why? i thought the programm is waiting until the doSomething call has finished. is there anything else running/waiting for something?


Calling the doSomething kernel returns immediately, and the kernel runs on the GPU in the background. However, as soon as you call one of the CUDA memory functions (I’m assuming you delete b and c with cudaFree()), the system will wait for the kernel to finish.

You can explicitly wait for the kernel to finish (which is good for timing studies) with cudaThreadSynchronize();

ok that explains something :-).


and there is the next question. why takes a simple write option like

float value = 9;

object[index] = value; //object is a array of some size copied from the cpu



maybe it is a good thing to post the full code so we can see where you are beginning with the timing and ending it. So to make sure that you do a cudaSynchronizeThreads() before you stop the timer.

Pseudo code doesn’t help here :)

Well, accessing global memory is rather expensive operation (in compare with arithmetics ones). However it is doubtful whether it can take 190 ms.

It’s hard to say what takes performance (btw probably you measure time in incorrect way). If you post your real code, the problem will be more clear.

jordyvaneijk seems we were writing our post concurrently :D

Are you doing this inside the kernel?

If so make sure you don’t want to write to global memory on the GPU that take a lot of time try shared or registers or something like that.

i do nothing more then


gpu_doSomething<<grid, block>>(object1, object2, size_x, size_y, size_z);



and in the gpu function:

__global__ void gpu_doSomething(float * object1, float * object2 float * config)


  ... calc tid of thread...

   float  value = object2[tid];

     ...calc the indexes...    

    if (rx_floor > 0 && ry_floor > 0 && rx_floor <= size[0] && ry_floor <= size[1])


        ....make some linear interpolation with object2 values...

        value = value + (xin1 + (xin2-xin1) *yd);

    	object1[tid] = value;  // <--------- this line takes about 200ms or more



if i start the programm without the object1[tid] = value; the programm needs 13,4 ms to complet. if i enable the line the programm needs 280ms. (more or less)

object1 is that already copied to the device? because it looks like it is still on the host this way and you constantly doing a read/write action over the pci-e bus :)

float * object1_cpu;

float * object2_cpu;

float * object1;

float * object2;

//..some init done by cpu...

//..fill object1_cpu and object2_cpu with some data from files

//memsize_o is the correct size of object1, object2, object1_cpu and object2_cpu

CUDA_SAFE_CALL(cudaMalloc((void**)&object1, memsize_o));

CUDA_SAFE_CALL(cudaMalloc((void**)&object2, memsize_o));

CUDA_SAFE_CALL(cudaMemcpy(object1, object1_cpu, memsize_o, cudaMemcpyHostToDevice));

CUDA_SAFE_CALL(cudaMemcpy(object2, object2_cpu, memsize_o, cudaMemcpyHostToDevice));


gpu_doSomething<<grid, block>>(object1, object2, size_x, size_y, size_z);



You have to consider nvcc optimization. Without the line I guess the compiler realizes that the kernel is pointless or large parts of the kernel are pointless so it will just optimize the kernel. That means he deletes code that he considers useless because the results of the code is not used anywhere.
When you add the line you start making sense with your code from the compiler viewpoint: your calculations are written somewhere and therefore not useless. So now the compiler probably doesn’t delete them.
So by adding that line you’re not simply adding that line but probably a lot more.

I think seb is right.
If kernel writes nothing to device memory compiler will remove dead code completely. I’ve seen this many times in my kernels. And IMO this is good from performance point of view.

In CUDA 1.1, events can be used to time GPU operations accurately without having to wonder if you’ve done the proper host-based synchronization before and after the operations you are timing. Operations other than non-async host<->device memcpy are performed asynchronously, so CUDA functions often return before the operation has been completed. This is equally true of cudaEventRecord (or cuEventRecord); but timestamps are recorded at the same time as the events, so you can later use cu(da)EventElapsedTime to compute the difference between the timestamps of two events.

You do have to make sure both events have been recorded, this can be done with cu(da)EventQuery or cu(da)EventSynchronize.

So instead of

// begin host-based timing
// operations to be timed
// end host-based timing

You would do something like:
cudaEvent_t evStart, evStop;
cudaEventRecord( evStart, 0 );
// operations to be timed
cudaEventRecord( evStop, 0 );
// intervening operations, if desired
cudaEventSynchronize( evStop );
float timeInMs;
cudaEventElapsedTime( &timeInMs, evStart, evStop );

The code isn’t any simpler, but it is portable (no need to abstract host-based timing mechanisms, QueryPerformanceCounter vs. gettimeofday() etc.) and it is not subject to the vagaries of what operations preceded the first cudaEventRecord or came after the second cudaEventRecord.

Events are best to measure download/process/upload sequences, or to measure the amount of time a specific kernel took to execute. For total wallclock times (e.g. to compare CPU v. GPU performance), host-based timing mechanisms are preferable.

For more examples of how to use events for timing, check out asyncAPI and simpleStreams samples in the CUDA 1.1 SDK.


ok i 'll have a look but this does not explain why the code needs 13ms without this statment and 280ms with this statement. i have many other kernels where the write operation to memory goes faster than that. any other ideas why this could happen.

Without this statement nvcc removes the dead code (probably most of your code). This is why you’re getting as little as 13 ms.