How to Implement Performance Metrics in CUDA C/C++

Originally published at:

In the first post of this series we looked at the basic elements of CUDA C/C++ by examining a CUDA C/C++ implementation of SAXPY. In this second post we discuss how to analyze the performance of this and other CUDA C/C++ codes. We will rely on these performance measurement techniques in future posts where performance optimization…

When the runtime gets too long, I run into a problem: the reported elapsed time is zero or very low!

::::MY CODE::::
cudaEvent_t start, stop;
float elapsedTime;


Mult(A, B, C, blockSize, gridSize);


cudaEventElapsedTime(&elapsedTime, start,stop);

printf("Elapsed time : %.2f ms\n" ,elapsedTime);
printf("Elapsed time : %.2f s\n" ,elapsedTime/1000);
printf("Elapsed time : %.2f min\n" ,(elapsedTime/1000)/60);
printf("Elapsed time : %.2f h\n" ,((elapsedTime/1000)/60)/60);

::::OUTPUT (A[16384x16384], B[16384x16384])::::

Elapsed time : 0.00 ms
Elapsed time : 0.00 s
Elapsed time : 0.00 min
Elapsed time : 0.00 h

I counted 8 seconds for it to run!

Would it make sense to use "N*sizeof(float)" rather than "N*4" for the bandwidth calculation?

If I want to calculate the bandwidth for double precision, is it enough to change N*4*3 to N*8*3? In my opinion, the bandwidth in double precision should be lower than in single precision. I tried this test and got:
bandwidth in single precision: 30 GB/s
bandwidth in double precision: 45 GB/s
I don't think this is correct. Does anybody have an idea?

You would also need to change the code to use double instead of float. Memory bandwidth is not directly related to datatype. It may be that your array is not large enough to saturate memory bandwidth, so by changing from float to double you may be better utilizing memory bandwidth. But the computational throughput of double vs. single may also come into play. I would need to know what GPU you are running on and how big N is to reason about this better.


You are probably hitting the OS watchdog timer. If you are on a system where the GPU is attached to a display, it will not allow a kernel to run for more than a couple of seconds before killing it.
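One way to check whether the kernel was actually killed is to test the return codes around the timed region. The following is a sketch under assumptions, not the poster's code: `placeholderKernel` stands in for `Mult`, and the `CHECK` macro is a hypothetical helper. A kernel stopped by the watchdog typically shows up as an error from the next synchronizing call, here `cudaEventSynchronize`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: report a failed CUDA runtime call and leave main().
#define CHECK(call)                                        \
    do {                                                   \
        cudaError_t err_ = (call);                         \
        if (err_ != cudaSuccess) {                         \
            fprintf(stderr, "%s failed: %s\n", #call,      \
                    cudaGetErrorString(err_));             \
            return 1;                                      \
        }                                                  \
    } while (0)

// Placeholder kernel standing in for the poster's Mult(); not the original code.
__global__ void placeholderKernel(float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * i;
}

int main() {
    const int n = 1 << 20;
    float *d_out = nullptr;
    CHECK(cudaMalloc(&d_out, n * sizeof(float)));

    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));

    CHECK(cudaEventRecord(start));
    placeholderKernel<<<(n + 255) / 256, 256>>>(d_out, n);
    CHECK(cudaGetLastError());          // errors detected at launch time
    CHECK(cudaEventRecord(stop));
    CHECK(cudaEventSynchronize(stop));  // errors raised while the kernel ran surface here

    float ms = 0.0f;
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    printf("Elapsed time: %.3f ms\n", ms);

    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaFree(d_out));
    return 0;
}
```

If any of these calls fails, the elapsed time should not be trusted, which would be consistent with a printed value of 0.00 ms.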

Hey, thanks for your answer.
I have a GeForce GT 750M. N is the same as in your example (N = 20 * (1 << 20);). I didn't make any other changes to the code; I just replaced float with double.
My board has a 128-bit bus, I think (the code reports 128 bits, while online sources say 2x128 bits), so according to your formula my theoretical bandwidth is 30 GB/s (and that is what the program outputs for float), but for double I don't see what the problem is.

Thanks for this great tutorial.
I have a question: I'm preparing to give a training course at an automotive company. Can I use the examples from this tutorial and the other tutorials here if I cite the source? Can you also point me to other NVIDIA tutorials that I can use?

Why is there such a big difference between the theoretical bandwidth and the effective bandwidth?
They should be almost the same, but the difference is about 30 GB/s.

*My first question:
cudaEventSynchronize() blocks CPU execution until the specified event is recorded. In the example shown, the cudaEventRecord(end) comes before the cudaMemcpy() call. Both were issued to the same stream (stream 0), so from the device's perspective cudaMemcpy() executes after cudaEventRecord(end). In this case, I suppose we don't need to add cudaEventSynchronize() after the cudaMemcpy(): because cudaMemcpy() is synchronous with the host, once the copy back from the device has finished we can be certain that the event was recorded.

*My second question:
When measuring the effective bandwidth, I suppose we should place the cudaEventRecord() calls around the cudaMemcpy() and not around the kernel execution!

Hi Azri, neither of those are questions, but I'll try to answer anyway.
1. Correct.
2. That depends. If you want to measure GPU device memory bandwidth achieved by the kernel, then time the kernel. If you want to measure host-device interconnect bandwidth achieved, then time the memcpy.
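The two measurements can be taken side by side. The sketch below assumes the SAXPY kernel from the post and uses hypothetical event names (`kStart`, `kStop`, `cStart`, `cStop`); error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int N = 20 * (1 << 20);
    const size_t bytes = N * sizeof(float);

    float *x = (float *)malloc(bytes);
    float *y = (float *)malloc(bytes);
    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

    float *d_x, *d_y;
    cudaMalloc(&d_x, bytes);
    cudaMalloc(&d_y, bytes);
    cudaMemcpy(d_x, x, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y, bytes, cudaMemcpyHostToDevice);

    cudaEvent_t kStart, kStop, cStart, cStop;
    cudaEventCreate(&kStart); cudaEventCreate(&kStop);
    cudaEventCreate(&cStart); cudaEventCreate(&cStop);

    // Events around the kernel: device memory bandwidth achieved by the kernel.
    cudaEventRecord(kStart);
    saxpy<<<(N + 511) / 512, 512>>>(N, 2.0f, d_x, d_y);
    cudaEventRecord(kStop);

    // Events around the copy: host-device interconnect bandwidth.
    cudaEventRecord(cStart);
    cudaMemcpy(y, d_y, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(cStop);
    cudaEventSynchronize(cStop);

    float kernelMs = 0.0f, copyMs = 0.0f;
    cudaEventElapsedTime(&kernelMs, kStart, kStop);
    cudaEventElapsedTime(&copyMs, cStart, cStop);

    // Kernel: read x, read y, write y. Copy: one pass over y.
    printf("Kernel effective bandwidth: %.1f GB/s\n", 3.0 * bytes / kernelMs / 1e6);
    printf("Copy bandwidth            : %.1f GB/s\n", 1.0 * bytes / copyMs / 1e6);

    cudaEventDestroy(kStart); cudaEventDestroy(kStop);
    cudaEventDestroy(cStart); cudaEventDestroy(cStop);
    cudaFree(d_x); cudaFree(d_y);
    free(x); free(y);
    return 0;
}
```

The copy figure is normally much lower than the kernel figure, because the interconnect is slower than device memory.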

thanks ^^

I suppose not all instructions within a period of time involve memory transfers?

Do bandwidth and throughput increase when we increase the input array size? Can someone please explain?

This is really helpful! Thank you for your tips!
By the way, I have a stupid question about my code.
__global__ void HelloGPU() {
    printf("hello Worlds.\n");
}

int main() {
    float ms;
    cudaEvent_t start, end;

    cudaEventRecord(start, 0);
    HelloGPU<<<10, 1>>>(); // or HelloGPU<<<1, 10>>>();
    cudaEventRecord(end, 0);
    cudaEventElapsedTime(&ms, start, end);
    printf("GPU = %.20lf\n", ms);
}
This code is for comparing the speed of the CPU and GPU versions.
However, I got 0.153 sec on the GPU versus 0.003 sec on the CPU, where the CPU version is a 'for' loop with the same statements.
I thought the GPU would be much faster than the CPU, but my result shows the opposite.
Can I get a hint, or can you tell me what I missed?

Thank you!

Hi! My answer may not be entirely correct. For your reference, a GPU has many cores.
A CPU, on the other hand, has only a few cores, but each one is much more powerful than a GPU core.
So the GPU is suited to compute-intensive programs.
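Two further factors may contribute, offered as likely causes rather than a diagnosis. The code in the question never creates the events and never waits for the second one, so the reported time is unreliable. In addition, the first kernel launch in a process usually pays one-time setup costs, and a kernel that only prints is dominated by overhead rather than computation. A sketch that times two consecutive launches, using a hypothetical helper `timeLaunchMs`:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void HelloGPU() {
    printf("hello Worlds.\n");
}

// Hypothetical helper: time a single launch of HelloGPU with CUDA events.
float timeLaunchMs(int blocks, int threads) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);           // events must be created before use
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    HelloGPU<<<blocks, threads>>>();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);        // wait until the stop event has completed

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main() {
    float first  = timeLaunchMs(10, 1);  // may include one-time setup costs
    float second = timeLaunchMs(10, 1);  // closer to the steady-state cost
    printf("first launch : %f ms\n", first);
    printf("second launch: %f ms\n", second);
    return 0;
}
```

Even the second figure mostly reflects launch and printing overhead; ten threads printing one line each is far too little work for a GPU to show any advantage over a CPU loop.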

I am sorry for my late reply, but now I understand why!
Thank you!^^

I wonder if there is a way to measure the runtime of a kernel inside the device code. To be a little more specific, I have a kernel that I run on N blocks each with M threads, like this:

My_Kernel<<<N, M>>>();

I'm running N embarrassingly parallel simulations that are each assigned to one block of M threads. So, computations in each block are independent of other blocks. I need to measure the runtime of each block (simulation).
I would be so thankful if you can kindly suggest a way to do so.
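One possible approach, sketched under assumptions rather than given as a definitive answer: read the per-multiprocessor cycle counter with `clock64()` at the start and end of each block and store the difference per block. The kernel `timedKernel`, its placeholder workload and the array names are hypothetical. The conversion to milliseconds uses the peak clock rate reported by the runtime, so it is only approximate if the GPU runs at a different frequency.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for My_Kernel: each block works on its own slice
// and thread 0 stores the number of clock cycles the block took.
__global__ void timedKernel(float *data, int perBlock, long long *cycles) {
    long long t0 = clock64();

    // Placeholder workload for one independent simulation.
    for (int i = threadIdx.x; i < perBlock; i += blockDim.x) {
        int idx = blockIdx.x * perBlock + i;
        data[idx] = 2.0f * data[idx] + 1.0f;
    }

    __syncthreads();  // wait until every thread of this block has finished
    if (threadIdx.x == 0) cycles[blockIdx.x] = clock64() - t0;
}

int main() {
    const int N = 8, M = 128, perBlock = 1 << 16;
    const size_t dataBytes = (size_t)N * perBlock * sizeof(float);

    float *d_data;
    long long *d_cycles;
    cudaMalloc(&d_data, dataBytes);
    cudaMemset(d_data, 0, dataBytes);
    cudaMalloc(&d_cycles, N * sizeof(long long));

    timedKernel<<<N, M>>>(d_data, perBlock, d_cycles);

    std::vector<long long> cycles(N);
    cudaMemcpy(cycles.data(), d_cycles, N * sizeof(long long),
               cudaMemcpyDeviceToHost);

    int clockKHz = 0;  // peak clock in kHz, i.e. cycles per millisecond
    cudaDeviceGetAttribute(&clockKHz, cudaDevAttrClockRate, 0);

    for (int b = 0; b < N; b++)
        printf("block %d: %lld cycles (~%.3f ms)\n",
               b, cycles[b], (double)cycles[b] / clockKHz);

    cudaFree(d_data);
    cudaFree(d_cycles);
    return 0;
}
```

Each block's figure covers only the time that block spent executing; it does not include launch overhead or time spent waiting to be scheduled.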

Apart from GPU device memory bandwidth, are we not also counting the time taken for the multiply-add operation?