How to Implement Performance Metrics in CUDA C/C++

Originally published at: https://developer.nvidia.com/blog/how-implement-performance-metrics-cuda-cc/

In the first post of this series we looked at the basic elements of CUDA C/C++ by examining a CUDA C/C++ implementation of SAXPY. In this second post we discuss how to analyze the performance of this and other CUDA C/C++ codes. We will rely on these performance measurement techniques in future posts where performance optimization…

When the runtime gets long, something goes wrong: the reported elapsed time is zero or very low!

::::MY CODE::::
cudaEvent_t start, stop;
float elapsedTime;

cudaEventCreate(&start);
cudaEventRecord(start, 0);

Mult(A, B, C, blockSize, gridSize);

cudaEventCreate(&stop);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);

cudaEventElapsedTime(&elapsedTime, start, stop);

printf("Elapsed time : %.2f ms\n", elapsedTime);
printf("Elapsed time : %.2f s\n", elapsedTime / 1000);
printf("Elapsed time : %.2f min\n", (elapsedTime / 1000) / 60);
printf("Elapsed time : %.2f h\n", ((elapsedTime / 1000) / 60) / 60);

::::OUTPUT (A[16384x16384], B[16384x16384])::::

Elapsed time : 0.00 ms
Elapsed time : 0.00 s
Elapsed time : 0.00 min
Elapsed time : 0.00 h

I counted about 8 seconds of wall-clock time!

Would it make sense to use "N*sizeof(float)" rather than "N*4" for the bandwidth calculation?

Hi,
If I want to calculate the bandwidth for double precision, is it enough to change N*4*3 to N*8*3? In my opinion the bandwidth in double precision should be lower than in single precision. I have tried this test and got:
bandwidth in single precision: 30 GB/s
bandwidth in double precision: 45 GB/s
I don't think this is correct. Does anybody have an idea?
Thanks :)

You would also need to change the code to use double instead of float. Memory bandwidth is not directly related to datatype. It may be that your array is not large enough to saturate memory bandwidth, so by changing from float to double you may be better utilizing memory bandwidth. But the computational throughput of double vs. single may also come into play. I would need to know what GPU you are running on and how big N is to reason about this better.

Yes.
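For reference, a minimal sketch of that calculation using sizeof, following the SAXPY example from the post (the factor of 3 is the two reads plus one write per element; start, stop, and N are assumed from the post's timing code):

float milliseconds = 0;
cudaEventElapsedTime(&milliseconds, start, stop);

// Bytes moved: N elements * size of each element * 3 accesses (2 reads + 1 write).
// Dividing by the time in ms and by 1e6 gives GB/s; swap sizeof(float) for
// sizeof(double) if the arrays hold doubles.
printf("Effective Bandwidth (GB/s): %f\n",
       N * sizeof(float) * 3 / milliseconds / 1e6);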

You are probably hitting the OS watchdog timer. If you are on a system where the GPU is attached to a display, the OS will not allow a kernel to run for more than a couple of seconds before killing it.
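One way to confirm that is to check the error codes the CUDA calls return; if the watchdog kills the kernel, the next synchronizing call reports a launch timeout. A hedged sketch against the code above (the CHECK macro is illustrative, not part of the post):

#define CHECK(call) \
do { \
    cudaError_t err = (call); \
    if (err != cudaSuccess) \
        printf("CUDA error: %s at %s:%d\n", cudaGetErrorString(err), __FILE__, __LINE__); \
} while (0)

CHECK(cudaEventRecord(start, 0));
Mult(A, B, C, blockSize, gridSize);   // the kernel launch from the code above
CHECK(cudaGetLastError());            // catches launch-configuration errors
CHECK(cudaEventRecord(stop, 0));
CHECK(cudaEventSynchronize(stop));    // a watchdog kill is reported here
CHECK(cudaEventElapsedTime(&elapsedTime, start, stop));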

Hey, thanks for your answer.
I have a GeForce GT 750M. N is the same number as in your example (N = 20 * (1 << 20);). I didn't make any other changes to the code; I just replaced float with double.
My board has a 128-bit bus, I think (the code reports 128 bits and online sources say 2x128 bits), so the theoretical bandwidth from your formula comes out to about 30 GB/s (and that is what the program reports), but for double I don't see what the problem is.

Thanks for this great tutorial.
I have a question: I'm preparing to give a training at an automotive company. May I use the examples from this tutorial and the other tutorials here, crediting the source? Can you also point me to other tutorials by NVIDIA that I could use?

Hi,
Why is there such a big difference between the theoretical bandwidth and the effective bandwidth? They should be almost the same, but the difference is about 30 GB/s.

Hi,
* My first question: cudaEventSynchronize() blocks CPU execution until the specified event has been recorded. In the example shown, cudaEventRecord(end) comes before the cudaMemcpy() call. Both were issued to the same stream (stream 0), so cudaMemcpy() will execute after cudaEventRecord(end) from the device's perspective. In this case, I suppose we don't need cudaEventSynchronize() after the cudaMemcpy(), because cudaMemcpy() is synchronous with the host, so after the copy back from the device we are certain that the event has been recorded.

* My second question: when measuring the effective bandwidth, I suppose we should place the cudaEventRecord() calls around the cudaMemcpy() and not around the kernel execution!

Hi Azri, neither of those are questions, but I'll try to answer anyway.
1. Correct.
2. That depends. If you want to measure GPU device memory bandwidth achieved by the kernel, then time the kernel. If you want to measure host-device interconnect bandwidth achieved, then time the memcpy.
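To make the two placements concrete, a hedged sketch using the saxpy kernel from this post series (d_x, d_y, h_y, N, and the ms variables stand in for the post's setup):

// (a) Device memory bandwidth achieved by the kernel: bracket only the kernel.
cudaEventRecord(start, 0);
saxpy<<<(N + 255) / 256, 256>>>(N, 2.0f, d_x, d_y);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&msKernel, start, stop);

// (b) Host-device interconnect bandwidth: bracket only the transfer.
cudaEventRecord(start, 0);
cudaMemcpy(h_y, d_y, N * sizeof(float), cudaMemcpyDeviceToHost);
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&msCopy, start, stop);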

thanks ^^

I suppose not all instructions executed during the timed period involve memory transfers?

Do bandwidth and throughput increase when we increase the input array size? Can someone please explain?

This is really helpful! Thank you for your tips!
By the way, I have a stupid question about my code.
=================================
#include <cstdio>
#include <cuda_runtime.h>

__global__
void HelloGPU() {
    printf("hello Worlds.\n");
}

int main() {
    float ms;
    cudaEvent_t start, end;
    cudaEventCreate(&start);
    cudaEventCreate(&end);

    cudaEventRecord(start, 0);
    HelloGPU<<<10, 1>>>();   // or HelloGPU<<<1, 10>>>();
    cudaEventRecord(end, 0);
    cudaEventSynchronize(end);
    cudaEventElapsedTime(&ms, start, end);
    printf("GPU = %.20lf\n", ms);
    return 0;
}
=================================
This code is for comparing the speed of the CPU and the GPU.
However, I got 0.153 s on the GPU versus 0.003 s on the CPU version, which was just a 'for' loop with print statements.
I thought the GPU would be much faster than the CPU, but my result says otherwise.
Can you give me a hint, or tell me what I am missing?

Thank you!

Hi! My answer may not be completely correct, but for your reference: a GPU has many cores, while a CPU has only a few cores that are individually much stronger than GPU cores. So the GPU is suited to compute-intensive, highly parallel programs.
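To see that in a measurement rather than with a printf kernel, here is a small self-contained sketch (my own, with illustrative names and sizes) that times a simple element-wise kernel after an untimed warm-up launch, so one-time setup cost is not counted:

#include <cstdio>
#include <cuda_runtime.h>

// Enough independent element-wise work to occupy the GPU's many cores.
__global__ void scale(int n, float a, float *x) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = a * x[i];
}

int main() {
    const int N = 1 << 24;                        // 16M elements (illustrative)
    float *d_x;
    cudaMalloc(&d_x, N * sizeof(float));
    cudaMemset(d_x, 0, N * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    scale<<<(N + 255) / 256, 256>>>(N, 2.0f, d_x);   // warm-up launch (not timed)
    cudaDeviceSynchronize();

    cudaEventRecord(start, 0);
    scale<<<(N + 255) / 256, 256>>>(N, 2.0f, d_x);   // timed launch
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("Kernel time: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_x);
    return 0;
}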

I am sorry about my late reply, but now I can understand why!
Thank you! ^^

I wonder if there is a way to measure the runtime of a kernel inside the device code. To be a little more specific, I have a kernel that I run on N blocks, each with M threads, like this:

My_Kernel<<<N, M>>>();

I'm running N embarrassingly parallel simulations that are each assigned to one block of M threads. So, computations in each block are independent of other blocks. I need to measure the runtime of each block (simulation).
I would be so thankful if you can kindly suggest a way to do so.
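One common approach (not from this post, so treat it as a hedged sketch) is to read the per-SM cycle counter with clock64() at the start and end of each block and write the difference to global memory; the result is in clock cycles, which you can convert to time using the SM clock rate:

// Illustrative kernel skeleton: one cycle count per block.
__global__ void My_Kernel(long long *blockCycles /*, ...existing arguments... */) {
    long long t0 = clock64();            // cycle counter when this block starts

    // ... the per-block simulation goes here ...

    __syncthreads();                     // wait for every thread in the block
    if (threadIdx.x == 0)
        blockCycles[blockIdx.x] = clock64() - t0;   // elapsed cycles for this block
}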

Apart from GPU device memory bandwidth, are we not also counting the time taken for the multiply-add operation?