A few questions on CUDA performance with pictures!

_Marcel · January 9, 2009, 10:50am

Hi,

I’ve been testing the performance of a small algorithm that runs well on the GPU. To compare CPU and GPU performance, I started measuring the time taken by three essential operations: the two memory copies (host → device and device → host) and, of course, the computation itself.

(Hardware: 8800GTX, Intel Q6600, 4GB Ram)

After plotting the results for various input sizes (ranging from 2^1 up to 2^22) I found a few results that I can’t explain. Let me first show a plot of the first 15 tests. I’ve also included bars for the time it takes to cast the input from double to float, and back to double after the computation. That’s because I get the input from other software that only uses doubles, so I can’t get around that for my project.

External Media

Questions:

1. The most obvious result is that as soon as the input size is > 768, the time required for copying the input to the device increases sharply. I know that 768 is a recurring number for some parts, but I expected the computation to suffer from that, not the memory operations. Any explanation?
2. I was also wondering why the time required for the device → host copy scales linearly, while host → device seems constant for input sizes < 1024 and again for input sizes between 1024 and ~16000. This could just be due to timing differences, but if there’s any other explanation I’d really like to hear it.

The code that gets executed on the GPU isn’t very big, namely:

o[index] = const_a * __expf( -(a_m*a_m / const_b) );

for each element. For most input sizes (2^10 … 2^22) I’ve calculated each operation’s share of the total measured time, as in the figure above. The pie chart below shows that the actual computation takes about 3% of the total time (6% when ignoring the casting operations). Is this a number that can be expected for such a small GPU function?

This percentage ranges from around 20% to just 1% (@ 4M elements) depending on input size.

External Media

In the figure below I’ve plotted the time taken by the host <-> device memory operations and the actual computation for input sizes > 2^15. Is it correct that I’m seeing such a ‘big’ difference in time between host → device and device → host?

External Media

Lastly, the function I used to time the several operations (Linux OS):

#include <time.h>
#include <sys/time.h>

struct timeval tv;
double tt_1, tt_2, timer;   /* all in seconds */

gettimeofday(&tv, NULL);
tt_1 = tv.tv_sec + (tv.tv_usec / 1e6);

/* OPERATION */

gettimeofday(&tv, NULL);
tt_2 = tv.tv_sec + (tv.tv_usec / 1e6);

timer = tt_2 - tt_1;        /* elapsed wall-clock time */

I’ve also used cudaThreadSynchronize() after the kernel invocation.
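The same timing pattern can be wrapped in a small helper so that every measurement is taken the same way. The sketch below is an assumption-laden illustration, not the code used for the plots: `wall_seconds`, `time_operation`, and the stand-in workload `sum_to_n` are hypothetical names, and for GPU work the timed callback is assumed to end with a synchronizing call.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* Wall-clock time in seconds, with microsecond resolution. */
double wall_seconds(void)
{
    struct timeval tv;
    int rc = gettimeofday(&tv, NULL);
    assert(rc == 0);
    (void)rc;                           /* avoids an unused warning under NDEBUG */
    return (double)tv.tv_sec + (double)tv.tv_usec / 1e6;
}

/* Times a single call of op(ctx). For GPU work, op is assumed to finish
 * with a synchronizing call such as cudaThreadSynchronize(); otherwise
 * only the asynchronous launch would be measured. */
double time_operation(void (*op)(void *), void *ctx)
{
    double t0 = wall_seconds();
    op(ctx);
    return wall_seconds() - t0;
}

/* Hypothetical stand-in workload: replaces *ctx (a long n) with 0+1+...+(n-1). */
void sum_to_n(void *ctx)
{
    long *p = (long *)ctx;
    long n = *p, s = 0;
    for (long i = 0; i < n; ++i)
        s += i;
    *p = s;
}
```

A call such as `time_operation(sum_to_n, &n)` returns the elapsed seconds for one run. For very small inputs a single run is close to the timer's resolution, so averaging over many repetitions is likely to give steadier numbers. Note that later CUDA releases deprecate `cudaThreadSynchronize()` in favor of `cudaDeviceSynchronize()`.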

If anyone reached this far, am I talking nonsense or are these things explainable?

Thanks in advance!

MisterAnderson42 · January 9, 2009, 12:40pm

It is usually easier to think about these things in terms of bandwidth, rather than absolute time in seconds. Over PCI-e, you can reasonably expect to get ~4 GiB/s of copy bandwidth (using pinned memory). On an 8800 GTX, you can get a device memory bandwidth of ~70 GiB/s. In this light, you should be able to see why the GPU kernel time is such a low percentage in your pie chart. Avoiding host->device and device->host memcpys is essential to getting overall ~20-40x speedups. One way to do this is to put more steps of the algorithm on the GPU, so that the data does not have to be copied back and forth.
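To put the bandwidth argument into numbers, here is a rough back-of-the-envelope sketch. It assumes the kernel is purely bandwidth bound and reads and writes each float exactly once, and it uses the ~4 GiB/s and ~70 GiB/s figures quoted above; the function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define GIB (1024.0 * 1024.0 * 1024.0)

/* Seconds needed to move `bytes` at `gib_per_s` GiB/s. */
double transfer_seconds(double bytes, double gib_per_s)
{
    assert(gib_per_s > 0.0);
    return bytes / (gib_per_s * GIB);
}

/* Estimated fraction of (upload + kernel + download) time spent in the
 * kernel for n floats. Assumption: the kernel is limited only by device
 * memory bandwidth and touches each element once for reading and once
 * for writing. */
double kernel_fraction(size_t n, double pcie_gib_per_s, double device_gib_per_s)
{
    double bytes  = (double)n * sizeof(float);
    double copies = 2.0 * transfer_seconds(bytes, pcie_gib_per_s);   /* H2D + D2H */
    double kernel = transfer_seconds(2.0 * bytes, device_gib_per_s); /* read + write */
    return kernel / (copies + kernel);
}
```

For 2^22 floats, `kernel_fraction((size_t)1 << 22, 4.0, 70.0)` comes out at roughly 0.054, i.e. about 5% of the copy-plus-kernel time, which is in the same range as the 6% reported when the casts are ignored. With slower pageable-memory copies the kernel's share would be smaller still.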

I have no idea why your host->device memcpys are so much longer than device->host. Do you get this behavior in the SDK sample bandwidthTest, too?

Also, it shouldn’t be a matter of “just trying” to put cudaThreadSynchronize() after the kernel launch. It is an absolute requirement if you want to measure its running time. In fact, you should have a cudaThreadSynchronize() before every single wall clock measurement you make to be on the safe side.

_Marcel · January 9, 2009, 1:37pm

Thanks for your reply.

I did make the mistake of not including cudaThreadSynchronize() after the kernel call at first, but I quickly found out, since the reported times didn’t change with the input size.

The difference between host → device and device → host is also visible when testing with bandwidthTest() as shown below:

Running on......
      device 0:GeForce 8800 GTX
Quick Mode
Host to Device Bandwidth for Pageable memory
Transfer Size (Bytes)   Bandwidth(MB/s)
 33554432               1670.9

Quick Mode
Device to Host Bandwidth for Pageable memory
Transfer Size (Bytes)   Bandwidth(MB/s)
 33554432               1557.5

showing a difference of 6.8%. With pinned memory the difference is 6.0%, in favor of host → device. 6% is a lot less of a difference than what I’m seeing…

LinkStrife · January 9, 2009, 2:34pm

I have the same question… Including all the memcpys, the processing time on the GPU is higher than the CPU time.

Is it that for short algorithms this type of parallel approach doesn’t seem to work?

Cya

Fugl · January 9, 2009, 2:39pm

I’ve also done some benchmarks on a GF8800 GTX; you can see them in this post.

To summarize, I didn’t see any hiccups like yours on device-host or host-device transfers, but I did measure that device-device copy performance is relatively low for transfer sizes < 1024 bytes.

_Marcel · January 9, 2009, 10:26pm

That is indeed true and can also be seen when using the --mode=shmoo with the bandwidthTest() CUDA application from the SDK.

One of the questions I’m still trying to answer is why the ‘hiccup’ shows up for input sizes > 768 in the first plot.

Thanks for all your replies!

Have a nice weekend!

Homebody · January 10, 2009, 2:21am

Hi,

I’m using the GPU to solve a large set of differential-algebraic equations in the time domain, which basically means working with large matrices. At the beginning of my project, I used the CPU (C++) to compute the required matrices and then transferred them to the GPU for the matrix operations, using the CUBLAS library. I got very good results and a good speedup compared with doing everything on the CPU. My next step was to do all computations on the GPU and, at the end of the simulation, copy the results from device to host. As you mentioned here, I would expect more speedup compared with the first program, because I removed all the copy commands and did all the computations on the GPU. However, the first program is two times faster than the second one!! :(((

Do you have any explanation or suggestion for this case?

Thank you.

