# A few questions on CUDA performance with pictures!

Hi,

I’ve been testing the performance of a small algorithm that runs well on the GPU. To compare CPU and GPU performance, I started measuring the time taken by three essential operations: the memory copies (host -> device and device -> host) and, of course, the total computation.

(Hardware: 8800GTX, Intel Q6600, 4GB Ram)

After plotting the results for various input sizes (ranging from 2^1 up to 2^22) I found a few results that I can’t explain. Let me first show a plot of the first 15 tests. I’ve also included bars for the time it takes to cast the input from double to float before the computation and back to double after it; that’s because I get the input from other software that only uses doubles, so I can’t get around that for my project.

Questions:

• The most obvious result is that as soon as the input size exceeds 768, the time required for copying the input to the device increases sharply. I know that 768 is a recurring number for some parts of the hardware, but I expected the computation to suffer from that, not the memory operations. Any explanation?

• I was also wondering why the time required for the device -> host memory operation scales linearly, while host -> device seems constant for input sizes < 1024 and again for input sizes between 1024 and ~16000. This could just be due to timing differences, but if there’s another explanation I’d really like to hear it.

The code that gets executed on the GPU isn’t very big, namely:

```
o[index] = const_a * __expf( -(a_m*a_m / const_b) );
```

for each element. For most input sizes (2^10 … 2^22) I’ve calculated each operation’s percentage of the total measured time, as in the figure above. The pie chart below shows that the actual computation takes about 3% of the total time (6% when ignoring the casting operations). Is this a number that can be expected for such a small GPU function?

This percentage ranges from around 20% to just 1% (@ 4M elements) depending on input size.

In the figure below I’ve plotted the times for the host <-> device memory operations and the actual computation for input sizes > 2^15. Is it correct that I’m seeing such a ‘big’ difference in time between host -> device and device -> host?

Lastly, this is the code I used to time the various operations (Linux):

```
#include <time.h>
#include <sys/time.h>

struct timeval tv;
double tt_1, tt_2, timer;

gettimeofday(&tv, NULL);
tt_1 = tv.tv_sec + (tv.tv_usec / 1e6);

/* OPERATION */

gettimeofday(&tv, NULL);
tt_2 = tv.tv_sec + (tv.tv_usec / 1e6);

timer = tt_2 - tt_1;
```

I’ve also used cudaThreadSynchronize() after the kernel invocation.

If anyone reached this far, am I talking nonsense or are these things explainable?

It is usually easier to think about these things in terms of bandwidth rather than absolute time in seconds. Over PCI-e, you can reasonably expect to get ~4 GiB/s of copy bandwidth (using pinned memory). On an 8800 GTX, you can get a device memory bandwidth of ~70 GiB/s. In this light, you should be able to see why the GPU kernel time is such a low percentage in your pie chart. Avoiding host -> device and device -> host memcpys is essential to getting overall ~20-40x speedups. One way to do this is to put more steps of the algorithm on the GPU, avoiding the back-and-forth copying.

I have no idea why your host->device memcpys are so much longer than device->host. Do you get this behavior in the SDK sample bandwidthTest, too?

Also, it shouldn’t be a matter of “just trying” to put cudaThreadSynchronize() after the kernel launch. It is an absolute requirement if you want to measure its running time. In fact, you should have a cudaThreadSynchronize() before every single wall clock measurement you make to be on the safe side.

I did make the mistake of not including cudaThreadSynchronize() after the kernel call at first, but quickly found out, since the reported times didn’t change with the input size.

The difference between host -> device and device -> host is also visible when testing with bandwidthTest, as shown below:

```
Running on......
device 0: GeForce 8800 GTX

Quick Mode
Host to Device Bandwidth for Pageable memory
Transfer Size (Bytes)   Bandwidth(MB/s)
33554432                1670.9

Quick Mode
Device to Host Bandwidth for Pageable memory
Transfer Size (Bytes)   Bandwidth(MB/s)
33554432                1557.5
```

showing a difference of 6.8%. With pinned memory the difference is 6.0%, also in favour of host -> device. 6% is a lot less of a difference than what I’m seeing…

I have the same question… including all the memcpys, the processing time on the GPU is higher than the CPU time.

Is it that for short algorithms this type of parallel approach just doesn’t work?

Cya

I’ve also run some benchmarks on a GF8800 GTX; you can see them in this post.

To summarize, I didn’t see any hiccups like yours on device-host or host-device transfers, but I did measure that device-device copy performance is relatively low for transfer sizes < 1024 bytes.

That is indeed true, and it can also be seen when running the SDK’s bandwidthTest application with --mode=shmoo.

One of the questions I’m still looking for an answer to is why the ‘hiccup’ appears for input sizes > 768 in the first plot.