Hi,
I’ve been testing the performance of a small algorithm that runs well on the GPU. To compare CPU and GPU performance, I started measuring the time taken by three essential operations: the memory copies (host → device and device → host) and, of course, the computation itself.
(Hardware: GeForce 8800 GTX, Intel Q6600, 4 GB RAM)
After plotting the results for various input sizes (ranging from 2^1 up to 2^22), I found a few results I can’t explain. Let me first show a plot of the first 15 tests. I’ve also included bars for the time it takes to cast the input from double to float (and back after the computation); the input comes from other software that only uses doubles, so I can’t get around that in my project.
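For completeness, the cast step itself is nothing more than an element-wise conversion; a minimal sketch of what I mean (the function and buffer names are just placeholders I’m using for illustration, not my actual code):

#include <stddef.h>

/* convert the double input coming from the other software into the float
   buffer that gets copied to the device; the reverse direction is analogous */
void doubles_to_floats(const double *in, float *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = (float)in[i];
}

void floats_to_doubles(const float *in, double *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = (double)in[i];
}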
Questions:
- The most obvious result is that as soon as the input size exceeds 768, the time required for copying the input to the device increases sharply. I know 768 is a recurring number for some parts of this hardware, but I expected the computation to suffer from that, not the memory operations. Any explanation?
- I was also wondering why the time required for the device → host copy scales linearly, while the host → device copy seems constant for input sizes < 1024, and again for input sizes between 1024 and ~16000. This could just be down to timing noise, but if there’s another explanation I’d really like to hear it.
The code that gets executed on the GPU isn’t very big; for each element it just computes:
o[index] = const_a * __expf( -(a_m*a_m / const_b) );
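To give some context, here is a minimal sketch of how the complete kernel could look (the kernel name, the input array a and the launch setup are just placeholders I’ve made up; const_a, const_b, a_m and o are the names from the line above):

__global__ void small_kernel(const float *a, float *o,
                             float const_a, float const_b, int n)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;   /* one thread per element */
    if (index < n) {
        float a_m = a[index];
        o[index] = const_a * __expf( -(a_m * a_m / const_b) );
    }
}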
For most input sizes (2^10 … 2^22) I’ve calculated each operation’s percentage of the total measured time, as in the figure above. The pie chart below shows that the actual computation takes about 3% of the total time (6% when the casting operations are ignored). Is that a percentage one can expect for such a small GPU function? Depending on the input size it ranges from around 20% down to just 1% (at 4M elements).
In the figure below I’ve plotted the times for the host ↔ device memory operations and for the actual computation, for input sizes > 2^15. Is it normal that I’m seeing such a ‘big’ difference between host → device and device → host?
Lastly, the code I used to time the individual operations (Linux):
#include <sys/time.h>

struct timeval tv;
double tt_1, tt_2, timer;

gettimeofday(&tv, NULL);
tt_1 = tv.tv_sec + (tv.tv_usec / 1e6);   /* start time in seconds */

/* OPERATION */

gettimeofday(&tv, NULL);
tt_2 = tv.tv_sec + (tv.tv_usec / 1e6);   /* end time in seconds */
timer = tt_2 - tt_1;                     /* elapsed wall-clock time */
I’ve also used cudaThreadSynchronize() after the kernel invocation.
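To make the measurement setup concrete, this is roughly how the timer gets wrapped around each operation (wall_time() is just the gettimeofday snippet above moved into a helper; small_kernel, the buffers and the launch configuration are the placeholders from the kernel sketch, not my real code):

#include <stdio.h>
#include <sys/time.h>
#include <cuda_runtime.h>

/* wall-clock time in seconds, same method as the snippet above */
static double wall_time(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + (tv.tv_usec / 1e6);
}

/* h_in/h_out are the float host buffers, d_in/d_out the device buffers,
   all assumed to be allocated elsewhere */
void measure(const float *h_in, float *h_out, float *d_in, float *d_out,
             float const_a, float const_b, int n)
{
    double t0, t_h2d, t_kernel, t_d2h;

    t0 = wall_time();
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);
    t_h2d = wall_time() - t0;

    t0 = wall_time();
    small_kernel<<<(n + 255) / 256, 256>>>(d_in, d_out, const_a, const_b, n);
    cudaThreadSynchronize();   /* wait for the kernel before stopping the timer */
    t_kernel = wall_time() - t0;

    t0 = wall_time();
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    t_d2h = wall_time() - t0;

    printf("H->D %.6f s, kernel %.6f s, D->H %.6f s\n", t_h2d, t_kernel, t_d2h);
}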
If anyone has made it this far: am I talking nonsense, or are these things explainable?
Thanks in advance!