My Hardware:
Mac OS X 10.6.6
2.66 GHz Intel Core i7
NVIDIA GeForce GT 330M
So I was asked to measure how long a certain calculation takes in a single thread on a GPU versus on a CPU. That is, setting aside memory transfers, kernel launches, and all other forms of overhead, how long does it take to execute a certain calculation in a single thread?
So I timed how long it took to run the following loop on my CPU:
float random = 0;
for (unsigned int i = 0; i < num_repeat; i++) {
    random = random + i/100;   // note: i/100 is integer division
}
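For reference, the CPU-side timing is along these lines (a minimal sketch using gettimeofday; the num_repeat value here is just an example, and my actual harness is in the attached timing.cu):
#include <stdio.h>
#include <sys/time.h>

int main(void)
{
    unsigned int num_repeat = 100000;   /* example value for this sketch */
    struct timeval start, stop;

    gettimeofday(&start, NULL);
    float random = 0;
    for (unsigned int i = 0; i < num_repeat; i++) {
        random = random + i/100;        /* i/100 is integer division */
    }
    gettimeofday(&stop, NULL);

    double elapsed = (stop.tv_sec - start.tv_sec)
                   + (stop.tv_usec - start.tv_usec) / 1.0e6;
    /* print the result so the loop cannot be optimized away */
    printf("random = %f, elapsed = %f s\n", random, elapsed);
    return 0;
}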
and then I timed the following kernel with num_repeat set to 0, 1000, and 100000 respectively:
__global__ void randtime(float *d_rand, unsigned int seed, unsigned int timestep, unsigned int n_particles, unsigned int num_repeat)
{
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n_particles)
        return;
    float random = 0;
    for (unsigned int i = 0; i < num_repeat; i++) {
        random = random + i/100;   // same bogus calculation as the CPU version
    }
    d_rand[idx] = random;          // store the result so the loop is not optimized away
}
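For reference, one way to time just the kernel is with CUDA events, roughly like this (a sketch; the particular values of n_particles, seed, timestep, and num_repeat below are placeholders, and my actual harness is in the attached timing.cu):
#include <cstdio>

int main()
{
    // Placeholder values for this sketch
    unsigned int n_particles = 100;        // I varied this between 10 and 100
    unsigned int threads_per_block = 64;
    unsigned int n_blocks = (n_particles + threads_per_block - 1) / threads_per_block;
    unsigned int seed = 1234, timestep = 0, num_repeat = 100000;

    float *d_rand;
    cudaMalloc(&d_rand, n_particles * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    randtime<<<n_blocks, threads_per_block>>>(d_rand, seed, timestep, n_particles, num_repeat);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);             // block until the kernel has finished

    float ms = 0;
    cudaEventElapsedTime(&ms, start, stop); // kernel time in milliseconds
    printf("kernel time: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_rand);
    return 0;
}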
Then I subtracted off the time it took to launch the kernel with num_repeat = 0, to approximately extract just the additional time spent executing the loop num_repeat times. I am assuming that the compiler is not removing any instructions even though my calculation is bogus (in fact, in each case I print the resulting value to the screen so that it is actually used).

My other assumption is that, as long as I am not saturating the device, the GPU timing should be invariant to how many threads/blocks I launch. So I varied the total number of threads between 10 and 100, kept the threads per block at 64, and basically observed this. I also repeated the measurement 10 times to average out variability in the launch.
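Concretely, the subtraction looks like this (a sketch; the variable names are illustrative, not from timing.cu, and the per-iteration normalization is just one way to report the result):
/* t_full: measured kernel time with num_repeat = n (e.g. 1000 or 100000)
   t_zero: measured kernel time with num_repeat = 0 (launch-overhead baseline) */
float loop_cost(float t_full, float t_zero, unsigned int n)
{
    float loop_time = t_full - t_zero;   /* time attributable to the loop itself */
    return loop_time / (float)n;         /* normalize per loop iteration */
}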
I am consistently finding that the GPU takes about 100 times longer than the CPU to do the same calculation!
The CPU takes roughly 0.000003 seconds per loop.
The GPU takes roughly 0.000237 seconds per loop.
That ratio (0.000237 / 0.000003 ≈ 79) is nearly two orders of magnitude. This seems quite surprising! I was expecting a single GPU thread to be slower than a CPU thread, but not by that much. Am I missing something?
I will attach my code.
timing.cu (3.6 KB)