Measuring speed of a calculation in a single thread

My hardware:
Mac OS X 10.6.6
2.66 GHz Intel Core i7
NVIDIA GeForce GT 330M

So I was asked to measure how long a certain calculation took in a thread on a GPU versus on a CPU. That is, setting aside memory transfer, kernel launching, and all other forms of overhead, how long does it take to execute a certain calculation in a single thread.

So I timed how long it took to do the following instructions on my CPU

for (unsigned int i = 0; i < num_repeat; i++) {
    random = random + i / 100;
}
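For reference, the host-side measurement looks roughly like the sketch below (this is not my attached code; gettimeofday, the example num_repeat value and the volatile accumulator are just illustrative choices to keep the compiler from dropping the loop):

#include <cstdio>
#include <sys/time.h>

int main()
{
    const unsigned int num_repeat = 100000;   // example value
    volatile float random = 0;                // volatile so the optimizer cannot remove the loop

    timeval t0, t1;
    gettimeofday(&t0, NULL);
    for (unsigned int i = 0; i < num_repeat; i++) {
        random = random + i / 100;
    }
    gettimeofday(&t1, NULL);

    double seconds = (t1.tv_sec - t0.tv_sec) + 1e-6 * (t1.tv_usec - t0.tv_usec);
    printf("%u iterations took %lf s\n", num_repeat, seconds);
    return 0;
}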

and then I timed the following kernel, with num_repeat set to 0, 1000, and 100000 respectively:

__global__ void randtime(float *d_rand, unsigned int seed, unsigned int timestep, unsigned int n_particles, unsigned int num_repeat)
{
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;

    if (idx >= n_particles)
        return;

    float random = 0;

    for (unsigned int i = 0; i < num_repeat; i++) {
        random = random + i / 100;
    }

    d_rand[idx] = random;
}

Then I subtracted off the time it took to launch the kernel with num_repeat = 0, to approximately isolate the additional time spent executing the loop num_repeat times. I am assuming the compiler is not removing any instructions just because my calculation is bogus (in fact, in each case I print the result to the screen so that the value is actually used). My assumption is that, as long as I am not saturating the device, the GPU timing should be invariant to how many threads/blocks I launch. So I varied the number of threads between 10 and 100, kept the threads per block at 64, and basically observed this. I also repeat the measurement 10 times to reduce launch-to-launch variability.
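Concretely, the baseline subtraction I describe could be sketched like this (not my attached code; the cudaEvent-based helper below is only an illustration, and it assumes the CUDA context already exists, e.g. from the cudaMalloc of d_rand):

// Time one kernel launch; call it once with num_repeat = 0 and once with num_repeat = N,
// and treat the difference as the cost of the loop alone.
float time_randtime(float *d_rand, int blocks, int threads, unsigned int num_repeat)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    randtime<<<blocks, threads>>>(d_rand, 0, 0, blocks * threads, num_repeat);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / 1000.0f;   // seconds
}

// approximate loop cost: time_randtime(d_rand, blocks, threads, N) - time_randtime(d_rand, blocks, threads, 0)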

I am consistently finding that the GPU takes about 100 times longer than the CPU to do the same calculation!
The CPU takes roughly 0.000003 seconds per loop.
The GPU takes roughly 0.000237 seconds per loop.

This seems quite surprising! I was expecting a single GPU thread to be slower than the CPU, but not by two orders of magnitude. Am I missing something?
I will attach my code.
timing.cu (3.6 KB)

You are testing about the worst case for the GPU: there is no hardware instruction for integer division, so it has to be done in software instead.

If I change the instruction to
random = random + i*100;

(an instruction the GPU does well), then the time becomes
0.000080

Much lower indeed, but still a factor of ~25.

Try using __umul24() as there is no instruction for 32-bit integer multiplication on compute capability 1.x devices either…
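In the loop that would look something like this (just a sketch; the 24-bit multiply is only exact while i stays below 2^24, which doesn't matter for a bogus benchmark):

// on a compute capability 1.x device, __umul24 maps to a single instruction
for (unsigned int i = 0; i < num_repeat; i++) {
    random = random + __umul24(i, 100);
}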

Also keep in mind that, since the GPU runs at about half the frequency of the CPU and has about six times the latency for most instructions, a factor of about 12 (roughly 2 x 6) is expected if the work is not parallelized.

I would say that you are nowhere close to timings where the actual loop is using up most of your time. So you are not measuring the time for the instruction at all.

Take a look at these numbers (num_repeats, nBlocks, nThreads, CPUtime, GPUtime):

1           1  1  0.000001  0.194528
10          1  1  0.000002  0.198620
100         1  1  0.000002  0.198025
1000        1  1  0.000007  0.210949
10000       1  1  0.000050  0.193124
100000      1  1  0.000489  0.201100
1000000     1  1  0.005397  0.260487
10000000    1  1  0.050916  0.861125
100000000   1  1  0.505497  6.837384
1000000000  1  1  5.056377  66.659534

So as you can see, for the CPU you start to measure actual instruction time from num_repeats = 10000 upwards. For the GPU, however, you need at least 100,000,000 repeats before you are measuring instruction time rather than overhead.

At that point you get a factor of ~13, as expected. These numbers come from an i7 920 at 2.66 GHz and a GTX 295.
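If you want to take the constant overhead out of the measurement instead of swamping it, a warm-up launch before the timed one helps; roughly like this (a sketch, using the kernel and variables from the code posted further down):

// warm-up launch: pays the one-time context creation / launch overhead outside the timing
randtime<<<b, t>>>(rand_array, 0);
cudaThreadSynchronize();

// timed launch
clock_gettime(CLOCK_REALTIME, &starttime);
randtime<<<b, t>>>(rand_array, n);
cudaThreadSynchronize();
clock_gettime(CLOCK_REALTIME, &endtime);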

Just as an addition:

100000000  1  1    0.506001  6.967646
100000000  1  32   0.506001  7.133049
100000000  1  64   0.506893  7.138827
100000000  1  96   0.508717  7.480752
100000000  1  128  0.506646  8.014021
100000000  1  256  0.505241  11.119867
100000000  1  512  0.506002  22.080905

This shows you that you are actually measuring a lot of latency. After writing to a register, a thread has to wait a number of cycles before it can use that register again; during that time other warps can be executed. Thus adding more warps does not increase the execution time much (even though all of them run on the same MP) until that latency is hidden, at roughly 96 threads.
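The same read-after-write latency can also be partially hidden inside a single thread by keeping independent accumulators, so consecutive iterations do not depend on each other. Just a sketch (assuming num_repeat is even):

// two independent dependency chains: the second add does not have to wait
// for the first one's result to be written back
float r0 = 0, r1 = 0;
for (unsigned int i = 0; i < num_repeat; i += 2) {
    r0 = r0 + i * 100;
    r1 = r1 + (i + 1) * 100;
}
float random = r0 + r1;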

100000000  1   256  0.505241  11.119867
100000000  2   256  0.509685  11.119019
100000000  4   256  0.510011  11.151734
100000000  8   256  0.508728  11.120358
100000000  16  256  0.508936  11.128884
100000000  30  256  0.512102  11.122294
100000000  31  256  0.508599  22.083798
100000000  60  256  0.508817  23.074155
100000000  61  256  0.509118  32.801482

The GTX 295 has 30 MPs, so for up to 30 blocks each block runs simultaneously on its own MP. If you launch 31 blocks, one MP has to work through two blocks, hence the total time roughly doubles (not fully doubled because of the launch overhead of about 0.2 s). If you increase to 60 blocks, every MP still only has to work through 2 blocks; at 61 blocks you again get an increase in runtime.
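Put differently, the kernel time scales with the number of passes ("waves") each MP has to make over its share of the blocks; roughly (a sketch, with nBlocks being the block count from the tables above):

// expected scaling on a 30-MP card (GTX 295, one of its two GPUs)
int num_mp = 30;
int waves  = (nBlocks + num_mp - 1) / num_mp;   // ceil(nBlocks / num_mp)
// kernel time is roughly waves * (time for one block per MP), plus launch overhead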

Here's the code:

#include <cstdio>
#include <cstdlib>
#include <ctime>

__global__ void randtime(float *d_rand, unsigned int num_repeat)
{
    unsigned int idx = blockIdx.x * blockDim.x + threadIdx.x;

    float random = 0;
    for (unsigned int i = 0; i < num_repeat; i++) {
        random = random + i * 100;
    }
    d_rand[idx] = random;
}

int main(int argc, char *argv[])
{
    timespec starttime;
    timespec endtime;
    int n = 1;
    int t = 1;
    int b = 1;
    float random = 0;   // initialize, otherwise the CPU loop accumulates onto garbage

    if (argc > 1) n = atoi(argv[1]);
    if (argc > 2) b = atoi(argv[2]);
    if (argc > 3) t = atoi(argv[3]);

    // time the CPU loop
    clock_gettime(CLOCK_REALTIME, &starttime);
    for (int i = 0; i < n; i++) {
        random = random + i * 100;
    }
    clock_gettime(CLOCK_REALTIME, &endtime);
    double cputime =
        endtime.tv_sec - starttime.tv_sec + 1.0 * (endtime.tv_nsec - starttime.tv_nsec) / 1000000000;

    float *rand_array;
    cudaMalloc(&rand_array, sizeof(float) * b * t);

    // launch and wait for the kernel; note that gputime is taken from starttime,
    // so it also contains the CPU loop, the cudaMalloc and the CUDA context creation
    randtime<<<b, t>>>(rand_array, n);
    cudaThreadSynchronize();
    clock_gettime(CLOCK_REALTIME, &endtime);
    double gputime =
        endtime.tv_sec - starttime.tv_sec + 1.0 * (endtime.tv_nsec - starttime.tv_nsec) / 1000000000;

    printf("%i %i %i %lf %lf\n", n, b, t, cputime, gputime);

    cudaFree(rand_array);
    return 0;
}
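Built and run roughly like this (the file name and nvcc flags are just examples; sm_13 matches the GTX 295, and on older glibc clock_gettime needs -lrt):

nvcc -arch=sm_13 timing_test.cu -o timing_test -lrt
./timing_test 100000000 1 256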

Ceearem

Btw: the saturated timing matches the actual frequency of the GPU quite well (1.38 GHz). Assuming that 8 threads are executed in parallel on each MP, you end up at about 4.8 cycles for the multiply-add measured here (256 threads * 1e8 iterations / 8 SPs = 3.2e9 iterations per SP; 11.12 s * 1.38 GHz = 1.53e10 cycles; 1.53e10 / 3.2e9 is about 4.8 cycles per iteration). And if I remember correctly, 4 cycles was the number stated in the Programming Guide.

Ceearem

Thanks for this discussion. This was quite useful.