single thread performance

Apologies for my poor English first.

I wrote a test application to measure single-thread CUDA performance and compared it with the CPU. The results are below. They show the GPU is very slow compared with the CPU when both use only a single thread. Are these results correct?

CPU: Pentium D 2.8 GHz x2 (only one core used)
GPU: GeForce 9600 GT
All code compiled with VS2005.
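For reference, a minimal harness for this kind of single-thread measurement might look like the sketch below. The original test code was not posted, so the kernel name, the `cudaEvent` timing, and the output write are my assumptions:

```cuda
#include <cstdio>

// Single-thread kernel: one block, one thread runs the whole unrolled loop.
// (Sketch only; the original benchmark code is not shown in the post.)
__global__ void addLoop(int *out)
{
    int j, k = 0;
    for (j = 0; j < 1000000;)
    {
        k += j++; k += j++; k += j++; k += j++; k += j++;
        k += j++; k += j++; k += j++; k += j++; k += j++;
    }
    *out = k;  // write the result so the compiler cannot drop the loop
}

int main()
{
    int *d_out;
    cudaMalloc((void**)&d_out, sizeof(int));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    addLoop<<<1, 1>>>(d_out);   // a single thread in a single block
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("GPU: %.1f ms\n", ms);

    cudaFree(d_out);
    return 0;
}
```

Writing `k` out at the end matters on both sides: without a side effect, an optimizing compiler may remove the loop entirely, which is likely why the CPU Release builds below report 0 ms.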

    int j, k = 0;

    for(j = 0; j < 1000000;)
    {
        k += j++;
        k += j++;
        k += j++;
        k += j++;
        k += j++;
        k += j++;
        k += j++;
        k += j++;
        k += j++;
        k += j++;
    }
    

CPU(Debug): 0ms, CPU(Release): 0ms, GPU: 25.8ms

    int j, k = 0;

    for(j = 0; j < 10000000;)
    {
        k += j++;
        k += j++;
        k += j++;
        k += j++;
        k += j++;
        k += j++;
        k += j++;
        k += j++;
        k += j++;
        k += j++;
    }
    

CPU(Debug): 31ms, CPU(Release): 0ms, GPU: 256ms

    int j, k = 0;

    for(j = 1; j < 1000000;)
    {
        k *= j++;
        k *= j++;
        k *= j++;
        k *= j++;
        k *= j++;
        k *= j++;
        k *= j++;
        k *= j++;
        k *= j++;
        k *= j++;
    }
    

CPU(Debug): 0ms, CPU(Release): 0ms, GPU: 4.8ms

    int j, k = 0;

    for(j = 1; j < 10000000;)
    {
        k *= j++;
        k *= j++;
        k *= j++;
        k *= j++;
        k *= j++;
        k *= j++;
        k *= j++;
        k *= j++;
        k *= j++;
        k *= j++;
    }
    

CPU(Debug): 62ms, CPU(Release): 31ms, GPU: 45.6ms

    int j, k = 0;

    for(j = 1; j < 100000000;)
    {
        k *= j++;
        k *= j++;
        k *= j++;
        k *= j++;
        k *= j++;
        k *= j++;
        k *= j++;
        k *= j++;
        k *= j++;
        k *= j++;
    }
    

CPU(Debug): 545ms, CPU(Release): 374ms, GPU: 451.7ms

    int j, k = 0;

    for(j = 1; j < 10000000; j++)
    {
        k = j;
        k >>= 1;
        k >>= 2;
        k >>= 3;
        k >>= 4;
        k >>= 5;
        k >>= 6;
        k >>= 7;
        k >>= 8;
    }
    

CPU(Debug): 124ms, CPU(Release): 0ms, GPU: 503ms

This is completely expected.

Per warp you have 1 thread doing work and 31 inactive threads idling. A warp takes 4 clock cycles or more to execute an arithmetic instruction, since only 8 threads are processed per clock cycle. So at most one instruction of your thread gets executed every 4 clock cycles.

Single-precision float arithmetic should be faster than 32-bit integer arithmetic, if I am not mistaken by a factor of 4. All of your code works on 32-bit integers; try switching to floats and repeating the experiment.
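A float variant of the first test might look like this sketch (my assumption, not code from the thread; note that the int-to-float conversions add some overhead of their own):

```cuda
// Float variant of the add loop (sketch): same unrolled structure,
// but accumulating in single precision.
__global__ void addLoopFloat(float *out)
{
    float k = 0.0f;
    for (int j = 0; j < 1000000;)
    {
        // each += implicitly converts the int counter to float
        k += j++; k += j++; k += j++; k += j++; k += j++;
        k += j++; k += j++; k += j++; k += j++; k += j++;
    }
    *out = k;  // keep a side effect so the loop is not optimized away
}
```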

By explicitly using threads #0, #7, #15 and #23 of one warp, you would execute one arithmetic instruction per clock cycle, which makes it more comparable to a single-threaded CPU with a filled instruction pipeline. But to keep things fair, each thread would have to loop over only 1/4 of the iterations (as there are now 4 threads doing things more or less concurrently).
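In code, that might look something like the following sketch (the thread indices and the 1/4 split are from the suggestion above; the kernel itself and the `atomicAdd` reduction are my assumptions, and global int atomics need compute capability 1.1 or later):

```cuda
// One block of 32 threads (one warp); only threads 0, 7, 15 and 23 do work,
// each covering a quarter of the 10,000,000 iterations.
__global__ void addLoopQuarterWarp(int *out)
{
    int t = threadIdx.x;
    if (t == 0 || t == 7 || t == 15 || t == 23)
    {
        // map the four active threads to slices 0..3 of the range
        int lane  = (t == 0) ? 0 : (t == 7) ? 1 : (t == 15) ? 2 : 3;
        int begin = lane * 2500000;
        int end   = begin + 2500000;
        int k = 0;
        for (int j = begin; j < end;)
        {
            k += j++; k += j++; k += j++; k += j++; k += j++;
            k += j++; k += j++; k += j++; k += j++; k += j++;
        }
        atomicAdd(out, k);  // combine the four partial sums
    }
}
// launched as: addLoopQuarterWarp<<<1, 32>>>(d_out);
```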

Christian

Yes. Each of the GPU’s SPs is relatively slow. The power of the GPU comes from keeping 10,000+ threads running concurrently. Constantly swapping between threads (with no context-switch overhead) hides global memory latency, register read-after-write dependencies, and the like. The GPU is thus able to run a great many threads in essentially the same time a single thread would take.
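For example, spreading the same summation over many threads is commonly written as a grid-stride loop (a sketch under my own naming, not code from this thread):

```cuda
// Many threads, each summing a strided slice of the range.
// With thousands of threads resident, the scheduler hides latency
// by switching warps at no cost.
__global__ void addLoopParallel(int *out, int n)
{
    int k = 0;
    // grid-stride loop: thread i handles i, i + stride, i + 2*stride, ...
    for (int j = blockIdx.x * blockDim.x + threadIdx.x;
         j < n;
         j += gridDim.x * blockDim.x)
    {
        k += j;
    }
    atomicAdd(out, k);  // combine per-thread partial sums
}
// e.g. addLoopParallel<<<128, 256>>>(d_out, 10000000);
```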