single thread performance

Apologies for my poor English first.

I wrote a test application to measure single-thread CUDA performance and compared it with the CPU. The results are below. They show the GPU is very slow compared with the CPU when both use only a single thread. Are these results correct?

CPU: Pentium D 2.8 GHz x2 (only one core used)
GPU: GeForce 9600 GT
All code compiled with VS2005.
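For reference, a minimal harness for this kind of single-thread measurement might look like the sketch below. The original test code was not posted, so the kernel name, the `cudaEvent` timing, and the output write are my assumptions:

```cuda
#include <cstdio>

// Single-thread kernel: one block, one thread runs the whole unrolled loop.
// (Sketch only; the original benchmark code is not shown in the post.)
__global__ void addLoop(int *out)
{
    int j, k = 0;
    for (j = 0; j < 1000000;)
    {
        k += j++; k += j++; k += j++; k += j++; k += j++;
        k += j++; k += j++; k += j++; k += j++; k += j++;
    }
    *out = k;  // write the result so the compiler cannot drop the loop
}

int main()
{
    int *d_out;
    cudaMalloc((void**)&d_out, sizeof(int));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    addLoop<<<1, 1>>>(d_out);   // a single thread in a single block
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("GPU: %.1f ms\n", ms);

    cudaFree(d_out);
    return 0;
}
```

Writing `k` out at the end matters on both sides: without a side effect, an optimizing compiler may remove the loop entirely, which is likely why the CPU Release builds below report 0 ms.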

    int j, k = 0;

    for(j = 0; j < 1000000;)
    {
        k += j++;
        k += j++;
        k += j++;
        k += j++;
        k += j++;
        k += j++;
        k += j++;
        k += j++;
        k += j++;
        k += j++;
    }
    

CPU(Debug): 0ms, CPU(Release): 0ms, GPU: 25.8ms

    int j, k = 0;

    for(j = 0; j < 10000000;)
    {
        k += j++;
        k += j++;
        k += j++;
        k += j++;
        k += j++;
        k += j++;
        k += j++;
        k += j++;
        k += j++;
        k += j++;
    }
    

CPU(Debug): 31ms, CPU(Release): 0ms, GPU: 256ms

    int j, k = 0;

    for(j = 1; j < 1000000;)
    {
        k *= j++;
        k *= j++;
        k *= j++;
        k *= j++;
        k *= j++;
        k *= j++;
        k *= j++;
        k *= j++;
        k *= j++;
        k *= j++;
    }
    

CPU(Debug): 0ms, CPU(Release): 0ms, GPU: 4.8ms

    int j, k = 0;

    for(j = 1; j < 10000000;)
    {
        k *= j++;
        k *= j++;
        k *= j++;
        k *= j++;
        k *= j++;
        k *= j++;
        k *= j++;
        k *= j++;
        k *= j++;
        k *= j++;
    }
    

CPU(Debug): 62ms, CPU(Release): 31ms, GPU: 45.6ms

    int j, k = 0;

    for(j = 1; j < 100000000;)
    {
        k *= j++;
        k *= j++;
        k *= j++;
        k *= j++;
        k *= j++;
        k *= j++;
        k *= j++;
        k *= j++;
        k *= j++;
        k *= j++;
    }
    

CPU(Debug): 545ms, CPU(Release): 374ms, GPU: 451.7ms

    int j, k = 0;

    for(j = 1; j < 10000000; j++)
    {
        k = j;
        k >>= 1;
        k >>= 2;
        k >>= 3;
        k >>= 4;
        k >>= 5;
        k >>= 6;
        k >>= 7;
        k >>= 8;
    }
    

CPU(Debug): 124ms, CPU(Release): 0ms, GPU: 503ms

This is completely expected.

Per warp you have 1 thread doing work and 31 inactive threads idling. A warp takes 4 clock cycles or more to execute an arithmetic instruction, since only 8 threads are processed per clock cycle. So at most one instruction of your thread gets executed every 4 clock cycles.

Single-precision float arithmetic should be faster than 32-bit integer arithmetic, if I am not mistaken by a factor of 4. All of your code works on 32-bit integers; try switching to floats and repeating the experiment.
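A float variant of the first test might look like this sketch (my assumption, not code from the thread; note that the int-to-float conversions add some overhead of their own):

```cuda
// Float variant of the add loop (sketch): same unrolled structure,
// but accumulating in single precision.
__global__ void addLoopFloat(float *out)
{
    float k = 0.0f;
    for (int j = 0; j < 1000000;)
    {
        // each += implicitly converts the int counter to float
        k += j++; k += j++; k += j++; k += j++; k += j++;
        k += j++; k += j++; k += j++; k += j++; k += j++;
    }
    *out = k;  // keep a side effect so the loop is not optimized away
}
```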

By explicitly using threads #0, #7, #15 and #23 of one warp, you would execute one arithmetic instruction per clock cycle, which makes it more comparable to a single-threaded CPU with a filled instruction pipeline. But to keep things fair, each thread would have to loop over only 1/4 of the iterations (as there are now 4 threads doing things more or less concurrently).
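In code, that might look something like the following sketch (the thread indices and the 1/4 split are from the suggestion above; the kernel itself and the `atomicAdd` reduction are my assumptions, and global int atomics need compute capability 1.1 or later):

```cuda
// One block of 32 threads (one warp); only threads 0, 7, 15 and 23 do work,
// each covering a quarter of the 10,000,000 iterations.
__global__ void addLoopQuarterWarp(int *out)
{
    int t = threadIdx.x;
    if (t == 0 || t == 7 || t == 15 || t == 23)
    {
        // map the four active threads to slices 0..3 of the range
        int lane  = (t == 0) ? 0 : (t == 7) ? 1 : (t == 15) ? 2 : 3;
        int begin = lane * 2500000;
        int end   = begin + 2500000;
        int k = 0;
        for (int j = begin; j < end;)
        {
            k += j++; k += j++; k += j++; k += j++; k += j++;
            k += j++; k += j++; k += j++; k += j++; k += j++;
        }
        atomicAdd(out, k);  // combine the four partial sums
    }
}
// launched as: addLoopQuarterWarp<<<1, 32>>>(d_out);
```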

Christian

Yes. Each of the GPU’s SPs is relatively slow. The power of the GPU comes from keeping 10,000+ threads running concurrently. Constantly swapping between threads (with no context-switch overhead) hides global memory latency, register read-after-write dependencies, and the like. The GPU is thus able to run a great many threads in essentially the same time a single thread would take.
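For example, spreading the same summation over many threads is commonly written as a grid-stride loop (a sketch under my own naming, not code from this thread):

```cuda
// Many threads, each summing a strided slice of the range.
// With thousands of threads resident, the scheduler hides latency
// by switching warps at no cost.
__global__ void addLoopParallel(int *out, int n)
{
    int k = 0;
    // grid-stride loop: thread i handles i, i + stride, i + 2*stride, ...
    for (int j = blockIdx.x * blockDim.x + threadIdx.x;
         j < n;
         j += gridDim.x * blockDim.x)
    {
        k += j;
    }
    atomicAdd(out, k);  // combine per-thread partial sums
}
// e.g. addLoopParallel<<<128, 256>>>(d_out, 10000000);
```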