Transpose demo: GPU vs CPU

Hi everyone,

I modified cuda/projects/transpose.cu, the transpose demo from the CUDA SDK, to use gettimeofday() to measure the speed of transposing a matrix on the GPU.

code:

   ...

    // execute the kernel
    // (gettimeofday() needs <sys/time.h>; tv_sec/tv_usec are longs, hence %ld)
    timeval start;
    gettimeofday(&start, NULL);
    printf("gpu start: %ld-%ld\n", start.tv_sec, start.tv_usec);

    transpose<<< grid, threads >>>(d_odata, d_idata, size_x, size_y);

    // check if kernel execution generated an error
    CUT_CHECK_ERROR("Kernel execution failed");

    // copy result from device to host (this call waits for the kernel to finish)
    float* h_odata = (float*) malloc(mem_size);
    CUDA_SAFE_CALL( cudaMemcpy( h_odata, d_odata, mem_size,
                                cudaMemcpyDeviceToHost) );

    timeval end;
    gettimeofday(&end, NULL);
    printf("gpu   end: %ld-%ld\n", end.tv_sec, end.tv_usec);

    // elapsed seconds; the microsecond difference may be negative, which this handles correctly
    float val = (float)(end.tv_sec - start.tv_sec)
              + (float)(end.tv_usec - start.tv_usec) / 1000000.0f;
    printf("gpu run time:%0.6f\n", val);

    gettimeofday(&start, NULL);
    printf("\ncpu start: %ld-%ld\n", start.tv_sec, start.tv_usec);

    computeGold( reference, h_idata, size_x, size_y);

    gettimeofday(&end, NULL);
    printf("cpu   end: %ld-%ld\n", end.tv_sec, end.tv_usec);

    val = (float)(end.tv_sec - start.tv_sec)
        + (float)(end.tv_usec - start.tv_usec) / 1000000.0f;
    printf("cpu run time:%0.6f\n", val);

    ...

result:

$ transpose
Transposing a 256 by 4096 matrix of floats...
gpu start: 1186558862-22570
gpu   end: 1186558862-51328
gpu run time:0.028758

cpu start: 1186558862-51388
cpu   end: 1186558862-63622
cpu run time:0.012234
Test PASSED

Press ENTER to exit...

Question:

Why does the GPU take more time than the CPU? How can I use the GPU in a real-time system?

The transpose example shows how you can move data around, but it's not exactly a good example of how to use the full power of a GPU. Launching a kernel on the GPU requires compiling the shader, starting up the GPU, executing the kernel, and then copying the data back from the GPU to the CPU. That is a lot of overhead for just reading and writing a bunch of data.

The last copy from GPU to CPU, by itself, is probably costlier than doing the transpose on the CPU!
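You can see the split by timing the kernel and the readback separately with CUDA event timers. Here is a minimal sketch (not from the SDK sample itself), assuming the same d_idata, d_odata, grid, threads, mem_size, and h_odata as in the code above:

    // Sketch: time the transpose kernel and the device-to-host copy separately.
    cudaEvent_t evStart, evKernel, evCopy;
    cudaEventCreate(&evStart);
    cudaEventCreate(&evKernel);
    cudaEventCreate(&evCopy);

    cudaEventRecord(evStart, 0);
    transpose<<< grid, threads >>>(d_odata, d_idata, size_x, size_y);
    cudaEventRecord(evKernel, 0);                        // marks end of the kernel
    cudaMemcpy(h_odata, d_odata, mem_size, cudaMemcpyDeviceToHost);
    cudaEventRecord(evCopy, 0);                          // marks end of the readback
    cudaEventSynchronize(evCopy);

    float kernelMs = 0.0f, copyMs = 0.0f;
    cudaEventElapsedTime(&kernelMs, evStart, evKernel);  // kernel time in ms
    cudaEventElapsedTime(&copyMs, evKernel, evCopy);     // device-to-host copy time in ms
    printf("kernel: %.3f ms, readback: %.3f ms\n", kernelMs, copyMs);

    cudaEventDestroy(evStart);
    cudaEventDestroy(evKernel);
    cudaEventDestroy(evCopy);

With a matrix this small, the copy time will usually dwarf the kernel time.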

If you want to compare performance between the CPU and the GPU, you should look at algorithms that require a lot of floating-point calculation (e.g., the BlackScholes example).

Tom

Thanks for your reply. You are right, the copy from GPU to CPU takes most of the time.

I wanted to use the GPU for codec processing in a real-time system, but now I think it is unsuitable. The GPU has highly parallel computation capability and can do complex computation by itself, but readback is too slow.

It all depends on what the ratio of computation to I/O time is. You can significantly improve host-GPU transfer speed by using page-locked memory, in some cases by 2-3x; see the NVIDIA sample codes for examples.
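For what it's worth, here is a minimal sketch of the page-locked variant (not taken from the sample codes), assuming the same mem_size and d_odata as in the snippet above: the host result buffer is allocated with cudaMallocHost() instead of malloc(), so the device-to-host copy goes through pinned memory.

    // Sketch: page-locked (pinned) host buffer for the readback.
    float* h_odata = NULL;
    CUDA_SAFE_CALL( cudaMallocHost((void**)&h_odata, mem_size) );   // pinned allocation

    // same copy as before, now into pinned memory so the driver can DMA directly
    CUDA_SAFE_CALL( cudaMemcpy(h_odata, d_odata, mem_size,
                               cudaMemcpyDeviceToHost) );

    // ... use h_odata ...

    CUDA_SAFE_CALL( cudaFreeHost(h_odata) );   // pinned buffers are freed with cudaFreeHost(), not free()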

John Stone