transpose demo: gpu vs cpu

Hi everyone,

I modified the demo under cuda/projects/ in the CUDA SDK to use gettimeofday() to measure the speed of transposing a matrix on the GPU.



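    // execute the kernel; timing uses gettimeofday() from <sys/time.h>
    timeval start;
    gettimeofday(&start, NULL);
    printf("gpu start: %ld-%ld\n", (long)start.tv_sec, (long)start.tv_usec);

    transpose<<< grid, threads >>>(d_odata, d_idata, size_x, size_y);

    // check if kernel execution generated an error
    CUT_CHECK_ERROR("Kernel execution failed");

    // copy result from device to host (this copy is inside the timed region)
    float* h_odata = (float*) malloc(mem_size);
    CUDA_SAFE_CALL( cudaMemcpy( h_odata, d_odata, mem_size,
                                cudaMemcpyDeviceToHost) );

    timeval end;
    gettimeofday(&end, NULL);
    printf("gpu   end: %ld-%ld\n", (long)end.tv_sec, (long)end.tv_usec);

    // elapsed time in seconds
    float val;
    val = (end.tv_sec - start.tv_sec) + (end.tv_usec - start.tv_usec) / 1000000.0f;
    printf("gpu run time:%0.6f\n",val);

    // time the CPU reference implementation the same way
    gettimeofday(&start, NULL);
    printf("\ncpu start: %ld-%ld\n", (long)start.tv_sec, (long)start.tv_usec);

    computeGold( reference, h_idata, size_x, size_y);

    gettimeofday(&end, NULL);
    printf("cpu   end: %ld-%ld\n", (long)end.tv_sec, (long)end.tv_usec);

    val = (end.tv_sec - start.tv_sec) + (end.tv_usec - start.tv_usec) / 1000000.0f;
    printf("cpu run time:%0.6f\n",val);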



$ transpose

Transposing a 256 by 4096 matrix of floats…

gpu start: 1186558862-22570

gpu end: 1186558862-51328

gpu run time:0.028758

cpu start: 1186558862-51388

cpu end: 1186558862-63622

cpu run time:0.012234


Press ENTER to exit…


Why does the GPU take more time than the CPU? How can I use the GPU in a real-time system?

The transpose example shows how you can move data around, but it's not exactly a good example of how to use the full power of a GPU. Launching a kernel on the GPU requires compilation of the shader, starting up the GPU, executing the kernel, and then copying the data back from the GPU to the CPU. That is a lot of overhead for just reading and writing a bunch of data.

The last copy from GPU to CPU, by itself, is probably costlier than doing the transpose on the CPU!

If you want to compare performance between the CPU and the GPU, you should look at algorithms that require a lot of floating-point calculations (e.g. the BlackScholes example).
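To see how the GPU time in the post splits between the kernel and the readback, the two can be timed separately with CUDA events. The following is a minimal sketch, not the SDK sample: the kernel `transposeNaive` and all variable names are made up for illustration, it uses current CUDA runtime calls instead of the cutil macros from the post, and error checking is omitted for brevity. A warm-up launch is included so that one-time startup cost is not counted.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Naive transpose written for this sketch (hypothetical, not the SDK kernel).
// `in` has `height` rows of `width` floats; `out` has `width` rows of `height`.
__global__ void transposeNaive(float* out, const float* in, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        out[x * height + y] = in[y * width + x];
}

int main()
{
    const int width = 256, height = 4096;   // same matrix shape as in the post
    const size_t bytes = (size_t)width * height * sizeof(float);

    float* h_in  = (float*)malloc(bytes);
    float* h_out = (float*)malloc(bytes);
    for (int i = 0; i < width * height; ++i) h_in[i] = (float)i;

    float *d_in = NULL, *d_out = NULL;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

    dim3 threads(16, 16);
    dim3 grid((width + threads.x - 1) / threads.x,
              (height + threads.y - 1) / threads.y);

    // Warm-up launch: keeps one-time initialization out of the measurement.
    transposeNaive<<<grid, threads>>>(d_out, d_in, width, height);
    cudaDeviceSynchronize();

    cudaEvent_t t0, t1, t2;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    cudaEventCreate(&t2);

    cudaEventRecord(t0);                                      // before the kernel
    transposeNaive<<<grid, threads>>>(d_out, d_in, width, height);
    cudaEventRecord(t1);                                      // after the kernel
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(t2);                                      // after the readback
    cudaEventSynchronize(t2);

    float kernelMs = 0.0f, copyMs = 0.0f;
    cudaEventElapsedTime(&kernelMs, t0, t1);
    cudaEventElapsedTime(&copyMs, t1, t2);
    printf("kernel: %.3f ms, device-to-host copy: %.3f ms\n", kernelMs, copyMs);

    cudaEventDestroy(t0);
    cudaEventDestroy(t1);
    cudaEventDestroy(t2);
    cudaFree(d_in);
    cudaFree(d_out);
    free(h_in);
    free(h_out);
    return 0;
}
```

The actual numbers depend on the GPU, the bus, and the driver, so none are quoted here.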


Thanks for your reply. You are right, the copy from GPU to CPU takes most of the time.

I wanted to use the GPU for codec conversion in a real-time system, but now I think it is unsuitable. The GPU has highly parallel computation capability and can do complex computation, but readback is too slow.

It all depends on the ratio of computation time to I/O time. You can significantly improve the host-GPU transfer speed by using page-locked memory, in some cases by 2-3x; see the NVIDIA sample code for examples.

John Stone
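
As an illustration of the page-locked memory suggestion, here is a minimal sketch that times the same device-to-host copy into an ordinary `malloc` buffer and into a buffer allocated with `cudaMallocHost`. The helper `timeCopyMs` and the variable names are hypothetical, error checking is omitted, and any speedup you observe (if any) depends on your hardware and driver.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: time one device-to-host copy, in milliseconds.
static float timeCopyMs(void* dst, const void* src, size_t bytes)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToHost);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    // Same amount of data as the 256 by 4096 float matrix in the post.
    const size_t bytes = (size_t)256 * 4096 * sizeof(float);

    float* d_data = NULL;
    cudaMalloc(&d_data, bytes);
    cudaMemset(d_data, 0, bytes);

    // Ordinary pageable host memory.
    float* h_pageable = (float*)malloc(bytes);

    // Page-locked (pinned) host memory.
    float* h_pinned = NULL;
    cudaMallocHost((void**)&h_pinned, bytes);

    // Untimed warm-up copies so startup cost does not skew the comparison.
    timeCopyMs(h_pageable, d_data, bytes);
    timeCopyMs(h_pinned, d_data, bytes);

    printf("pageable copy: %.3f ms\n", timeCopyMs(h_pageable, d_data, bytes));
    printf("pinned copy:   %.3f ms\n", timeCopyMs(h_pinned, d_data, bytes));

    cudaFreeHost(h_pinned);   // pinned memory must be released with cudaFreeHost
    free(h_pageable);
    cudaFree(d_data);
    return 0;
}
```

Page-locked memory is a limited resource, so it is usually reserved for the buffers that actually take part in transfers.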