[Beginner]: CUDA slower than serial implementation for a fill operation on an entire image

Hi Everybody,

I am new to CUDA and just started experimenting with it. Using OpenCV, I tried to implement a benchmark that measures the time needed to fill a 5 MP image with a constant value. I am compiling with VS 2010 for x64, using a GTX 560 Ti.

My “Release” timing results are:

CPU Set: 3.145791 (ms)

GPU Set: 31.395227 (ms)

Since the GPU version is so slow, I must have forgotten something important. I would appreciate it if someone could point it out to me.

Regards!

Code (file:main.cpp):

extern "C" void cuSetValue(byte*, int, int, size_t, byte);

int main(){

	StopWatch myTimer;

	unsigned int timer;

const int FILL_VALUE = 128;

	try {	

		cv::Mat image  = cv::imread("a1_Referenz.png",0);

// Serial version using openCV

		myTimer.startTimer();

			image = FILL_VALUE;

		myTimer.stopTimer();

		printf( "CPU Set: %f (ms)\n", myTimer.getElapsedTime());	

// CUDA Version, first call is somehow slow...

		cuSetValue(image.data, image.rows, image.cols, image.total()*image.elemSize(), FILL_VALUE);

// hence time second call...

		CUT_SAFE_CALL(cutCreateTimer(&timer));

		CUT_SAFE_CALL(cutStartTimer(timer));

			cuSetValue(image.data, image.rows, image.cols, image.total()*image.elemSize(),  FILL_VALUE);

		CUT_SAFE_CALL(cutStopTimer(timer));

		printf( "GPU Set: %f (ms)\n", cutGetTimerValue(timer));

                CUT_SAFE_CALL(cutDeleteTimer(timer));

	} catch(const cv::Exception& e) {

		printf("%s", e.what());

	}

	while (1) if ('\n' == getchar()) break;

	return 0;

 }

(file: cuSetValue.cu)

#include <cutil_inline.h>   // cutilSafeCall

__global__ void myKernel( unsigned char *ptr, unsigned char value ) {

	// map from threadIdx/BlockIdx to pixel position

	int x = blockIdx.x;

	int y = blockIdx.y;

	int offset = x + y * gridDim.x;

	ptr[offset] = value;

}

extern "C" void cuSetValue(unsigned char* image, int width, int height, size_t byteCount, unsigned char value) {

	unsigned char* pImgD;

	cutilSafeCall( cudaMalloc(  (void**)&pImgD, width*height ) );

	dim3 grid(height,width);

	myKernel<<< grid,1 >>>( pImgD, value );

	cutilSafeCall( cudaMemcpy( image, pImgD, byteCount, cudaMemcpyDeviceToHost ) );

	cudaFree( pImgD );

}


You need to use more than one thread per block. Try different multiples of 64 and see which is fastest.
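For example, a rough sketch (not tested, block size just illustrative) that treats the image as a flat byte array and reuses your pImgD, byteCount and value:

__global__ void fillKernel(unsigned char* ptr, unsigned char value, int n)
{
	int idx = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per byte
	if (idx < n)
		ptr[idx] = value;
}

// launch with, e.g., 256 threads per block (try 64, 128, 256, ...)
int threads = 256;
int blocks  = ((int)byteCount + threads - 1) / threads;  // round up
fillKernel<<< blocks, threads >>>( pImgD, value, (int)byteCount );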

Hi,

thanks for the reply, it helped a lot. I changed the kernel-calling function to:

extern "C" void cuSetValue(unsigned char* image, int width, int height, size_t byteCount, unsigned char value) {

	unsigned char* pImg_d;

	cutilSafeCall( cudaMalloc(  (void**)&pImg_d, width*height ) );

	dim3 threadsPerBlock(32, 32);  // 1024 threads/block

	dim3 numBlocks(width/threadsPerBlock.x, height/threadsPerBlock.y); 

	myKernel<<< numBlocks,threadsPerBlock >>>( pImg_d, value );

	cutilSafeCall( cudaMemcpy( image, pImg_d, byteCount, cudaMemcpyDeviceToHost ) );

	cudaFree( pImg_d );

}

which in fact gives me a better result:

However, I am somewhat disappointed, since I expected the speedup to be higher!? In fact, when I benchmark the OpenCV threshold function (CPU version), it needs only 1.1 ms to process the entire image (the image is 2432x2176), even though it requires more operations than my simple set operation. I guess I am still missing some important points about using CUDA?

Writes of individual chars don't coalesce. This is why you are not seeing much of an improvement.

You need to switch to 32-bit integer accesses into the bitmap for best speed.

Learn about the coalescing requirements from the CUDA programming guide. A speedup of about an order of magnitude may be possible.

I suppose you changed the kernel function as well, so that it computes one pixel per thread rather than one per block? We would need to see how you order your threads within a 2D thread block, because the memory write order is important: you want threads 1-16 to write to contiguous memory locations.
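Roughly, the idea is something like this (an untested sketch, reusing your pImg_d, byteCount and value; the launch numbers are placeholders). The image stays a byte buffer, but each thread writes one 32-bit word, and consecutive threads write consecutive words, so the stores coalesce:

__global__ void fillKernel32(unsigned int* ptr, unsigned int pattern, int nWords)
{
	int idx = blockIdx.x * blockDim.x + threadIdx.x;
	if (idx < nWords)
		ptr[idx] = pattern;   // one full 32-bit word per thread
}

// host side: replicate the byte into a 32-bit pattern (128 -> 0x80808080)
unsigned int pattern = value | (value << 8) | (value << 16) | (value << 24);
int nWords = (int)(byteCount / sizeof(unsigned int));
fillKernel32<<< (nWords + 255) / 256, 256 >>>( (unsigned int*)pImg_d, pattern, nWords );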

Thanks for the help.

I changed the whole thing to operate on floats (4 bytes on my machine) and adapted the kernel according to your advice. However, the processing time has now risen to 7 ms (Release). Another strange thing: when I change the block size to 64x64 or above, it doesn't work anymore.

#include <cutil_inline.h>

__global__ void myKernel( float *ptr, float value ) {

        int x = threadIdx.x + blockIdx.x * blockDim.x;

	int y = threadIdx.y + blockIdx.y * blockDim.y;

	int pitch = blockDim.x * gridDim.x;

	int idx = x + y * pitch;

ptr[idx] = value;

}

extern "C" void cuSetValue(float* image, int width, int height, size_t byteCount, float value) {

        float* pImgD;

cutilSafeCall( cudaMalloc(  (void**)&pImgD, byteCount ) );

dim3 threadsPerBlock(32, 32);  // 1024 threads/block

        dim3 numBlocks(width/threadsPerBlock.x, height/threadsPerBlock.y); 

myKernel<<< numBlocks,threadsPerBlock >>>( pImgD, value );

cutilSafeCall( cudaMemcpy( image, pImgD, byteCount, cudaMemcpyDeviceToHost ) );

cudaFree( pImgD );

}

and maybe the main() function is of importance too:

extern "C" void cuSetValue(float*, int, int, size_t, float);

int main(){

        StopWatch myTimer;

        unsigned int timer;

const float FILL_VALUE = 0.5;

try {   

                cv::Mat image  = cv::imread("a1_Referenz.png",0);

		cv::Mat image32; image.convertTo(image32, CV_32FC1, 1/255.0);

// Serial version using openCV

                myTimer.startTimer();

                     image = FILL_VALUE;

                myTimer.stopTimer();

                printf( "CPU Set: %f (ms)\n", myTimer.getElapsedTime());        

		size_t byteCount = image32.total()*image32.elemSize();

cuSetValue((float*)image32.data, image.rows, image.cols, byteCount, FILL_VALUE);

// time second call...

                CUT_SAFE_CALL(cutCreateTimer(&timer));

                CUT_SAFE_CALL(cutStartTimer(timer));

                        cuSetValue((float*)image32.data, image.rows, image.cols, byteCount,  FILL_VALUE);

                CUT_SAFE_CALL(cutStopTimer(timer));

printf( "GPU Set: %f (ms)\n", cutGetTimerValue(timer));

                CUT_SAFE_CALL(cutDeleteTimer(timer));

		//cv::imshow("Result", image32);

		//cv::waitKey();

} catch(const cv::Exception& e) {

                printf("%s", e.what());

        }

while (1) if ('\n' == getchar()) break;

return 0;

 }

CUDA has a maximum-threads-per-block restriction. I remember it was 512, but it was recently increased to 1024 for compute capability 2.x cards and above.
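If in doubt, you can also query the limits of your card at runtime; a small sketch (assuming device 0):

cudaDeviceProp prop;
cutilSafeCall( cudaGetDeviceProperties(&prop, 0) );
printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
printf("max block dimensions:  %d x %d x %d\n",
       prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);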

Also, the suggestion to operate on floats (or ints) means you still store the image as chars, but cast the pointer to int* or float*. In your kernel, each thread then processes the 4 bytes packed into that int or float and writes them back. Sort of like:

__global__ void myKernel( unsigned int *ptr, unsigned char value ) {

    int x = threadIdx.x + blockIdx.x * blockDim.x;

    int y = threadIdx.y + blockIdx.y * blockDim.y;

    int pitch = blockDim.x * gridDim.x;

    int idx = x + y * pitch;

    unsigned int valuetemp = value;

    ptr[idx] = valuetemp | (valuetemp<<8) | (valuetemp<<16) | (valuetemp<<24);

}

Note that with this cast, the number of elements (and threads) you need drops by a factor of 4. Also, you don't have to stick with 2D in this case; just treat it as a 1D array and use 1D blocks. It is much easier to calculate the index.

And probably also faster, due to more contiguous memory access patterns (more accesses to already-open DRAM pages).

5 MP = 5 MB (or 15 MB with three channels).

5 MB in about 3.1 ms gives roughly 1.6 GB/s of memory bandwidth for the CPU, which is good. On a GPU you can reach about 50 GB/s, but even launching an empty kernel has a fixed overhead, so your 2.7 ms is reasonable; try with a bigger image.

Use only 5 MB / 4 = 1,250,000 threads, for example:

<<< dim3(1024,5,1), dim3(64,4,1) >>>

Hey,

I think I got the point thanks to springc.

Here is the adapted code; I am using 64-bit accesses instead of 32-bit:

typedef unsigned long long int uint64;

__global__ void myKernel(uint64* ptr, unsigned char value ) {

        int idx = threadIdx.x + blockIdx.x * blockDim.x;

	uint64 tValue = value;

		

	for (int i = 8; i < sizeof(uint64)*8; i+=8)

		tValue |=  tValue<<i;

	ptr[idx] = tValue;

}

extern "C" void cuSetValue(unsigned char* image, int byteCount, unsigned char value) {

        uint64* pImgD;

cutilSafeCall( cudaMalloc(  (void**)&pImgD, byteCount ) );

dim3 threadsPerBlock(64);

	dim3  numBlocks( ( byteCount/sizeof(uint64) )/threadsPerBlock.x );

myKernel<<< numBlocks,threadsPerBlock >>>( pImgD, value );

cutilSafeCall( cudaMemcpy( image, pImgD, byteCount, cudaMemcpyDeviceToHost ) );

cudaFree( pImgD );

}

The code needs on average ~1.9 ms to process the image. Is that it, or is there still room for improvement?

yours

Unroll the for loop.

Increase threadsPerBlock.

And like others mentioned, use a larger image to highlight the difference between the CPU and the GPU. In theory, your code should be almost as fast as a plain device-memory copy, since in your case the computation is minimal.

I suggest you read the SDK examples and see how they optimize. A good example is matrix transpose; the whitepaper is here:

http://developer.download.nvidia.com/compute/DevZone/C/html/C/src/transpose/doc/MatrixTranspose.pdf

See how they start with a simple copy kernel, use its performance as a baseline, and optimize their matrix transpose against it. You will be amazed how much they improve over the naive transpose implementation to approach the performance of the simple copy. Note that you can skip the partition-camping part, as the new architecture overcomes this issue.
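A quick baseline for the copy-speed argument above: you could also just time a plain cudaMemset on the same device buffer (a sketch reusing your pImgD, byteCount and value); your fill kernel should end up in the same ballpark.

cutilSafeCall( cudaMemset( pImgD, value, byteCount ) );  // fills every byte with 'value'
cutilSafeCall( cudaThreadSynchronize() );                // finish before stopping any CPU timer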

For 5,000,000 floats, i.e. 20 MB, it takes 0.50 ms on an 8800 GTX => 40 GB/s; my card can do 50 GB/s at most.

#include <stdio.h>  

 #include <cuda.h>  

 #include <time.h>

 #include <math.h>

 #include "cutil_inline.h"

__global__ void square_array(float *a,float b,int N)  

 {  

	    

	    int sh=  threadIdx.x+threadIdx.y*32; 

	    int id =  512*blockIdx.x + 1048576* blockIdx.y + sh;

if ( id<N)

	{

	a[id]=b;

}  

}

// main routine that executes on the host  

 int main(void)  

 {  

   float *memoirecpu1, *memoiregraphique1;  // Pointer to host & device arrays  

   cudaEvent_t start, stop;

FILE *stream, *stream2;

int N;

N=5000000;

size_t size = N * sizeof(float);  

   memoirecpu1 = (float *)malloc(size);        // Allocate array on host  

cutilSafeCall( cudaEventCreate(&start) );

    cutilSafeCall( cudaEventCreate(&stop)  );

    unsigned int timer;

    cutilCheckError(  cutCreateTimer(&timer)  );

    cutilCheckError(  cutResetTimer(timer)    );

    cutilSafeCall( cudaThreadSynchronize() );

    float gpu_time = 0.0f;

//---------------------------

  cudaMalloc((void **) &memoiregraphique1, size);   // Allocate array on device  

   cutilCheckError( cutStartTimer(timer) );

float f=6.0;

cudaEventRecord(start, 0);

     // only change the 5
     // 2048*32 = 65536: the maximum grid width on my card
     // 32*16 = 512 threads/block: the maximum on my card
     // 2048*5*32*16 = 5,242,880 threads
     // with 65 instead of 5 you could cover about 68,157,000 floats

square_array <<< dim3 (2048,5,1),dim3(32,16,1)  >>> (memoiregraphique1, f,N);  

    cudaEventRecord(stop, 0);

    unsigned long int counter=0;

    while( cudaEventQuery(stop) == cudaErrorNotReady )

    {

        counter++;

    }

    cutilSafeCall( cudaEventElapsedTime(&gpu_time, start, stop) );        

    printf("time spent executing by the GPU: %.6f\n", gpu_time);

cutilCheckError( cutStopTimer(timer) );

    printf("time spent by CPU in CUDA calls: %.2f\n", cutGetTimerValue(timer) );

   cudaMemcpy(memoirecpu1, memoiregraphique1, size, cudaMemcpyDeviceToHost);  

for (int i=0; i<N; i=i+1) 

 {

   if (memoirecpu1[i]!= 6.0)

   printf("%d %f\n", i, memoirecpu1[i]);  

}

//------------------------------

free(memoirecpu1); cudaFree(memoiregraphique1);  

}

Thanks again to both of you.

I'll do some reading now and post my results once I've understood the concepts.
Thanks for the great link, btw!

Hello,

it turned out that the major bottleneck is the copy from the device back to the host. When just the kernel's run time is measured:

#include <cutil_inline.h>

typedef unsigned long long int uint64;

__global__ void myKernel(uint64* ptr, uint64 value, const int myIts) {

        int idx = threadIdx.x + blockIdx.x * blockDim.x;

	if( idx < myIts)

	    ptr[idx] = value;

}

extern "C" void cuSetValue(unsigned char* image, int width, int height, int byteCount, unsigned char value) {

        uint64* pImgD;

	unsigned int timer;

	uint64 tValue = value;

	for (int i = 8; i < sizeof(uint64)*8; i+=8)

	      tValue |=  tValue<<i;

cutilSafeCall( cudaMalloc(  (void**)&pImgD, byteCount ) );

dim3 threadsPerBlock(256);

	dim3  numBlocks( ( byteCount/sizeof(uint64) )/threadsPerBlock.x );

	CUT_SAFE_CALL(cutCreateTimer(&timer));

        CUT_SAFE_CALL(cutStartTimer(timer));

	     myKernel<<< numBlocks,threadsPerBlock >>> ( pImgD, tValue, byteCount/sizeof(uint64) );

	CUT_SAFE_CALL(cutStopTimer(timer));

printf( "GPU Set: %.4f (ms)\n", cutGetTimerValue(timer));

        CUT_SAFE_CALL(cutDeleteTimer(timer));

	cutilSafeCall( cudaMemcpy( image, pImgD, byteCount, cudaMemcpyDeviceToHost ) );

        cudaFree( pImgD );

}

the result is ~8.4 microseconds of kernel time, without the copy.

I tried to compare this using the metric introduced in the white paper:

imageSize = 2432 x 2176 bytes ≈ 5.29 MB

timing = 0.0084 ms

effective bandwidth = 2 × imageSize / timing = (2 * 5.29e-3 GB) / (8.4e-6 s) ≈ 1260 GB/s

That seems a little too high to me compared to cricri1's 40 GB/s.

yours

Use a timer like mine, with start and stop events (the kernel launch is asynchronous, so stopping a CPU-side timer without synchronizing only measures the launch overhead).

That gives me 0.048 ms => ~218 GB/s, or 109 GB/s when I skip the multiplication by 2.

Last question, promised =) :

What is the difference when using “cutCreateTimer” and “cudaEventCreate”?

I can’t find any documentation for “cutCreateTimer”.

yours

You must skip the ×2, because you don't read and write, you only write to memory. So 109 GB/s.

And I can't help you with the timer; I'm a beginner, and what I know I found on the internet.

The reason you can't find any documentation for “cutCreateTimer” is that it's part of the cutil library, a utility library used by the SDK samples; those functions are not intended for end users. Use CUDA events for timing if you have a choice in the matter.
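For reference, the event-based pattern looks roughly like this (a minimal sketch; put whatever you want to time between the two records):

cudaEvent_t start, stop;
cutilSafeCall( cudaEventCreate(&start) );
cutilSafeCall( cudaEventCreate(&stop) );

cudaEventRecord(start, 0);
// ... the work to be timed goes here (e.g. the cuSetValue call) ...
cudaEventRecord(stop, 0);
cutilSafeCall( cudaEventSynchronize(stop) );   // wait until the GPU work has finished

float elapsedMs = 0.0f;
cutilSafeCall( cudaEventElapsedTime(&elapsedMs, start, stop) );
printf("GPU Set: %f (ms)\n", elapsedMs);

cudaEventDestroy(start);
cudaEventDestroy(stop);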

EDIT: For the sake of completeness, you can find the documentation for the cutil functions in cutil.h. The ones relevant to timing are:

////////////////////////////////////////////////////////////////////////////

    //! Timer functionality

////////////////////////////////////////////////////////////////////////////

    //! Create a new timer

    //! @return CUTTrue if a time has been created, otherwise false

    //! @param  name of the new timer, 0 if the creation failed

    ////////////////////////////////////////////////////////////////////////////

    DLL_MAPPING

    CUTBoolean CUTIL_API 

    cutCreateTimer( unsigned int* name);

////////////////////////////////////////////////////////////////////////////

    //! Delete a timer

    //! @return CUTTrue if a time has been deleted, otherwise false

    //! @param  name of the timer to delete

    ////////////////////////////////////////////////////////////////////////////

    DLL_MAPPING

    CUTBoolean CUTIL_API 

    cutDeleteTimer( unsigned int name);

////////////////////////////////////////////////////////////////////////////

    //! Start the time with name \a name

    //! @param name  name of the timer to start

    ////////////////////////////////////////////////////////////////////////////

    DLL_MAPPING

    CUTBoolean CUTIL_API 

    cutStartTimer( const unsigned int name);

////////////////////////////////////////////////////////////////////////////

    //! Stop the time with name \a name. Does not reset.

    //! @param name  name of the timer to stop

    ////////////////////////////////////////////////////////////////////////////

    DLL_MAPPING

    CUTBoolean CUTIL_API 

    cutStopTimer( const unsigned int name);

////////////////////////////////////////////////////////////////////////////

    //! Resets the timer's counter.

    //! @param name  name of the timer to reset.

    ////////////////////////////////////////////////////////////////////////////

    DLL_MAPPING

    CUTBoolean CUTIL_API 

    cutResetTimer( const unsigned int name);

////////////////////////////////////////////////////////////////////////////

    //! Returns total execution time in milliseconds for the timer over all 

    //! runs since the last reset or timer creation.

    //! @param name  name of the timer to return the time of

    ////////////////////////////////////////////////////////////////////////////

    DLL_MAPPING

    float CUTIL_API 

    cutGetTimerValue( const unsigned int name);

////////////////////////////////////////////////////////////////////////////

    //! Return the average time in milliseconds for timer execution as the 

    //! total  time for the timer dividied by the number of completed (stopped)

    //! runs the timer has made.

    //! Excludes the current running time if the timer is currently running.

    //! @param name  name of the timer to return the time of

    ////////////////////////////////////////////////////////////////////////////

    DLL_MAPPING

    float CUTIL_API 

    cutGetAverageTimerValue( const unsigned int name);

but again, just keep in mind that “CUTIL is not part of the CUDA Toolkit and is not supported by NVIDIA.”