Timing the code

Hi Guys,

I am facing a strange bug in my program. I recently got a new graphics card, a GTS 450; my old one was an 8400 GS. On the old card the timing of the following code was around 0.6 ms, but on the new card it shoots up to 2.2 ms. The code is as follows:

void gpu_blur(gpu_context_t *ctx, int KERNEL_RADIUS)
{
    assert(KERNEL_RADIUS);

    gpu_error_t error = GPU_OK;

    struct timeval startTime;
    struct timeval endTime;

    float elapsedtime;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    gettimeofday(&startTime, NULL);

    int KERNEL_LENGTH = (2 * KERNEL_RADIUS + 1);
    const int imageW = ctx->width;
    const int imageH = ctx->height;

    // host-side Gaussian kernel and input image
    float *tempKernel;
    unsigned char *in;
    tempKernel = (float *)malloc(KERNEL_LENGTH * sizeof(float));
    in = ctx->output_buffer_1;

    // device-side source array and intermediate output buffer
    cudaArray *src;
    cudaChannelFormatDesc floatTex = cudaCreateChannelDesc<unsigned char>();
    cudaMallocArray(&src, &floatTex, imageW, imageH);

    unsigned char *tempOutput;
    cudaMalloc((void **)&tempOutput, imageW * imageH);
    error = checkCudaError();

    ////////////// calculating kernel //////////////
    float sum = 0;
    for (int i = 0; i < KERNEL_LENGTH; i++)
    {
        float dist = (float)(i - KERNEL_RADIUS) / (float)KERNEL_RADIUS;
        tempKernel[i] = expf(-dist * dist / 2);
        sum += tempKernel[i];
    }
    for (int i = 0; i < KERNEL_LENGTH; i++)
        tempKernel[i] /= sum;

    cudaMemcpyToSymbol(Kernel, tempKernel, KERNEL_LENGTH * sizeof(float));
    ////////////////////////////////////////////////

    cudaMemcpyToArray(src, 0, 0, in, imageW * imageH, cudaMemcpyHostToDevice);
    convolutionRowsGPU(tempOutput, src, imageW, imageH,
                       KERNEL_RADIUS, KERNEL_LENGTH, ctx->threadsX, ctx->threadsY);

    if (checkCudaError() == GPU_OK)
    {
        cudaMemcpyToArray(src, 0, 0, tempOutput, imageW * imageH, cudaMemcpyDeviceToDevice);
        convolutionColumnsGPU(tempOutput, src, imageW, imageH,
                              KERNEL_RADIUS, KERNEL_LENGTH, ctx->threadsX, ctx->threadsY);
    }

    gettimeofday(&endTime, NULL);   // As soon as I go past this line the timing shoots up to 2.2 ms; before this line it is 0.3 ms.

    /******** Problem starts here *******/
    cudaMemcpy(in, tempOutput, imageW * imageH, cudaMemcpyDeviceToHost);
    cudaMemcpy(ctx->gpu_buffer_1, tempOutput, imageW * imageH, cudaMemcpyDeviceToDevice);
    error = checkCudaError();

    cudaFree(tempOutput);
    cudaFreeArray(src);

    double tS = startTime.tv_sec * 1000000 + (startTime.tv_usec);
    double tE = endTime.tv_sec * 1000000 + (endTime.tv_usec);
    fprintf(stderr, "Smoothing_GPU_1:%lf \n", (tE - tS) / 1000);

    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&elapsedtime, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    fprintf(stderr, "Smoothing_GPU:%lf \n", elapsedtime);
}

Each of the kernel calls finishes within 0.3 ms, but as soon as I reach the cudaMemcpy line the timing shoots up to 2.2 ms.

Is there something wrong with my approach? Should I use cudaThreadSynchronize()? Would __syncthreads() work instead of cudaThreadSynchronize()?
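Just to show what I mean, this is where I would put the synchronization call. It is only a sketch, using the same endTime variable as in the code above:

    /* Sketch only: block the host until the row/column kernels have
       actually finished before reading the wall clock. */
    cudaThreadSynchronize();
    gettimeofday(&endTime, NULL);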

Come on people, isn't there anyone who can respond to my query?

Does a second call to the same kernel take the same time or is it faster? How do you compile the code? If your binary does not include code for sm_21 or sm_20, the just-in-time compiler is invoked when a kernel is first used.
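If you want to separate a one-time JIT cost from the steady-state cost, you could bracket two back-to-back launches with events, roughly like this. This is only a sketch that reuses the names from your code; the t0/t1/t2 events are just for the measurement, and error checking is omitted:

    cudaEvent_t t0, t1, t2;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    cudaEventCreate(&t2);

    cudaEventRecord(t0, 0);
    // first launch: may include one-time overhead such as JIT compilation
    convolutionRowsGPU(tempOutput, src, imageW, imageH,
                       KERNEL_RADIUS, KERNEL_LENGTH, ctx->threadsX, ctx->threadsY);
    cudaEventRecord(t1, 0);
    // second launch of the identical kernel: steady-state cost
    convolutionRowsGPU(tempOutput, src, imageW, imageH,
                       KERNEL_RADIUS, KERNEL_LENGTH, ctx->threadsX, ctx->threadsY);
    cudaEventRecord(t2, 0);
    cudaEventSynchronize(t2);

    float firstMs, secondMs;
    cudaEventElapsedTime(&firstMs, t0, t1);
    cudaEventElapsedTime(&secondMs, t1, t2);
    fprintf(stderr, "first: %f ms, second: %f ms\n", firstMs, secondMs);

    cudaEventDestroy(t0);
    cudaEventDestroy(t1);
    cudaEventDestroy(t2);

If the first number is much larger than the second, the difference is one-time overhead rather than the kernel itself.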

I am working on Ubuntu 10.04 with CUDA toolkit 3.2. The compiler I am using is nvcc, but since this is part of a larger project I use gcc separately for the abc.cu.cpp files.

These are the flags I am using with nvcc: --cuda -g -G -arch sm_13.
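To be concrete, the build is a two-step affair, roughly like this (the file name abc.cu and the CUDA include path are just placeholders, not the exact values from my Makefile; the resulting abc.o is later linked with the rest of the project against -lcudart):

    # step 1: nvcc translates the .cu file into host C++ with --cuda
    nvcc --cuda -g -G -arch sm_13 abc.cu -o abc.cu.cpp
    # step 2: the generated file is compiled with gcc like the rest of the project
    gcc -c abc.cu.cpp -I/usr/local/cuda/include -o abc.o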

I would also like to tell you that if I place my timing code here:

    if (checkCudaError() == GPU_OK)
    {
        cudaMemcpyToArray(src, 0, 0, tempOutput, imageW * imageH, cudaMemcpyDeviceToDevice);
        convolutionColumnsGPU(tempOutput, src, imageW, imageH,
                              KERNEL_RADIUS, KERNEL_LENGTH, ctx->threadsX, ctx->threadsY);
        convolutionColumnsGPU(tempOutput, src, imageW, imageH,
                              KERNEL_RADIUS, KERNEL_LENGTH, ctx->threadsX, ctx->threadsY);   // This is the redundant kernel call.
    }

    gettimeofday(&endTime, NULL);   // As soon as I go past this line the timing shoots up to 2.2 ms; before this line it is 0.3 ms.

    ///// replaced timing code ////////
    double tS = startTime.tv_sec * 1000000 + (startTime.tv_usec);
    double tE = endTime.tv_sec * 1000000 + (endTime.tv_usec);
    fprintf(stderr, "Smoothing_GPU_1:%lf \n", (tE - tS) / 1000);

    /******** Problem starts here *******/
    cudaMemcpy(in, tempOutput, imageW * imageH, cudaMemcpyDeviceToHost);
    cudaMemcpy(ctx->gpu_buffer_1, tempOutput, imageW * imageH, cudaMemcpyDeviceToDevice);
    error = checkCudaError();

    cudaFree(tempOutput);
    cudaFreeArray(src);

The timing of Smoothing_GPU_1 is 0.3 ms and Smoothing_GPU is 2.2 ms.

And after the extra call to convolutionColumnsGPU (as shown), the timing remains almost the same; I can't tell the difference.

Compile with -arch sm_21 when running on the GTS 450. Or compile with -gencode arch=sm_10 -gencode arch=sm_21 to produce a fat binary with code for both cards.
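Note that the fully spelled-out -gencode form pairs a virtual architecture with a real one; the short form above leaves out the code= part. Spelled out, it would look something like this (the trailing ... stands for the rest of your flags and files):

    nvcc -gencode arch=compute_10,code=sm_10 -gencode arch=compute_20,code=sm_21 ...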

These are the flags to nvcc: $(NVCC) --cuda -g -G -gencode arch=sm_20

The error I get is: nvcc fatal : Option '--generate-code arch=sm_20', missing code

And I compiled with -arch sm_21, but the timing is still the same. I can't try it on the old graphics card (I don't have it anymore).

I also found something interesting. If I time my code as shown below:

    if (checkCudaError() == GPU_OK)
    {
        cudaMemcpyToArray(src, 0, 0, tempOutput, imageW * imageH, cudaMemcpyDeviceToDevice);
        convolutionColumnsGPU(tempOutput, src, imageW, imageH,
                              KERNEL_RADIUS, KERNEL_LENGTH, ctx->threadsX, ctx->threadsY);
    }

    gettimeofday(&endTime, NULL);   // As soon as I go past this line the timing shoots up to 2.2 ms; before this line it is 0.3 ms.

    double tS = startTime.tv_sec * 1000000 + (startTime.tv_usec);
    double tE = endTime.tv_sec * 1000000 + (endTime.tv_usec);
    fprintf(stderr, "Smoothing_GPU_1:%lf \n", (tE - tS) / 1000);

    //////////// Replaced Lines //////////////
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&elapsedtime, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    /////////////////////////////////////////

    /******** Problem starts here *******/
    cudaMemcpy(in, tempOutput, imageW * imageH, cudaMemcpyDeviceToHost);
    cudaMemcpy(ctx->gpu_buffer_1, tempOutput, imageW * imageH, cudaMemcpyDeviceToDevice);
    error = checkCudaError();

    cudaFree(tempOutput);
    cudaFreeArray(src);

Then the timing of the same code from the CPU (gettimeofday) calls is 0.13 ms, while the CUDA event calls report 1.7 ms.

What's happening there?