CUDA GPU slower than CPU

qnoob · April 30, 2012, 3:23am

Hello, I am having trouble figuring out why my CUDA code runs slower than my CPU code.

My desktop configuration is an i7 2600S with a GeForce 560 Ti,

and my code is as follows:

int** kernel_shiftSeam(int **MCEnergyMat, int **newE, int *seam, int width, int height, int direction)
{
//time measurement
float elapsed_time_ms = 0;
cudaEvent_t start, stop;

//threads per block
dim3 threads(16,16);
//blocks
dim3 blocks((width+threads.x-1)/threads.x, (height+threads.y-1)/threads.y);

int *device_Seam;

int *host_Seam;

int seamSize;
if(direction == 1)
{
	seamSize = height*sizeof(int);
	host_Seam = (int*)malloc(seamSize);
	for(int i=0;i<height;i++)
		host_Seam[i] = seam[i];
}
else
{
	seamSize = width*sizeof(int);
	host_Seam = (int*)malloc(seamSize);
	for(int i=0;i<width;i++)
		host_Seam[i] = seam[i];
}

cudaMalloc((void**)&device_Seam, seamSize);
cudaMemcpy(device_Seam, host_Seam, seamSize, cudaMemcpyHostToDevice);

global_host_MC = MCEnergyMat;
new_host_MC = newE;

//copy host array to device
cudaMemcpy(global_MC, global_MC2, sizeof(int*)*width, cudaMemcpyHostToDevice);
for(int i=0;i<width;i++)
	cudaMemcpy(global_MC2[i], global_host_MC[i], sizeof(int)*height, cudaMemcpyHostToDevice);
	
cudaMemcpy(new_MC, new_MC2, sizeof(int*)*width, cudaMemcpyHostToDevice);
for(int i=0;i<width;i++)
	cudaMemcpy(new_MC2[i], new_host_MC[i], sizeof(int)*height, cudaMemcpyHostToDevice);


cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);

//do some operations on the 2d matrix
gpu_shiftSeam<<< blocks,threads >>>(global_MC, new_MC, device_Seam, width, height);

//measure end time for the gpu kernel calculations
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&elapsed_time_ms, start, stop );

execTime += elapsed_time_ms;

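//copy out the data back to host (RESULT)
for(int i=0;i<width;i++)
{
	cudaMemcpy(newE[i], new_MC2[i], sizeof(int)*height, cudaMemcpyDeviceToHost);
}

//release per-call resources so they do not leak on every iteration
cudaEventDestroy(start);
cudaEventDestroy(stop);
cudaFree(device_Seam);
free(host_Seam);

return newE;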

}

I looped it 800 times and got the following results:

GPU
Computation Time (the gpu_shiftseam part) : 1176ms
Total program run time: 22s

CPU
Computation Time (same operation as gpu_shiftseam but on host) : 12522ms
Total program run time: 12s

Apparently the GPU computation time is much shorter than the CPU one, but
for some reason the total program run time for the GPU version is a lot longer.
Does anyone know why? Is it because the number of threads/blocks I am assigning
is incorrect? Or is the “slowness” coming from allocating memory on the device?

Thanks a lot!

pasoleatis · April 30, 2012, 9:33am

There is overhead from “booting” the card and from memory allocations. Try measuring the total execution time without the shift part, just to see how large those costs are.
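
One way to see where the time goes is to time each phase of the function separately on the host. The following is only a minimal sketch: time_phase_ms and PhaseTimes are hypothetical names that do not appear in the original post, and the phase split (allocation, upload, kernel, download) is an assumption about how the function could be divided.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper (not from the original post): runs `phase` once and
// returns the elapsed wall-clock time in milliseconds.
template <typename F>
double time_phase_ms(F&& phase)
{
    const auto t0 = std::chrono::steady_clock::now();
    phase();
    const auto t1 = std::chrono::steady_clock::now();
    const double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    assert(ms >= 0.0); // steady_clock is monotonic, so this should never fire
    return ms;
}

// Hypothetical accumulator for the per-phase totals over all iterations.
struct PhaseTimes {
    double alloc_ms = 0.0;    // cudaMalloc / malloc
    double upload_ms = 0.0;   // host-to-device cudaMemcpy calls
    double kernel_ms = 0.0;   // kernel launch plus synchronization
    double download_ms = 0.0; // device-to-host cudaMemcpy calls

    double total_ms() const
    {
        return alloc_ms + upload_ms + kernel_ms + download_ms;
    }
};
```

Inside the function, each group of calls would be wrapped in a lambda, for example `times.upload_ms += time_phase_ms([&]{ /* the host-to-device cudaMemcpy loops */ });`. Because kernel launches return before the GPU has finished, a timed phase that contains GPU work should end with cudaDeviceSynchronize() so the asynchronous work is included in the measurement.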

parallelis · May 1, 2012, 3:36pm

This is mainly the memory copy operations, and the fact that you don’t interleave memory copies with kernel execution.

Each cudaMemcpy also has a large fixed overhead when copying small amounts of data. I suggest preparing a single large block and copying it with just one cudaMemcpy() instead of looping, and interleaving block preparation and copying on the CPU while the kernel is running on the GPU.
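
The “single block” suggestion can be sketched on the host side as follows. This is an illustration under assumptions, not the original poster's code: pack_rows and unpack_rows are hypothetical helpers, and the sketch assumes the matrix is stored as `width` separately allocated rows of `height` ints, as the copy loops in the question suggest.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: copies a matrix stored as `width` separate rows of
// `height` ints into one contiguous buffer. Element (i, j) is placed at
// index i * height + j.
std::vector<int> pack_rows(int* const* rows, int width, int height)
{
    assert(rows != nullptr && width >= 0 && height >= 0);
    std::vector<int> flat(static_cast<std::size_t>(width) * height);
    for (int i = 0; i < width; ++i)
        for (int j = 0; j < height; ++j)
            flat[static_cast<std::size_t>(i) * height + j] = rows[i][j];
    return flat;
}

// Hypothetical inverse: scatters a contiguous buffer back into the rows.
void unpack_rows(const std::vector<int>& flat, int** rows, int width, int height)
{
    assert(rows != nullptr);
    assert(flat.size() == static_cast<std::size_t>(width) * height);
    for (int i = 0; i < width; ++i)
        for (int j = 0; j < height; ++j)
            rows[i][j] = flat[static_cast<std::size_t>(i) * height + j];
}
```

With the data packed this way, the per-row loops could be replaced by a single call such as `cudaMemcpy(d_flat, flat.data(), flat.size() * sizeof(int), cudaMemcpyHostToDevice)`, where d_flat is an assumed device buffer of the same size obtained from cudaMalloc. The kernel would then have to index the matrix as `i * height + j` instead of following row pointers. Keeping the matrix flat on the host all the time would avoid the packing step entirely.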
