Cannot find a reason why the CPU is much faster than the GPU in a simple program

Hello, developers.
I want to write a simple program that shows the GPU being much faster than the CPU.
But I got the opposite result: the CPU is much faster than the GPU.
I know this result is wrong, but I don’t know why…

I’m using Visual Studio 2017 with the CUDA 10.0 toolkit.

--- General Information for device 0 ---
Name:  GeForce GTX 1060
Compute capability:  6.1
Clock rate:  1670500
Device copy overlap:  Enabled
Kernel execution timeout :  Enabled
   --- Memory Information for device 0 ---
Total global mem:  -2147483648
Total constant Mem:  65536
Max mem pitch:  2147483647
Texture Alignment:  512
   --- MP Information for device 0 ---
Multiprocessor count:  10
Shared mem per mp:  49152
Registers per mp:  65536
Threads in warp:  32
Max threads per block:  1024
Max thread dimensions:  (1024, 1024, 64)
Max grid dimensions:  (2147483647, 65535, 65535)
#include <iostream>
#include <time.h>
#include <windows.h>

#include <cuda_runtime.h>
#include <device_launch_parameters.h>

#define N 10
//#define N 10000

using namespace std;

__global__ void add2(int *a, int *b, int *c) {
	// Index by block: one block per element
	int tid = blockIdx.x;		// with a small N the GPU looks slower than the CPU; with a large N it is much faster

	// Index by thread: one thread per element in a single block
	//int tid = threadIdx.x;	// same result as blockIdx.x

	// Full grid index: works for arbitrarily long vectors	// same result as above
	//int tid = threadIdx.x + blockIdx.x*blockDim.x;
	if (tid < N) {
		c[tid] = a[tid] + b[tid];
	}
}

void add(int *a, int *b, int *c) {
	int tid = 0;
	while (tid < N) {
		c[tid] = a[tid] + b[tid];
		tid += 1;
	}
}

int main() {
	// Values for time duration
	LARGE_INTEGER tFreq, tStart, tEnd;
	cudaEvent_t start, stop;
	float tms, ms;

	int a[N], b[N], c[N];		// host (CPU) arrays
	int *dev_a, *dev_b, *dev_c;	// device (GPU) pointers

	// Allocate device memory
	cudaMalloc((void**)&dev_a, N * sizeof(int));
	cudaMalloc((void**)&dev_b, N * sizeof(int));
	cudaMalloc((void**)&dev_c, N * sizeof(int));

	// Fill 'a' and 'b' from CPU
	for (int i = 0; i < N; i++) {
		a[i] = -i;
		b[i] = i * i;
	}

	// Copy host arrays to the device
	cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
	cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);


	//////////////////////////////////////
	QueryPerformanceFrequency(&tFreq);  // Frequency set
	QueryPerformanceCounter(&tStart);   // Time count Start

	// CPU operation
	add(a, b, c);

	//////////////////////////////////////
	QueryPerformanceCounter(&tEnd);     // Time count end
	tms = ((tEnd.QuadPart - tStart.QuadPart) / (float)tFreq.QuadPart) * 1000;
	//////////////////////////////////////

	// show result of CPU
	cout << fixed;
	cout.precision(50);
	cout << "CPU Time=" << tms << endl << endl;

	/*for (int i = 0; i < N; i++) {
		printf("CPU calculate = %d + %d = %d\n", a[i],b[i],c[i]);
	}*/

	cout << endl;

	///////////////////////////////////////
	cudaEventCreate(&start);
	cudaEventCreate(&stop);
	cudaEventRecord(start, 0);
	// GPU operation
	add2 <<<N,1>>> (dev_a, dev_b, dev_c);		// one block per element
	//add2 <<<1,N>>> (dev_a, dev_b, dev_c);		// one thread per element, single block
	//add2 <<<128,128>>> (dev_a, dev_b, dev_c);	// full grid: 128 blocks x 128 threads

	///////////////////////////////////////
	cudaEventRecord(stop, 0);
	cudaEventSynchronize(stop);
	cudaEventElapsedTime(&ms, start, stop);
	///////////////////////////////////////
	// show result of GPU
	cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost);
	cout << fixed;
	cout.precision(50);
	cout << "GPU Time=" << ms << endl << endl;


	/*for (int i = 0; i < N; i++) {
		printf("GPU calculate = %d + %d = %d\n", a[i], b[i], c[i]);
	}*/

	// Clean up events and device memory
	cudaEventDestroy(start);
	cudaEventDestroy(stop);
	cudaFree(dev_a);
	cudaFree(dev_b);
	cudaFree(dev_c);

	return 0;
}

Please help me.
Thank you.

simple_time_duration_GPU_CPU.cu (3.11 KB)

set N to 128*128
Use your add2 invocation with <<<128,128>>>
use the full grid tid variant in your kernel code

make sure you are building/timing a release project, not a debug project
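For reference, here is a minimal sketch of those three changes applied to the original program (GPU side only; the CPU timing path with QueryPerformanceCounter stays as it was). The untimed warm-up launch is an extra step that is not in the original code, added so that one-time CUDA initialization cost does not land inside the timed region:

#include <cstdio>
#include <cuda_runtime.h>
#include <device_launch_parameters.h>

#define N (128 * 128)

__global__ void add2(int *a, int *b, int *c) {
	int tid = threadIdx.x + blockIdx.x * blockDim.x;	// full grid index
	if (tid < N) {
		c[tid] = a[tid] + b[tid];
	}
}

int main() {
	static int a[N], b[N], c[N];	// host arrays
	for (int i = 0; i < N; i++) {
		a[i] = -i;
		b[i] = i * i;
	}

	int *dev_a, *dev_b, *dev_c;
	cudaMalloc((void**)&dev_a, N * sizeof(int));
	cudaMalloc((void**)&dev_b, N * sizeof(int));
	cudaMalloc((void**)&dev_c, N * sizeof(int));
	cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
	cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);

	cudaEvent_t start, stop;
	cudaEventCreate(&start);
	cudaEventCreate(&stop);

	add2<<<128, 128>>>(dev_a, dev_b, dev_c);	// warm-up launch (not timed, not in the original)
	cudaDeviceSynchronize();

	cudaEventRecord(start, 0);
	add2<<<128, 128>>>(dev_a, dev_b, dev_c);	// 128 blocks x 128 threads = one thread per element
	cudaEventRecord(stop, 0);
	cudaEventSynchronize(stop);

	float ms = 0.0f;
	cudaEventElapsedTime(&ms, start, stop);
	printf("GPU Time = %f ms\n", ms);

	cudaMemcpy(c, dev_c, N * sizeof(int), cudaMemcpyDeviceToHost);

	cudaEventDestroy(start);
	cudaEventDestroy(stop);
	cudaFree(dev_a);
	cudaFree(dev_b);
	cudaFree(dev_c);
	return 0;
}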

Thank you, Robert_Crovella!
With those changes the GPU is now faster than the CPU.
I really want to say thank you very much.

But if I’m not bothering you, may I ask one more question?
I had assumed that because the CPU computes step by step while the GPU computes in parallel, the GPU should be much faster than the CPU even when N is small. That is why the result of my code before your help shocked me: the GPU’s time was not what I expected.

Thank you again, with your help I got the right result. But I would like to know whether there is any way to make the GPU faster than the CPU even when N is small.

Thank you again for replying to my question, and have a nice day!

It’s harder to do when N is small because there are various overheads (for both the CPU calculation and the GPU calculation, but the GPU overheads are much larger). Furthermore, on a single-thread basis the CPU is actually faster than the GPU. The GPU is only faster when there is enough parallel work to engage all of its thread processors, and when the effect of that parallelism outweighs the disparity in overhead.

Stated another way: for the problem you have outlined, the GPU is simply not a good choice when N is small.
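One way to see that fixed overhead directly is to time a kernel that does nothing: whatever the event timer reports is essentially per-launch overhead plus event-timing granularity. This standalone sketch is not from the thread above, and the exact number it prints depends on the GPU, driver, and OS, but it is typically far larger than the time the CPU needs to add ten integers:

#include <cstdio>
#include <cuda_runtime.h>
#include <device_launch_parameters.h>

__global__ void empty_kernel() { }	// does no work at all

int main() {
	cudaEvent_t start, stop;
	cudaEventCreate(&start);
	cudaEventCreate(&stop);

	empty_kernel<<<1, 1>>>();	// warm-up: the first launch also pays one-time initialization cost
	cudaDeviceSynchronize();

	cudaEventRecord(start, 0);
	empty_kernel<<<1, 1>>>();	// timed: no work, so this is essentially launch overhead
	cudaEventRecord(stop, 0);
	cudaEventSynchronize(stop);

	float ms = 0.0f;
	cudaEventElapsedTime(&ms, start, stop);
	printf("Empty kernel launch: %f ms\n", ms);

	cudaEventDestroy(start);
	cudaEventDestroy(stop);
	return 0;
}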