Device emulation faster than GPU

I’m struggling to get the following CUDA code up to speed. The code works (i.e. it returns the right output) in both emulated and GPU mode, but the GPU binary is about four times slower than the emulated binary (20 seconds versus 5 seconds). All memory allocations on the GPU succeed in under a second, so the extra execution time of the GPU binary is in the kernel call itself. I don’t expect the GPU binary to be faster just because it runs on the GPU, since my code is probably not very well optimized, but I would expect execution times at least on par with the emulated binary. Is that a reasonable assumption? Is my kernel unfriendly in terms of GPU memory access? Am I missing something here?

#include <iostream>
#include <cstdlib>
#include <cmath>
#include <cutil.h> // for CUT_CHECK_ERROR

__global__ void kernel(float* F, float* A, float* B, float* T, int M, int N) {
	int v = threadIdx.x*M + blockIdx.x*blockDim.x*M; // start of this thread's row of M elements

	if (v < M*N) {
		// T_row = B * F_row
		for (int i = 0; i < M; i++) {
			T[v+i] = 0;
			for (int j = 0; j < M; j++) {
				T[v+i] += B[i*M+j]*F[v+j];
			}
		}
		// F_row = F_row .* A ./ T_row (element-wise)
		for (int i = 0; i < M; i++) {
			F[v+i] = F[v+i]*A[i]/T[v+i];
		}
	}
}

int main() {
	int M = 752; // data per thread
	int N = 3584; // number of threads

	float* A = (float*) malloc(M*N*sizeof(float)); for (int i = 0; i < M*N; i++) { A[i] = 1; }
	float* B = (float*) malloc(M*M*sizeof(float)); for (int i = 0; i < M*M; i++) { B[i] = 1; }
	float* F = (float*) malloc(M*N*sizeof(float)); for (int i = 0; i < M*N; i++) { F[i] = 1; }

	int blocksize = 512;
	int gridsize = (int)ceil((float)N/(float)blocksize); // 7 in this example
	std::cout << "block size: " << blocksize << " threads" << std::endl;
	std::cout << "grid size: " << gridsize << " blocks" << std::endl;

	float* A_d; cudaMalloc((void**) &A_d, M*N*sizeof(float));
	float* B_d; cudaMalloc((void**) &B_d, M*M*sizeof(float));
	float* F_d; cudaMalloc((void**) &F_d, M*N*sizeof(float));
	float* T_d; cudaMalloc((void**) &T_d, M*N*sizeof(float));

	cudaMemcpy(A_d, A, M*N*sizeof(float), cudaMemcpyHostToDevice);
	cudaMemcpy(B_d, B, M*M*sizeof(float), cudaMemcpyHostToDevice);
	cudaMemcpy(F_d, F, M*N*sizeof(float), cudaMemcpyHostToDevice);

	kernel<<<gridsize, blocksize>>>(F_d, A_d, B_d, T_d, M, N);
	CUT_CHECK_ERROR("kernel error: ");

	cudaMemcpy(F, F_d, M*N*sizeof(float), cudaMemcpyDeviceToHost);

	cudaFree(A_d);
	cudaFree(B_d);
	cudaFree(F_d);
	cudaFree(T_d);

	free(A);
	free(B);
	free(F);

	return 0;
}

Some extra info: I’m using an NVIDIA GeForce 8600 GTS with CUDA 2.1 on Windows XP x64. The compiler is invoked as

nvcc -ccbin "C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\bin" -I "C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\include" -L "C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\lib\amd64" -o main_gpu.exe main.cu

and

nvcc -deviceemu -ccbin "C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\bin" -I "C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\include" -L "C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\lib\amd64" -o main_emu.exe main.cu

for device emulation.

First, you’re only running 7 blocks, which severely underutilizes your GPU. With no more than one block resident per multiprocessor, memory stalls turn into idle periods of several hundred cycles. Decreasing blocksize to perhaps 64 should give better occupancy.
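Note that your kernel’s index computation (v = threadIdx.x*M + blockIdx.x*blockDim.x*M, i.e. the global thread id times M) already adapts to any block size, so only the launch configuration needs to change:

int blocksize = 64;
int gridsize = (N + blocksize - 1)/blocksize; // ceil(3584/64) = 56 blocks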

Also, your code is very memory-unfriendly (exacerbating the above problem), especially the innermost statement:

T[v+i] += B[i*M+j]*F[v+j];
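To put numbers on it: v differs by M = 752 between adjacent threads, so when a half-warp reads F[v+j], its 16 threads touch addresses 752*4 = 3008 bytes apart. Under the compute 1.x coalescing rules that costs 16 separate memory transactions where a coalesced read would cost one. (If I remember the rules right, the B[i*M+j] read, where every thread hits the same address, doesn’t coalesce on 1.x hardware either; constant or shared memory would suit B better.)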

The only real change I see without a major restructuring is to keep T[v+i] in a register:

if (v < M*N) {
	for (int i = 0; i < M; i++) {
		float Tvi = 0; // accumulate in a register instead of global memory
		for (int j = 0; j < M; j++) {
			Tvi += B[i*M+j]*F[v+j];
		}
		T[v+i] = Tvi;
	}
	for (int i = 0; i < M; i++) {
		F[v+i] = F[v+i]*A[i]/T[v+i];
	}
}
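(The win comes from the inner loop: the original reads and rewrites T[v+i] in global memory on every j iteration, roughly 2*M*M global accesses per thread, while the register version does a single store per i.)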

I’m curious how much of a difference these two changes would make by themselves.

You may get even better performance if the second loop runs in a separate kernel with M threads doing N iterations each, since A[i] can then be kept in a register and the loads and stores to F and T can potentially coalesce.
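Something like this minimal sketch is what I mean (the kernel name and launch configuration are just placeholders):

__global__ void scale_kernel(float* F, const float* A, const float* T, int M, int N) {
	int i = blockIdx.x*blockDim.x + threadIdx.x; // one thread per element of A
	if (i < M) {
		float Ai = A[i]; // stays in a register across all N rows
		for (int row = 0; row < N; row++) {
			int idx = row*M + i; // adjacent threads touch adjacent addresses: coalesced
			F[idx] = F[idx]*Ai/T[idx];
		}
	}
}

launched after the first kernel has filled T_d, e.g. as scale_kernel<<<(M+63)/64, 64>>>(F_d, A_d, T_d, M, N);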

Better still would be to use cublas.
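The first loop nest is just a matrix product: each thread computes T_row = B*F_row, so with F and T viewed as N x M row-major matrices the whole thing is a single sgemm. A rough sketch against the CUBLAS interface that ships with CUDA 2.1 (untested; double-check the transpose flags against your layout, and link with cublas.lib):

#include <cublas.h>

// ...
cublasInit();
// Row-major N x M buffers look like column-major M x N matrices to CUBLAS,
// and the row-major B looks like B transposed, so T^T = B * F^T becomes:
cublasSgemm('t', 'n', M, N, M, 1.0f, B_d, M, F_d, M, 0.0f, T_d, M);
cublasShutdown();

The element-wise F update afterwards would still be a small custom kernel like the one sketched above.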

Thanks for your suggestions!

I tried changing the block size, but that did not have much impact:

block size 512, no extra register: CPU: 17.9 sec; GPU: 30.2 sec
block size 64, no extra register: CPU: 17.8 sec; GPU: 31.9 sec

Saving T[v+i] in an extra register, however, did:

block size 512, extra register: CPU: 13.5 sec; GPU: 9.0 sec
block size 64, extra register: CPU: 13.5 sec; GPU: 9.0 sec

Which still isn’t that impressive.

I’ll look into your other suggestions: separate kernels and cublas.

About the separate kernels: wouldn’t it be a problem that M is only 752, so there would be only 752 threads to parallelize over?