First CUDA application, performance issues: looking for advice on how to speed it up


Writing my first CUDA application, I’ve run into some performance issues (I hope). I’m looking for advice on how to speed up the code below. I’ve experimented with the number of threads per block, and 256 seems to be the magic number. I’ve also tried __syncthreads and shared memory for the wref and dref arrays, but that only increased the execution time.

The reason I believe I have a performance issue is that the code runs only about three times faster than on a CPU (using Intel’s IPP libraries). Does anyone have thoughts on how to increase the performance?

#include "stdafx.h"
#include <cuda.h>
#include <stdio.h>
#include <float.h>   // for FLT_MIN and FLT_MAX

__global__ void CalculateWD(float* wref, float* dref, float* buffer, int length){
	unsigned int tid = threadIdx.x + blockIdx.x*blockDim.x;
	if(tid < length){
		float w = wref[tid];
		float d = dref[tid];
		float temp = 1 / w - d;
		if(temp < FLT_MIN){   // clamp very small/negative results
			temp = 1 / FLT_MAX;
		}
		buffer[tid] = temp;
	}
}



int main()
{
	int rows = 320;
	int columns = 256;
	float* wref = (float*)malloc(rows*columns*sizeof(float));
	float* dref = (float*)malloc(rows*columns*sizeof(float));
	float* buffer = (float*)malloc(rows*columns*sizeof(float));
	for(int i = 0; i < rows*columns; i++){
		wref[i] = i + 3;
		dref[i] = i + 1;
	}

	float* d_wref;
	float* d_dref;
	float* d_buffer;
	cudaMalloc((void**)&d_wref, rows*columns*sizeof(float));
	cudaMalloc((void**)&d_dref, rows*columns*sizeof(float));
	cudaMalloc((void**)&d_buffer, rows*columns*sizeof(float));
	cudaMemcpy(d_wref, wref, rows*columns*sizeof(float), cudaMemcpyHostToDevice);
	cudaMemcpy(d_dref, dref, rows*columns*sizeof(float), cudaMemcpyHostToDevice);

	int threadsPerBlock = 256;
	int blocksPerGrid = (rows*columns + threadsPerBlock-1) / threadsPerBlock;

	cudaEvent_t start, stop;
	cudaEventCreate(&start);
	cudaEventCreate(&stop);

	// runs at approx 0.033 ms (GeForce GTX 285)
	cudaEventRecord(start, 0);
	CalculateWD<<<blocksPerGrid, threadsPerBlock>>>(d_wref, d_dref, d_buffer, rows*columns);
	cudaEventRecord(stop, 0);
	cudaEventSynchronize(stop);

	float elapsedTime;
	cudaEventElapsedTime(&elapsedTime, start, stop);
	printf("Execution time: %f ms\n", elapsedTime);

	cudaMemcpy(buffer, d_buffer, rows*columns*sizeof(float), cudaMemcpyDeviceToHost);

	cudaFree(d_wref);
	cudaFree(d_dref);
	cudaFree(d_buffer);
	free(wref);
	free(dref);
	free(buffer);
	return 0;
}





You are only working on a very small dataset of 80,000 or so elements. Most of your slow-down is probably just the overhead of setting up and calling the GPU kernel. Try using a very large dataset, 10 million or so elements, and see if you get a better performance result (GPU vs. CPU).
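For scale, a quick back-of-the-envelope check of the dataset size (helper names are mine, for illustration; this assumes 4-byte floats, as in the code above):

```c
#include <stddef.h>

/* Dimensions of one frame, taken from the program above. */
static const long ROWS = 320;
static const long COLS = 256;

/* Number of elements in one 320x256 frame. */
long frame_elements(void) { return ROWS * COLS; }

/* Size in bytes of one frame of floats. */
long frame_bytes(void) { return frame_elements() * (long)sizeof(float); }
```

That is 81,920 elements, about 320 KB — small enough that the fixed cost of a kernel launch is a significant fraction of the 0.033 ms measured.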


Increasing the data size does give a better GPU/CPU ratio, but the size of the dataset is fixed. We’re building software for real-time analysis using data from a hyperspectral camera. The size is fixed at 320 x 256, so it’s for that size that I would like to increase the performance. Our current CPU implementation does the job quite well, but I’m investigating whether it would be worth the effort of porting the code to a GPU.

You could possibly amortize the overhead by batching many images together and processing them at once. Unfortunately, that will increase overall processing latency which might conflict with your real-time requirements. It would also be interesting to modify your program below to call CalculateWD several times between your two cudaEvents and see if the execution time rises proportionally or not. There is often an additional delay incurred on the first execution of a kernel if the driver needs to JIT recompile the PTX and/or transfer the kernel binary to the device. Since your real application will call the same kernel many times, you might be getting a biased timing here by calling it only once.

Also, although you do not time it here, the actual bottleneck in applying this code to a real problem will be cudaMemcpy(). With pinned host memory, you will only achieve at best 6.5 GB/sec on large transfers. That means if you transferred many 320x256 images at once, you would spend at least 0.05 ms per image, or 0.15 ms total for your two input images plus reading the one output image back to the host. For smaller transfers, the per-transfer overhead will make the performance worse.
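The 0.05 ms figure can be checked with simple arithmetic (the helper name is mine; 6.5 GB/s is treated as 6.5e9 bytes/s):

```c
/* Estimated transfer time in milliseconds for `bytes` bytes at
   `rate_gb` GB/s (1 GB taken as 1e9 bytes). Illustrative helper only;
   it ignores per-transfer latency, which dominates small copies. */
double transfer_ms(long bytes, double rate_gb) {
    return (double)bytes / (rate_gb * 1e9) * 1e3;
}
```

One 320x256 float image is 327,680 bytes, so transfer_ms(327680, 6.5) is about 0.050 ms, and three such transfers (two inputs, one output) come to roughly 0.15 ms, matching the estimate above.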

Unless there is substantially more calculation to be done per image on the device, I don’t think CUDA will beat your current implementation once you account for PCI-Express overhead.

Edit: Of course, with CUDA you can overlap data transfer and kernel execution, but given the small amount of calculation being done here, that just means the kernel execution will be “free”, and the entire runtime will be dominated by the data transfer.

Thank you for your thoughts! The transfer rate is something we are aware of, and it will be acceptable if we perform all calculations on the GPU, meaning the image would only be copied once. The output result is generally reduced to 320x(1 to 5), so that delay would also be acceptable. The camera sensor has a refresh rate of 500 Hz, giving us a full 2 ms to transfer the frame to the GPU, perform our calculations, and transfer the result back.
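As a sanity check on that budget (numbers from the posts above; helper names are mine):

```c
/* Time available per frame for a sensor running at `hz` frames/sec. */
double frame_budget_ms(double hz) { return 1000.0 / hz; }

/* Time left for computation after subtracting the estimated PCIe
   transfer time per frame (e.g. the ~0.15 ms estimated earlier). */
double compute_budget_ms(double hz, double xfer_ms) {
    return frame_budget_ms(hz) - xfer_ms;
}
```

At 500 Hz the frame budget is 2 ms; subtracting the ~0.15 ms of transfer leaves roughly 1.85 ms per frame for kernels, so the transfer cost fits comfortably if all calculations stay on the GPU.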