Where are all the teraflops?

Hi people,

this is our first CUDA program. We are trying to accelerate a procedure that linearly mixes 42 arrays of 1024x1024 elements each, but the speedup is disappointing: the CPU implementation takes 2.7 s, the GPU takes 0.7 s. We expected a much larger acceleration.

Could you please advise us on what went wrong?

Our system is Windows, the CPU is a 2.2 GHz Intel, and the GPU is a GeForce 840M.

Thank you!

const int nx = 1024, ny = 1024, nant = 42; // const, so that nant is usable inside the kernel (a non-const host global is not visible in device code)

// procedure to accelerate
__global__ void gpu_mixr(double* hr, double* hi, double* cr, double* ci, double* wr, double* wi, int nx, int ny, int roundx, int roundy){
	int i = roundx + blockIdx.x;
	int j = roundy + threadIdx.x;
	wr[i + nx*j] = wi[i + nx*j] = 0;
	for (int k = 0; k < nant; k++){
		wr[i + nx*j] += hr[k + nant*(i + nx*j)] * cr[k] - hi[k + nant*(i + nx*j)] * ci[k];
		wi[i + nx*j] += hr[k + nant*(i + nx*j)] * ci[k] + hi[k + nant*(i + nx*j)] * cr[k];
	}
}

typedef thrust::host_vector<double> hvd;
typedef thrust::device_vector<double> dvd;

int main(int argc, char **argv){
....

	hvd hrh(nant*nx*ny), hih(nant*nx*ny);
// init h-vectors
	for (int j = 0; j < ny; j++)
	{
	   for (int i = 0; i < nx; i++)
	        for (int k = 0; k < nant; k++){
		     hrh[k + nant*(i + nx*j)] = ...; hih[k + nant*(i + nx*j)] = ...;
		}
	}
	dvd hr = hrh; dvd hi = hih;

	dvd cr(nant), ci(nant);
// init c-vectors
	for (int k = 0; k<nant; k++){
		cr[k] = ...; ci[k] = ...;
	}
	hvd crh = cr; hvd cih = ci;

	dvd wr(nx*ny), wi(nx*ny);
// w-vectors for output
	cudaError_t err = cudaSuccess;

	double* hr_array = thrust::raw_pointer_cast(&hr[0]);
	double* hi_array = thrust::raw_pointer_cast(&hi[0]);
	double* cr_array = thrust::raw_pointer_cast(&cr[0]);
	double* ci_array = thrust::raw_pointer_cast(&ci[0]);
	double* wr_array = thrust::raw_pointer_cast(&wr[0]);
	double* wi_array = thrust::raw_pointer_cast(&wi[0]);

	fprintf(stderr, "start gpu computing...\n");
	gettimeofday(&tv0, &tz);

// divide the computation into 4 parts, otherwise it does not fit in our GPU memory
	gpu_mixr<<<nx / 2, ny / 2>>>(hr_array, hi_array, cr_array, ci_array, wr_array, wi_array, nx, ny, 0, 0);
	gpu_mixr<<<nx / 2, ny / 2>>>(hr_array, hi_array, cr_array, ci_array, wr_array, wi_array, nx, ny, 0, ny / 2);
	gpu_mixr<<<nx / 2, ny / 2>>>(hr_array, hi_array, cr_array, ci_array, wr_array, wi_array, nx, ny, nx / 2, 0);
	gpu_mixr<<<nx / 2, ny / 2>>>(hr_array, hi_array, cr_array, ci_array, wr_array, wi_array, nx, ny, nx / 2, ny / 2);
	
	cudaDeviceSynchronize();

	gettimeofday(&tv1, &tz);

	double dt = tv1.tv_sec - tv0.tv_sec + (tv1.tv_usec - tv0.tv_usec)*1e-6;
	fprintf(stderr, "gpu time(mix/sec)=%g\n", dt);

// get result
	hvd wrh = wr; hvd wih = wi;

....

}
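For reference, here is a variant of the kernel we may try next (a sketch only, not benchmarked): it accumulates the sums in registers instead of updating wr/wi in global memory on every loop iteration, and takes nant as a parameter instead of relying on a file-scope variable. The name gpu_mixr2 is ours.

```cuda
// Sketch (untested): same computation as gpu_mixr, but each thread keeps
// its partial sums in registers, so the w-arrays are touched only once
// instead of four global-memory accesses per k-iteration.
__global__ void gpu_mixr2(double* hr, double* hi, double* cr, double* ci,
                          double* wr, double* wi,
                          int nx, int ny, int nant, int roundx, int roundy){
	int i = roundx + blockIdx.x;
	int j = roundy + threadIdx.x;
	int p = i + nx*j;          // output index of this thread
	double sr = 0, si = 0;     // register accumulators
	for (int k = 0; k < nant; k++){
		double a = hr[k + nant*p], b = hi[k + nant*p];
		sr += a * cr[k] - b * ci[k];
		si += a * ci[k] + b * cr[k];
	}
	wr[p] = sr;                // single store per output element
	wi[p] = si;
}
```

It would be launched like the original, with nant added to the argument list, e.g. `gpu_mixr2<<<nx / 2, ny / 2>>>(..., nx, ny, nant, 0, 0);`.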

EDIT: we have made the following change: nx=512, ny=512. This lets the whole problem fit in GPU memory, so gpu_mixr is called only once. As expected, the timing is reduced by a factor of 4.

Further, we inserted a dummy loop into the procedure:

// procedure to accelerate
__global__ void gpu_mixr(double* hr, double* hi, double* cr, double* ci, double* wr, double* wi, int nx, int ny, int roundx, int roundy){
	for (int dum = 0; dum < 100; dum++){ // dummy loop: repeat the computation 100x
		int i = roundx + blockIdx.x;
		int j = roundy + threadIdx.x;
		wr[i + nx*j] = wi[i + nx*j] = 0;
		for (int k = 0; k < nant; k++){
			wr[i + nx*j] += hr[k + nant*(i + nx*j)] * cr[k] - hi[k + nant*(i + nx*j)] * ci[k];
			wi[i + nx*j] += hr[k + nant*(i + nx*j)] * ci[k] + hi[k + nant*(i + nx*j)] * cr[k];
		}
	}
}

We expect that this multiplies only the GPU computation time by 100, not the time spent copying data there and back.

We get the following numbers: 0.16 s without the dummy loop, 1.18 s with it. From this we estimate the pure computation at 1.18/100 ≈ 0.01 s and the data loading at about 0.15 s. The CPU computation takes 0.68 s, which gives a speedup of 4.3 for a single computation and about 52 for repeated computations with "preloaded" data.

Is this estimation correct?

One refinement: our main purpose is to accelerate the linear mixing of 42 arrays whose contents do not change, while the mixing coefficients change in real time. In this test program we are just trying to estimate the speedup we can achieve.
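Concretely, what we have in mind for the real application is something like the following (a sketch only, using the nx = ny = 512 single-launch configuration from the EDIT; `running` and `get_new_coefficients` are placeholders for our real-time machinery):

```cuda
// Sketch of the intended real-time use: the h-arrays are uploaded once,
// and per update only the 42 mixing coefficients move over the bus.
dvd hr = hrh, hi = hih;             // upload once; contents never change
dvd cr(nant), ci(nant);
hvd crh(nant), cih(nant);
dvd wr(nx*ny), wi(nx*ny);

while (running){                    // placeholder real-time loop
	get_new_coefficients(crh, cih); // placeholder: fetch new coefficients
	cr = crh; ci = cih;             // copies only 2*42 doubles to the GPU
	gpu_mixr<<<nx, ny>>>(thrust::raw_pointer_cast(&hr[0]),
	                     thrust::raw_pointer_cast(&hi[0]),
	                     thrust::raw_pointer_cast(&cr[0]),
	                     thrust::raw_pointer_cast(&ci[0]),
	                     thrust::raw_pointer_cast(&wr[0]),
	                     thrust::raw_pointer_cast(&wi[0]),
	                     nx, ny, 0, 0);
	cudaDeviceSynchronize();
	// ... use wr/wi, copying them back to the host only when needed
}
```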

Thanks for helping!

The 840M doesn’t have even a single teraflop of double-precision throughput.

Have you tried RTFM ?

Not to mention that adding arrays together is a poor use of a GPU. Your performance is likely bounded by the PCIe bus.

Such an algorithm is probably limited even more by memory throughput. For example, if your accelerator has a memory speed of 80 GB/s, you can read only 10^10 doubles per second. The same applies to the CPU, so my first guess is that you have simply compared the memory speeds of your CPU and your GPU.

EDIT: you haven’t measured the time required to move the arrays from CPU to GPU, but it will be even larger, ~10 seconds, since PCIe is even slower than CPU memory.