Hi all,
This is our first program in CUDA. We are trying to accelerate a procedure that linearly mixes 42 arrays of size 1024x1024 each. However, we obtain a disappointing acceleration factor: the CPU implementation takes 2.7 sec, the GPU one 0.7 sec. We expected a much higher speedup.
Could you please advise what went wrong?
Our system is Windows, the CPU is an Intel at 2.2 GHz, the GPU is a GeForce 840M.
Thank you!
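For reference, the operation we want to accelerate is the complex linear combination (our notation: the h_k are the 42 fixed arrays, the c_k the mixing coefficients):

```latex
w(i,j) \;=\; \sum_{k=0}^{41} c_k \, h_k(i,j), \qquad c_k,\; h_k(i,j) \in \mathbb{C},
```

which the code below evaluates with the real and imaginary parts stored in separate arrays.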
// nant is read inside the kernel, so it must be const here:
// a plain (non-const) host global is not accessible from device code
const int nx = 1024, ny = 1024, nant = 42;
// procedure to accelerate
__global__ void gpu_mixr(double* hr, double* hi, double* cr, double* ci,
                         double* wr, double* wi, int nx, int ny,
                         int roundx, int roundy){
    int i = roundx + blockIdx.x;
    int j = roundy + threadIdx.x;
    wr[i + nx*j] = wi[i + nx*j] = 0;
    for (int k = 0; k < nant; k++){
        wr[i + nx*j] += hr[k + nant*(i + nx*j)] * cr[k] - hi[k + nant*(i + nx*j)] * ci[k];
        wi[i + nx*j] += hr[k + nant*(i + nx*j)] * ci[k] + hi[k + nant*(i + nx*j)] * cr[k];
    }
}
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
typedef thrust::host_vector<double> hvd;
typedef thrust::device_vector<double> dvd;
int main(int argc, char **argv){
    ....
    hvd hrh(nant*nx*ny), hih(nant*nx*ny);
    // init h-vectors
    for (int j = 0; j < ny; j++){
        for (int i = 0; i < nx; i++)
            for (int k = 0; k < nant; k++){
                hrh[k + nant*(i + nx*j)] = ...; hih[k + nant*(i + nx*j)] = ...;
            }
    }
    dvd hr = hrh; dvd hi = hih;
    dvd cr(nant), ci(nant);
    // init c-vectors
    for (int k = 0; k < nant; k++){
        cr[k] = ...; ci[k] = ...;
    }
    hvd crh = cr; hvd cih = ci;
    dvd wr(nx*ny), wi(nx*ny);
    // w-vectors for output
    cudaError_t err = cudaSuccess;
    double* hr_array = thrust::raw_pointer_cast(&hr[0]);
    double* hi_array = thrust::raw_pointer_cast(&hi[0]);
    double* cr_array = thrust::raw_pointer_cast(&cr[0]);
    double* ci_array = thrust::raw_pointer_cast(&ci[0]);
    double* wr_array = thrust::raw_pointer_cast(&wr[0]);
    double* wi_array = thrust::raw_pointer_cast(&wi[0]);
    fprintf(stderr, "start gpu computing...\n");
    gettimeofday(&tv0, &tz);
    // split the computation into 4 parts, otherwise it does not fit in our GPU memory
    gpu_mixr<<<nx / 2, ny / 2>>>(hr_array, hi_array, cr_array, ci_array, wr_array, wi_array, nx, ny, 0, 0);
    gpu_mixr<<<nx / 2, ny / 2>>>(hr_array, hi_array, cr_array, ci_array, wr_array, wi_array, nx, ny, 0, ny / 2);
    gpu_mixr<<<nx / 2, ny / 2>>>(hr_array, hi_array, cr_array, ci_array, wr_array, wi_array, nx, ny, nx / 2, 0);
    gpu_mixr<<<nx / 2, ny / 2>>>(hr_array, hi_array, cr_array, ci_array, wr_array, wi_array, nx, ny, nx / 2, ny / 2);
    cudaDeviceSynchronize();
    gettimeofday(&tv1, &tz);
    double dt = tv1.tv_sec - tv0.tv_sec + (tv1.tv_usec - tv0.tv_usec)*1e-6;
    fprintf(stderr, "gpu time(mix/sec)=%g\n", dt);
    // get result
    hvd wrh = wr; hvd wih = wi;
    ....
}
EDIT: we have made the following change: nx = 512, ny = 512. This allows us to keep the whole problem in GPU memory and to call gpu_mixr only once. As expected, the timing is reduced by a factor of 4.
Next, we inserted a dummy loop into the procedure:
// procedure to accelerate
__global__ void gpu_mixr(double* hr, double* hi, double* cr, double* ci,
                         double* wr, double* wi, int nx, int ny,
                         int roundx, int roundy){
    for (int dum = 0; dum < 100; dum++){
        int i = roundx + blockIdx.x;
        int j = roundy + threadIdx.x;
        wr[i + nx*j] = wi[i + nx*j] = 0;
        for (int k = 0; k < nant; k++){
            wr[i + nx*j] += hr[k + nant*(i + nx*j)] * cr[k] - hi[k + nant*(i + nx*j)] * ci[k];
            wi[i + nx*j] += hr[k + nant*(i + nx*j)] * ci[k] + hi[k + nant*(i + nx*j)] * cr[k];
        }
    }
}
We expect that only the GPU computation time is multiplied by 100, not the time for copying the data there and back.
We obtained the following numbers: 0.16 sec without the dummy loop and 1.18 sec with it. From this we estimate the pure computation at (1.18 - 0.16)/99 ≈ 0.01 sec and the data transfer at about 0.15 sec. The CPU computation takes 0.68 sec, which gives an acceleration factor of about 4.3 for a single computation and about 52 for repeated computations with “preloaded” data.
Is this estimation correct?
One refinement: our main purpose is to accelerate the linear mixing of 42 arrays, where the contents of the arrays do not change while the mixing coefficients change in real time. In this test program we are just trying to estimate the acceleration we can achieve.
Thanks for helping!