CUDA trouble

Hello, I wrote a simple benchmark comparing CUDA with the CPU, but:
CUDA time: 4539 ms for 1 million cycles
CPU time: 1580 ms
Why is CUDA slower?
Maybe it is incorrect to run the kernel multiple times?

__global__ void
kernel(float *ex, float *v, float *hi, float *low, float *cl, int m)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < m)
    {
        int j = ex[i];   // extremum index, stored as float and truncated to int
        v[i] = low[j];
    }
}

int main()
{
    int numBars = loadBars();
    int numExtr = defineExtremums(numBars);

    cout << "GPU kernel start:" << "\n";
    int N = numBars;
    int M = numExtr;
    float *ex, *z, *h, *l, *c;
    float *d_x, *d_xx, *d_y, *d_z, *d_w;

    int size_m = M * sizeof(float);
    int size_n = N * sizeof(float);

    ex = (float*)malloc(size_m);
    z  = (float*)malloc(size_m);
    h  = (float*)malloc(size_n);
    l  = (float*)malloc(size_n);
    c  = (float*)malloc(size_n);

    cudaMalloc(&d_x, size_m);
    cudaMalloc(&d_xx, size_m);
    cudaMalloc(&d_y, size_n);
    cudaMalloc(&d_z, size_n);
    cudaMalloc(&d_w, size_n);

    for (int i = 0; i < M; i++)
    {
        ex[i] = extr[i];
    }

    for (int i = 0; i < N; i++)
    {
        h[i] = high[i];
        l[i] = low[i];
        c[i] = close[i];
    }

    cudaMemcpy(d_x, ex, size_m, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h, size_n, cudaMemcpyHostToDevice);
    cudaMemcpy(d_z, l, size_n, cudaMemcpyHostToDevice);
    cudaMemcpy(d_w, c, size_n, cudaMemcpyHostToDevice);

    dim3 threads = dim3(1024, 1);
    dim3 blocks = dim3((M + threads.x - 1) / threads.x, 1); // round up so all M elements are covered

    int numbench = 100000;
    clock_t t1 = clock();
    for (int u = 0; u < numbench; u++)
    {
        // launch configuration is <<<grid, block>>>, so blocks come first
        kernel<<<blocks, threads>>>(d_x, d_xx, d_y, d_z, d_w, M);
        cudaMemcpy(z, d_xx, size_m, cudaMemcpyDeviceToHost);
    }
    clock_t t2 = clock();
    clock_t t3 = t2 - t1;
    cout << "GPU: " << t3 << "\n";

    clock_t t4 = clock();
    for (int u = 0; u < numbench; u++)
    {
        for (int i = 0; i < M; i++)
        {
            int j = ex[i];
            z[i] = low[j];
        }
    }
    clock_t t5 = clock();
    clock_t t6 = t5 - t4;
    cout << "CPU: " << t6 << "\n";

    cudaFree(d_x);
    cudaFree(d_xx);
    cudaFree(d_y);
    cudaFree(d_z);
    cudaFree(d_w);

    free(ex);
    free(z);
    free(h);
    free(l);
    free(c);
}

Hello,

I think the main problem in your code is the line cudaMemcpy(z, d_xx, size_m, cudaMemcpyDeviceToHost); inside the benchmark loop. Your method of measuring time can also introduce error, though taking the time right after a cudaMemcpy call is fine. For CUDA to be faster you need to get rid of the copy calls, make as few of them as possible, or have the number of calculations much larger than the amount of data you copy.

Are you sure you are getting the correct output values from the GPU?
Please make a habit of calling cudaThreadSynchronize() after any kernel launch.
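As a sketch of that habit (note that cudaThreadSynchronize() was later deprecated in favor of cudaDeviceSynchronize(); both return an error code you can check, and this fragment assumes the blocks/threads variables from the post):

```cuda
// Launch, then synchronize and check for errors before trusting the output.
kernel<<<blocks, threads>>>(d_x, d_xx, d_y, d_z, d_w, M);

cudaError_t err = cudaGetLastError();   // catches invalid launch configurations
if (err == cudaSuccess)
    err = cudaThreadSynchronize();      // waits; surfaces errors raised during the run
if (err != cudaSuccess)
    cout << "CUDA error: " << cudaGetErrorString(err) << "\n";
```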

The cudaMemcpy function is blocking, which means control does not return to the host until the copy has finished. It is not the recommended way to time GPU code, but it gives good enough results, since by the time it finishes all GPU calculations must be finished. If you want to see what's wrong with your code, check the CUDA Best Practices document. Avoiding repetitive copying between the host and the GPU is one of the most important things you can do to maximize performance.

Instead of:

for (int u = 0; u < numbench; u++)
{
    kernel<<<blocks, threads>>>(d_x, d_xx, d_y, d_z, d_w, M);
    cudaMemcpy(z, d_xx, size_m, cudaMemcpyDeviceToHost);
}

try

for (int u = 0; u < numbench; u++)
{
    kernel<<<blocks, threads>>>(d_x, d_xx, d_y, d_z, d_w, M);
}

cudaMemcpy(z, d_xx, size_m, cudaMemcpyDeviceToHost);

If your particular algorithm requires frequent data transfers between the host and the GPU, you should try to overlap the transfers with the calculations to get the maximum performance.
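The usual pattern for that overlap is pinned host memory plus cudaMemcpyAsync on multiple streams. A hedged sketch, where process, grid, block, d_in, d_out, chunk, and nChunks are illustrative names rather than anything from the post:

```cuda
float *h_buf;
cudaMallocHost(&h_buf, nChunks * chunk * sizeof(float)); // pinned host memory: required for async copies to overlap

cudaStream_t s[2];
for (int i = 0; i < 2; ++i)
    cudaStreamCreate(&s[i]);

for (int i = 0; i < nChunks; ++i)
{
    cudaStream_t st = s[i % 2];
    // Copy-in, compute, and copy-out are queued on the same stream, so they
    // stay ordered per chunk, while work on the other stream can overlap.
    cudaMemcpyAsync(d_in + i * chunk, h_buf + i * chunk,
                    chunk * sizeof(float), cudaMemcpyHostToDevice, st);
    process<<<grid, block, 0, st>>>(d_in + i * chunk, d_out + i * chunk, chunk);
    cudaMemcpyAsync(h_buf + i * chunk, d_out + i * chunk,
                    chunk * sizeof(float), cudaMemcpyDeviceToHost, st);
}
cudaThreadSynchronize(); // wait for all streams (cudaDeviceSynchronize on newer toolkits)
```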