CUDA program taking more time than the CPU version

Hello All,

I am Deepak and I am very new to CUDA. I started with the first program given in the book CUDA by Example, and I was shocked to see strange results when I tried to measure the execution time.

If I compile and run the program with g++, it takes around 20 ms, but if I use CUDA, it takes about 90 ms to run.

Below is my code for both versions.

#include <stdio.h>
#include <time.h>
#include <math.h>
#include <stdlib.h>

#define N   10

void add( float *a, float *b, float *c ) {
    int tid = 0;
    while (tid < N) {
        c[tid] = (a[tid]/(a[tid]*a[tid])) + (b[tid]/(b[tid]*b[tid]));
        tid += 1;
    }
}

int main( void ) {
    float elapsed;
    float a[N], b[N], c[N];
    int i;
    clock_t timerStart, timerStop;

    for (i=0; i<N; i++) {
        a[i] = (float) (i)/(i+1);
        b[i] = (float) (i)/(i+1);
        c[i] = 0;
    }

    timerStart = clock();
    add( a, b, c );
    timerStop = clock();

    elapsed = (float) ( timerStop - timerStart ) / CLOCKS_PER_SEC;
    printf( "Time elapsed:  %f ", elapsed);

    for (i=0; i<N; i++)
        printf(" %f \n", c[i]);

    return 0;
}

My CUDA version is:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <cuda.h>
#include "cutil.h"

#define N 10

__global__ void add( float *a, float *b, float *c ) {
    int tid = blockIdx.x;   // TID is the block ID
    if (tid < N) {
        c[tid] = (a[tid]/(a[tid]*a[tid])) + (b[tid]/(b[tid]*b[tid]));
    }
}

int main( void ) {
    float a[N], b[N], c[N];
    float *temp_a, *temp_b, *temp_c;
    long i;
    float elapsed_time_cpu_gpu, elapsed_time_add, elapsed_time_gpu_cpu;
    cudaEvent_t start, stop, startadd, stopadd, startback, stopback;

    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventCreate(&startadd);
    cudaEventCreate(&stopadd);
    cudaEventCreate(&startback);
    cudaEventCreate(&stopback);

    cudaMalloc((void**)&temp_a, N*sizeof(int));
    cudaMalloc((void**)&temp_b, N*sizeof(int));
    cudaMalloc((void**)&temp_c, N*sizeof(int));

    for (i=0; i<N; i++) {
        a[i] = (float) (i)/(i+1);
        b[i] = (float) (i)/(i+1);
        c[i] = 0;
    }

    cudaEventRecord(start, 0);
    cudaMemcpy(temp_a, a, N*sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(temp_b, b, N*sizeof(int), cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&elapsed_time_cpu_gpu, start, stop);
    printf("Time taken CUDA : %f \n", elapsed_time_cpu_gpu);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    cudaEventRecord(startadd, 0);
    add<<<N,N>>>(temp_a, temp_b, temp_c);
    cudaEventRecord(stopadd, 0);
    cudaEventSynchronize(stopadd);
    cudaEventElapsedTime(&elapsed_time_add, startadd, stopadd);
    printf("Time taken CUDA : %f \n", elapsed_time_add);
    cudaEventDestroy(startadd);
    cudaEventDestroy(stopadd);

    cudaEventRecord(startback, 0);
    cudaMemcpy(c, temp_c, N*sizeof(int), cudaMemcpyDeviceToHost);
    cudaEventRecord(stopback, 0);
    cudaEventSynchronize(stopback);
    cudaEventElapsedTime(&elapsed_time_gpu_cpu, startback, stopback);
    cudaEventDestroy(startback);
    cudaEventDestroy(stopback);

/*  for (i=0; i<N; i++) {
        printf ("%f %f %f\n", a[i], b[i], c[i] );
    }
*/

    cudaFree(temp_a);
    cudaFree(temp_b);
    cudaFree(temp_c);
    return 0;
}

When I measure the timings, processing of the add function in CUDA takes 35 ms, while processing of add in the normal C program takes 20 ms.

Also, when I measure the individual timings of copying from host to device and from device to host, each takes around 28 ms, making a total of around 85 ms for executing the complete program.

Maybe I am making a mistake, but I am not able to figure it out as I am very new to CUDA.

Please help me in this regard.

Thanks

Deepak

Hi Deepak,

Your program has a couple of errors. First, for the kernel call, that should be “add<<<N,1>>>”, not “add<<<N,N>>>” (the book shows “add<<<N,1>>>”). With “N,N”, that would call the kernel with 10 blocks of 10 threads each. But there is also a problem in that the kernel is never called because of your event code. (I uncommented the code for the printf of a, b, and c, and the values for c are not set. I then put printfs in the kernel, compiled with sm_20, ran it, and never saw a kernel being called.) So, I removed your event code and replaced it with a cudaThreadSynchronize. Fixing those two problems helps.
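
A quick way to verify whether the kernel is actually being launched is to check the runtime’s error status right after the launch. A minimal sketch, using the names from the code above (cudaThreadSynchronize was the current API at the time; later toolkits replace it with cudaDeviceSynchronize):

add<<<N,1>>>(temp_a, temp_b, temp_c);
cudaError_t err = cudaGetLastError();      // catches launch/configuration errors
if (err != cudaSuccess)
    printf("Kernel launch failed: %s\n", cudaGetErrorString(err));
err = cudaThreadSynchronize();             // waits for the kernel; catches execution errors
if (err != cudaSuccess)
    printf("Kernel execution failed: %s\n", cudaGetErrorString(err));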

But, I probably wouldn’t set up the execution configuration like they do in the book. Although scalable, the call “add<<<N,1>>>” creates a grid with N blocks, with 1 thread per block. This doesn’t seem right, because I think this will create one thread in a warp, and the GPU will serialize the warps. (The GPU may actually somehow combine the blocks into a warp, but I don’t know if it really does this.)

So, rewriting your code, I changed it to use 512 threads per block. Since automatic allocation of large arrays blows the stack, I also changed the arrays to mallocs. With all these changes, I do get a speedup for N > 10,000, but nothing at lower values. For N = 10, the elapsed time for both function calls is “0”, too small to measure. For N = 10,000,000, the output is: “Starting CPU test … Time elapsed: 0.255000, Starting GPU test … Time elapsed: 0.086000.”

Ken D.

#include <stdio.h>
#include <time.h>
#include <math.h>
#include <stdlib.h>

#define N   10000000

void addC( float *a, float *b, float *c ) {
    int tid = 0;
    while (tid < N) {
        c[tid] = (a[tid]/(a[tid]*a[tid])) + (b[tid]/(b[tid]*b[tid]));
        tid += 1;
    }
}

void t1()
{
    printf("Starting CPU test ...\n");
    float elapsed;
    float *a, *b, *c;
    a = (float*)malloc(N * sizeof(float));
    b = (float*)malloc(N * sizeof(float));
    c = (float*)malloc(N * sizeof(float));
    int i;
    clock_t timerStart, timerStop;

    for (i=0; i<N; i++) {
        a[i] = (float) (i)/(i+1);
        b[i] = (float) (i)/(i+1);
        c[i] = 0;
    }

    timerStart = clock();
    addC( a, b, c );
    timerStop = clock();

    elapsed = (float) ( timerStop - timerStart ) / CLOCKS_PER_SEC;
    printf( "Time elapsed:  %f\n", elapsed);

//  for (i=0; i<N; i++)
//      printf(" %f \n", c[i]);

    free(a);
    free(b);
    free(c);
}

__global__ void addG( float *a, float *b, float *c ) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < N) {
        c[tid] = (a[tid]/(a[tid]*a[tid])) + (b[tid]/(b[tid]*b[tid]));
    }
}

void t2()
{
    printf("Starting GPU test ...\n");
    float elapsed;
    clock_t timerStart, timerStop;
    float *a, *b, *c;
    a = (float*)malloc(N * sizeof(float));
    b = (float*)malloc(N * sizeof(float));
    c = (float*)malloc(N * sizeof(float));

    float *temp_a, *temp_b, *temp_c;
    cudaMalloc((void**)&temp_a, N*sizeof(float));
    cudaMalloc((void**)&temp_b, N*sizeof(float));
    cudaMalloc((void**)&temp_c, N*sizeof(float));

    for (int i=0; i<N; i++) {
        a[i] = (float) (i)/(i+1);
        b[i] = (float) (i)/(i+1);
        c[i] = 0;
    }

    timerStart = clock();
    cudaMemcpy(temp_a, a, N*sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(temp_b, b, N*sizeof(float), cudaMemcpyHostToDevice);

    addG<<<N/512 + 1, 512>>>(temp_a, temp_b, temp_c);
    cudaThreadSynchronize();
    int rv1 = cudaGetLastError();
    if (rv1)
    {
        printf("Fail1 %d\n", rv1);
        return;
    }

    cudaMemcpy(c, temp_c, N*sizeof(float), cudaMemcpyDeviceToHost);
    timerStop = clock();

    elapsed = (float) ( timerStop - timerStart ) / CLOCKS_PER_SEC;
    printf( "Time elapsed:  %f\n", elapsed);

/*  for (int i=0; i<N; i++) {
        printf ("%f %f %f\n", a[i], b[i], c[i] );
    }
*/

    cudaFree(temp_a);
    cudaFree(temp_b);
    cudaFree(temp_c);
    free(a);
    free(b);
    free(c);
}

int main()
{
    t1();
    t2();
    return 0;
}
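
One small note on the launch configuration above: N/512 + 1 launches one extra, empty block whenever N is an exact multiple of 512. The guard in the kernel makes that harmless, but the usual rounded-up division avoids it entirely. A minimal variation (a sketch, not from the original post):

const int threadsPerBlock = 512;
const int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;   // ceiling division
addG<<<blocksPerGrid, threadsPerBlock>>>(temp_a, temp_b, temp_c);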

It does not. One thread per block underutilizes the multiprocessor by a factor of 32.
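
The warp size can be confirmed at run time by querying the device properties. A small illustrative sketch (not from the original posts):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);     // properties of device 0
    // Threads are scheduled in groups of prop.warpSize (32 on CUDA hardware),
    // so a block with a single thread still occupies an entire warp.
    printf("Warp size: %d\n", prop.warpSize);
    return 0;
}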

Hello kaberdude

Thanks for the pains you have taken for me. I am new to CUDA, was losing interest in it, and was thinking of moving to TBB. But your reply has surely helped me gain more insight into CUDA and the concept of threads.

I think I need to read more about threads and other programming skills in CUDA.

Can you please suggest some sources where I can read and learn more about this? I am already reading CUDA by Example.

Thanks

Deepak

Hi Deepak, There are not a lot of books on CUDA. If you go to Amazon and look for CUDA, you get CUDA by Example (ISBN 0131387685) and Programming Massively Parallel Processors (ISBN 0123814723). CUDA by Example is OK for the beginner, with a number of examples. The Programming Massively Parallel Processors book seems to cover more topics, but with fewer examples. Neither seems to go into basic parallel programming topics, like sorting and searching. So, I usually go to conference proceedings, workshops, and journal articles to get more information on how CUDA is used to solve a particular problem (see scholar.google.com). Unfortunately, those sources can have a lot of typos. Also, although it is very painful to read grungy code, you can go through the NVIDIA GPU Computing SDK examples. Ken

Hello Ken,

Thanks for the advice.

Also, I have one question which is boggling me: why doesn’t CUDA give any speedup for smaller N, i.e. N < 10,000?

I used <<<N,N>>> intentionally, to see the difference in results compared with <<<N,1>>>, but there was not much of a difference.

My future plan is to work on images and modify existing algorithms using CUDA so as to process results in a shorter time.

Deepak

You should read the CUDA C Programming Guide for deeper coverage of the various features. The Best Practices Guide goes over the main optimizations. Both are included with the CUDA toolkit; just check the document directory. Also, you should check out the various CUDA webinars and tutorials.

Thanks Paulius for the info.

These are indeed very good sources.

Thanks again.

Deepak
