CUDA program taking more time than the CPU version

Hello All,

I am Deepak and I am very new to CUDA. I started with the first program given in the book CUDA by Example, and I was shocked to see strange results when I tried to measure the execution time.

If I compile and run the program with g++, it takes around 20 ms, but if I use CUDA, it takes about 90 ms to run.

Below is my code for both versions.

#include <stdio.h>
#include <time.h>
#include <math.h>
#include <stdlib.h>

#define N   10

void add( float *a, float *b, float *c ) {
    int tid = 0;
    while (tid < N) {
        c[tid] = (a[tid]/(a[tid]*a[tid])) + (b[tid]/(b[tid]*b[tid]));
        tid += 1;
    }
}

int main( void ) {
    float elapsed;
    float a[N], b[N], c[N];
    int i;
    clock_t timerStart, timerStop;

    for (i=0; i<N; i++) {
        a[i] = (float) (i)/(i+1);
        b[i] = (float) (i)/(i+1);
        c[i] = 0;
    }

    timerStart = clock();
    add( a, b, c );
    timerStop = clock();

    elapsed = (float) ( timerStop - timerStart ) / CLOCKS_PER_SEC;
    printf( "Time elapsed:  %f ", elapsed);

    for (i=0; i<N; i++)
        printf(" %f \n", c[i]);

    return 0;
}

My CUDA version is:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <cuda.h>
#include "cutil.h"

#define N 10

__global__ void add( float *a, float *b, float *c ) {
    int tid = blockIdx.x;   // TID is the block ID
    if (tid < N) {
        c[tid] = (a[tid]/(a[tid]*a[tid])) + (b[tid]/(b[tid]*b[tid]));
    }
}

int main( void ) {
    float a[N], b[N], c[N];
    float *temp_a, *temp_b, *temp_c;
    long i;
    float elapsed_time_cpu_gpu, elapsed_time_add, elapsed_time_gpu_cpu;
    cudaEvent_t start, stop, startadd, stopadd, startback, stopback;

    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventCreate(&startadd);
    cudaEventCreate(&stopadd);
    cudaEventCreate(&startback);
    cudaEventCreate(&stopback);

    cudaMalloc((void**)&temp_a, N*sizeof(int));
    cudaMalloc((void**)&temp_b, N*sizeof(int));
    cudaMalloc((void**)&temp_c, N*sizeof(int));

    for (i=0; i<N; i++) {
        a[i] = (float) (i)/(i+1);
        b[i] = (float) (i)/(i+1);
        c[i] = 0;
    }

    cudaEventRecord(start, 0);
    cudaMemcpy(temp_a, a, N*sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(temp_b, b, N*sizeof(int), cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&elapsed_time_cpu_gpu, start, stop);
    printf("Time taken CUDA : %f \n", elapsed_time_cpu_gpu);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    cudaEventRecord(startadd, 0);
    add<<<N,N>>>(temp_a, temp_b, temp_c);
    cudaEventRecord(stopadd, 0);
    cudaEventSynchronize(stopadd);
    cudaEventElapsedTime(&elapsed_time_add, startadd, stopadd);
    printf("Time taken CUDA : %f \n", elapsed_time_add);
    cudaEventDestroy(startadd);
    cudaEventDestroy(stopadd);

    cudaEventRecord(startback, 0);
    cudaMemcpy(c, temp_c, N*sizeof(int), cudaMemcpyDeviceToHost);
    cudaEventRecord(stopback, 0);
    cudaEventSynchronize(stopback);
    cudaEventElapsedTime(&elapsed_time_gpu_cpu, startback, stopback);
    cudaEventDestroy(startback);
    cudaEventDestroy(stopback);

/*  for (i=0; i<N; i++) {
        printf ("%f %f %f\n", a[i], b[i], c[i] );
    }
*/

    cudaFree(temp_a);
    cudaFree(temp_b);
    cudaFree(temp_c);
    return 0;
}

When I measure the timings, processing of the add function in CUDA takes 35 ms, while processing of add in the normal C program takes 20 ms.

Also, when I measure the individual timings of copying from host to device and from device to host, each takes around 28 ms, making a total of around 85 ms for executing the complete program.

Maybe I am making a mistake, but I am not able to figure it out as I am very new to CUDA.

Please help me in this regard.

Thanks

Deepak

Hi Deepak,

Your program has a couple of errors. First, for the kernel call, that should be “add<<<N,1>>>”, not “add<<<N,N>>>” (the book shows “add<<<N,1>>>”). With “N,N”, that would call the kernel with 10 blocks of 10 threads each. But there is also a problem in that the kernel is never called because of your event code. (I uncommented the code for the printf of a, b, and c, and the values for c are not set. I then put printfs in the kernel, compiled with sm_20, ran it, and never saw a kernel being called.) So, I removed your event code and replaced it with a cudaThreadSynchronize. Fixing those two problems helps.
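
A quick way to verify whether the kernel is actually being launched is to check the runtime’s error status right after the launch. A minimal sketch, using the names from the code above (cudaThreadSynchronize was the current API at the time; later toolkits replace it with cudaDeviceSynchronize):

add<<<N,1>>>(temp_a, temp_b, temp_c);
cudaError_t err = cudaGetLastError();      // catches launch/configuration errors
if (err != cudaSuccess)
    printf("Kernel launch failed: %s\n", cudaGetErrorString(err));
err = cudaThreadSynchronize();             // waits for the kernel; catches execution errors
if (err != cudaSuccess)
    printf("Kernel execution failed: %s\n", cudaGetErrorString(err));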

But, I probably wouldn’t set up the execution configuration like they do in the book. Although scalable, the call “add<<<N,1>>>” creates a grid with N blocks, with 1 thread per block. This doesn’t seem right, because I think this will create one thread in a warp, and the GPU will serialize the warps. (The GPU may actually somehow combine the blocks into a warp, but I don’t know if it really does this.)

So, rewriting your code, I changed it to use 512 threads per block. Since automatic allocation of large arrays blows the stack, I also changed the arrays to mallocs. With all these changes, I do get a speedup for N > 10,000, but nothing at lower values. For N = 10, the elapsed time for both function calls is “0”, too small to measure. For N = 10,000,000, the output is: “Starting CPU test … Time elapsed: 0.255000, Starting GPU test … Time elapsed: 0.086000.”

Ken D.

#include <stdio.h>
#include <time.h>
#include <math.h>
#include <stdlib.h>

#define N   10000000

void addC( float *a, float *b, float *c ) {
    int tid = 0;
    while (tid < N) {
        c[tid] = (a[tid]/(a[tid]*a[tid])) + (b[tid]/(b[tid]*b[tid]));
        tid += 1;
    }
}

void t1()
{
    printf("Starting CPU test ...\n");
    float elapsed;
    float *a, *b, *c;
    a = (float*)malloc(N * sizeof(float));
    b = (float*)malloc(N * sizeof(float));
    c = (float*)malloc(N * sizeof(float));
    int i;
    clock_t timerStart, timerStop;

    for (i=0; i<N; i++) {
        a[i] = (float) (i)/(i+1);
        b[i] = (float) (i)/(i+1);
        c[i] = 0;
    }

    timerStart = clock();
    addC( a, b, c );
    timerStop = clock();

    elapsed = (float) ( timerStop - timerStart ) / CLOCKS_PER_SEC;
    printf( "Time elapsed:  %f\n", elapsed);

//  for (i=0; i<N; i++)
//      printf(" %f \n", c[i]);

    free(a);
    free(b);
    free(c);
}

__global__ void addG( float *a, float *b, float *c ) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < N) {
        c[tid] = (a[tid]/(a[tid]*a[tid])) + (b[tid]/(b[tid]*b[tid]));
    }
}

void t2()
{
    printf("Starting GPU test ...\n");
    float elapsed;
    clock_t timerStart, timerStop;
    float *a, *b, *c;
    a = (float*)malloc(N * sizeof(float));
    b = (float*)malloc(N * sizeof(float));
    c = (float*)malloc(N * sizeof(float));

    float *temp_a, *temp_b, *temp_c;
    cudaMalloc((void**)&temp_a, N*sizeof(float));
    cudaMalloc((void**)&temp_b, N*sizeof(float));
    cudaMalloc((void**)&temp_c, N*sizeof(float));

    for (int i=0; i<N; i++) {
        a[i] = (float) (i)/(i+1);
        b[i] = (float) (i)/(i+1);
        c[i] = 0;
    }

    timerStart = clock();
    cudaMemcpy(temp_a, a, N*sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(temp_b, b, N*sizeof(float), cudaMemcpyHostToDevice);

    addG<<<N/512 + 1, 512>>>(temp_a, temp_b, temp_c);
    cudaThreadSynchronize();
    int rv1 = cudaGetLastError();
    if (rv1)
    {
        printf("Fail1 %d\n", rv1);
        return;
    }

    cudaMemcpy(c, temp_c, N*sizeof(float), cudaMemcpyDeviceToHost);
    timerStop = clock();

    elapsed = (float) ( timerStop - timerStart ) / CLOCKS_PER_SEC;
    printf( "Time elapsed:  %f\n", elapsed);

/*  for (int i=0; i<N; i++) {
        printf ("%f %f %f\n", a[i], b[i], c[i] );
    }
*/

    cudaFree(temp_a);
    cudaFree(temp_b);
    cudaFree(temp_c);
    free(a);
    free(b);
    free(c);
}

int main()
{
    t1();
    t2();
    return 0;
}
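
One small note on the launch configuration above: N/512 + 1 launches one extra, empty block whenever N is an exact multiple of 512. The guard in the kernel makes that harmless, but the usual rounded-up division avoids it entirely. A minimal variation (a sketch, not from the original post):

const int threadsPerBlock = 512;
const int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;   // ceiling division
addG<<<blocksPerGrid, threadsPerBlock>>>(temp_a, temp_b, temp_c);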

It does not. One thread per block underutilizes the multiprocessor by a factor of 32.
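
The warp size can be confirmed at run time by querying the device properties. A small illustrative sketch (not from the original posts):

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);     // properties of device 0
    // Threads are scheduled in groups of prop.warpSize (32 on CUDA hardware),
    // so a block with a single thread still occupies an entire warp.
    printf("Warp size: %d\n", prop.warpSize);
    return 0;
}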

Hello kaberdude

Thanks for the pains you have taken for me. I am new to CUDA, was losing interest in it, and was thinking of moving to TBB. But your reply has surely helped me gain more insight into CUDA and the concept of threads.

I think I need to read more about threads and other programming skills in CUDA.

Can you please suggest some sources where I can read and learn more about this? I am already reading CUDA by Example.

Thanks

Deepak

Hi Deepak, There are not a lot of books on CUDA. If you go to Amazon and look for CUDA, you get CUDA by Example (ISBN 0131387685) and Programming Massively Parallel Processors (ISBN 0123814723). CUDA by Example is OK for the beginner, with a number of examples. The Programming Massively Parallel Processors book seems to cover more topics, but with fewer examples. Neither seems to go into basic parallel programming topics, like sorting and searching. So, I usually go to conference proceedings, workshops, and journal articles to get more information on how CUDA is used to solve a particular problem (see scholar.google.com). Unfortunately, those sources can have a lot of typos. Also, although it is very painful to read grungy code, you can go through the NVIDIA GPU Computing SDK examples. Ken

Hello Ken,

Thanks for the advice.

Also, I have one question which is boggling me: why doesn’t CUDA give any speedup for smaller N, i.e. N < 10,000?

I used <<<N,N>>> intentionally, to see the difference in results compared with <<<N,1>>>, but there was not much of a difference.

My future plan is to work on images and modify existing algorithms using CUDA so as to process results in a shorter time.

Deepak

You should read the CUDA C Programming Guide for deeper coverage of the various features. The Best Practices Guide goes over the main optimizations. Both are included with the CUDA toolkit; just check the document directory. Also, you should check out the various CUDA webinars and tutorials.

Thanks Paulius for the info.

These are indeed very good sources.

Thanks again.

Deepak
