Hello World Matrix Adder program: add two matrices (uses shared memory for optimum performance)

Hi all,

This is my first hello-world program! It's running great on the Tesla GPU platform. I found a problem with N=1000. Can anyone clarify what's wrong with the code?

[codebox]/*
 * Program to compute the sum of two arrays of size N using the GPU
 * @author Sajan Kumar.S
 * @email: nospam+ammasajan[A.T]gmail.com
 */
#include <stdio.h>
#include <stdlib.h>

#define N 20 // 20 elements

__global__ void vecAdd(int *A, int *B, int *C){
    int i = threadIdx.x;

    __shared__ int s_A[N], s_B[N], s_C[N];

    // copy the values to shared mem and attack! :D
    s_A[i] = A[i];
    s_B[i] = B[i];

    __syncthreads();

    // C[i]=A[i]+B[i];
    // s_C[i]=s_A[i]+s_B[i]; // to calculate the sum of elements
    s_C[i] = s_A[i] * s_B[i]; // to calculate the sum of elements

    __syncthreads();

    C[i] = s_C[i];
}

int main(){
    int *h_a = 0, *h_b = 0, *h_c = 0;
    int *d_a = 0, *d_b = 0, *d_c = 0;
    int memSize = N * sizeof(int);

    // allocate host memory of size N
    h_a = (int *)malloc(memSize);
    h_b = (int *)malloc(memSize);
    h_c = (int *)malloc(memSize);

    // allocate GPU memory of size N
    cudaMalloc((void **)&d_a, memSize);
    cudaMalloc((void **)&d_b, memSize);
    cudaMalloc((void **)&d_c, memSize);

    // init values of the A and B arrays (clearing the C array)
    for(int i = 0; i < N; i++){
        h_a[i] = i + 2;
        h_b[i] = i + 3;
        h_c[i] = 0;
    }

    // copy the values to the GPU arrays A and B
    cudaMemcpy(d_a, h_a, memSize, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, memSize, cudaMemcpyHostToDevice);

    // print the A array and B array on the CPU
    printf("\n Array A : \n");
    for(int i = 0; i < N; i++)
        printf("%d\t", h_a[i]);
    printf("\n Array B : \n");
    for(int i = 0; i < N; i++)
        printf("%d\t", h_b[i]);

    printf("\ncalculating Sum : ");
    vecAdd<<<1, N>>>(d_a, d_b, d_c);

    // copy the output C from the GPU back to host memory
    cudaMemcpy(h_c, d_c, memSize, cudaMemcpyDeviceToHost);

    printf("\nSum of Arrays: \n");
    for(int i = 0; i < N; i++)
        printf("%d\t", h_c[i]);

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);

    free(h_a);
    free(h_b);
    free(h_c);

    return 1;
}[/codebox]
    
The CUDA block size limit is 512. This:

    vecAdd<<<1, N>>>(d_a,d_b,d_c);
    

    will not work for N>512. You will have to enlarge the grid for larger vectors.

    Calculate array indices like this when you have more than one block:

    int idx = blockDim.x * blockIdx.x + threadIdx.x;

    or you can do:

    int idx = __umul24( blockDim.x, blockIdx.x) + threadIdx.x;

    as this will be faster!
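
    Putting the two suggestions together, a multi-block version could look something like the sketch below. It is only an illustration of the fix, not code from the original post: the kernel name vecAddLarge, the extra n parameter, the bounds check, and the 256-thread block size are assumptions, and it computes the sum rather than the product.

    [codebox]// Sketch: grid-based version for large N (illustrative names; computes the sum)
__global__ void vecAddLarge(int *A, int *B, int *C, int n){
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    if (idx < n)                       // the last block may have spare threads
        C[idx] = A[idx] + B[idx];
}

// ... in main(), launch with enough blocks to cover all N elements:
int threadsPerBlock = 256;             // arbitrary choice, must stay within the 512 limit
int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
vecAddLarge<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c, N);[/codebox]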

    Some beginner CUDA tricks can be found here.

    The use of shared memory in this program is also unnecessary. Shared memory is useful when you need to use a value more than once, or possibly read elements and operate on them in a different order. Here each thread reads an element from A, one from B, and writes the sum (actually you have the product uncommented) to C. That can be done directly, with no __syncthreads() and no shared-memory storage.
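
    A minimal sketch of the kernel without shared memory, keeping the original one-block launch and assuming the sum is what you actually want:

    [codebox]// Sketch: the same element-wise add with no shared memory.
// Each thread touches its element exactly once, so there is nothing to share or synchronize.
__global__ void vecAdd(int *A, int *B, int *C){
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}[/codebox]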