# Hello World Matrix Adder program: add two matrices (used shared memory for optimum performance)

Hi all,

This is my first hello-world program! It runs great on the Tesla GPU platform, but I found a problem with N=1000. Can anyone clarify what's wrong with the code?

[codebox]
/*
 * Program to compute the sum of two arrays of size N using the GPU
 *
 * @author Sajan Kumar.S
 * @email: nospam+ammasajan[A.T]gmail.com
 */

#include <stdio.h>
#include <stdlib.h>

#define N 20 // 20 elements

__global__ void vecAdd(int *A, int *B, int *C){
    int i = threadIdx.x;

    __shared__ int s_A[N], s_B[N], s_C[N];

    // copy the values to shared mem and attack! :D
    s_A[i] = A[i];
    s_B[i] = B[i];

    // C[i] = A[i] + B[i];
    // s_C[i] = s_A[i] + s_B[i]; // to calculate the sum of the elements
    s_C[i] = s_A[i] * s_B[i];    // to calculate the sum of the elements

    C[i] = s_C[i];
}

int main(){
    int *h_a = 0, *h_b = 0, *h_c = 0;
    int *d_a = 0, *d_b = 0, *d_c = 0;
    int memSize = N * sizeof(int);

    // allocate host memory of size N
    h_a = (int *)malloc(memSize);
    h_b = (int *)malloc(memSize);
    h_c = (int *)malloc(memSize);

    // allocate GPU memory of size N
    cudaMalloc((void **)&d_a, memSize);
    cudaMalloc((void **)&d_b, memSize);
    cudaMalloc((void **)&d_c, memSize);

    // init the values of the A and B arrays (clearing the C array)
    for(int i = 0; i < N; i++){
        h_a[i] = i + 2;
        h_b[i] = i + 3;
        h_c[i] = 0;
    }

    // copy the values to the GPU arrays A and B
    cudaMemcpy(d_a, h_a, memSize, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, memSize, cudaMemcpyHostToDevice);

    // print the A and B arrays on the CPU
    printf("\n Array A : \n");
    for(int i = 0; i < N; i++)
        printf("%d\t", h_a[i]);
    printf("\n Array B : \n");
    for(int i = 0; i < N; i++)
        printf("%d\t", h_b[i]);
    printf("\ncalculating Sum : ");

    // launch the kernel: one block of N threads
    vecAdd<<<1, N>>>(d_a, d_b, d_c);

    // copy the output C from the GPU back to host memory
    cudaMemcpy(h_c, d_c, memSize, cudaMemcpyDeviceToHost);

    printf("\nSum of Arrays: \n");
    for(int i = 0; i < N; i++)
        printf("%d\t", h_c[i]);

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);

    free(h_a);
    free(h_b);
    free(h_c);

    return 0;
}
[/codebox]
The CUDA block size limit is 512. This:

```
vecAdd<<<1, N>>>(d_a,d_b,d_c);
```

will not work for N>512. You will have to enlarge the grid for larger vectors.
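For example, a common way to size the grid is to round the block count up (a sketch only; `n` and `threadsPerBlock` are illustrative names, not from the original post):

```cuda
// Illustrative launch configuration for an arbitrary vector length n.
// n and threadsPerBlock are example names, not from the original code.
int n = 100000;
int threadsPerBlock = 256;  // safely below the 512-thread block limit
int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;  // round up
vecAdd<<<numBlocks, threadsPerBlock>>>(d_a, d_b, d_c);
```

Because `numBlocks * threadsPerBlock` can exceed `n`, the kernel then needs a bounds check on the computed index.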

Calculate array indices like this when you have more than one block:

```
int idx = blockDim.x * blockIdx.x + threadIdx.x;
```

or you can do:

```
int idx = __umul24( blockDim.x, blockIdx.x) + threadIdx.x;
```

as this will be faster!
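Putting the index calculation together, a sketch of a multi-block kernel (the extra `n` parameter and the bounds check are additions, needed because a rounded-up grid can launch more threads than there are elements):

```cuda
// Sketch of a multi-block version of vecAdd. The n parameter and the
// if-guard are additions: when the grid is rounded up, the last block
// may contain threads that fall past the end of the arrays.
__global__ void vecAddLarge(int *A, int *B, int *C, int n){
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    if (idx < n)
        C[idx] = A[idx] + B[idx];
}
```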

Some beginner CUDA tricks can be found here.

The use of shared memory in this program is also unnecessary. Shared memory is useful when you need to use a value more than once, or possibly read elements and operate on them in a different order. Here each thread reads one element from A and one from B, then writes the sum (actually you have the product uncommented) to C. That can be done directly, with no `__syncthreads()` and no shared-memory staging.
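For this access pattern, a minimal sketch of the kernel with the shared-memory staging removed (same signature as the original post):

```cuda
// Each thread touches its own element exactly once, so reading and
// writing global memory directly is enough: no __shared__ arrays and
// no __syncthreads() are needed.
__global__ void vecAdd(int *A, int *B, int *C){
    int i = threadIdx.x;
    C[i] = A[i] + B[i];
}
```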