Multiplying two arrays

I’m trying to learn CUDA for a school project, so I’ve written a little program to multiply the elements of two arrays together.

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>

const int ARRAY_SIZE = 16000;

__global__ void mulArrOnD(const int arrSize, int *array1, int *array2, int *array3)
{
    int tx = blockIdx.x * 256 + threadIdx.x;
    array3[tx] = array1[tx] * array2[tx];
}

int main()
{
    dim3 threads(256);
    dim3 grid(ARRAY_SIZE/256);

    int arr1[ARRAY_SIZE];
    int arr2[ARRAY_SIZE];
    int *arr3 = (int*) malloc(ARRAY_SIZE*sizeof(int));

    int *arr1D; //pointer to arr1 on device
    int *arr2D; //pointer to arr2 on device
    int *arr3D; //pointer to product array on device
    int r; //variable to store a random integer

    //size of memory to allocate on device
    int sizeOfAllocation = ARRAY_SIZE*sizeof(int);

    //allocates the memory for the arrays on the device;
    //cudaMalloc returns the device pointer through its first argument
    cudaMalloc((void **)&arr1D, sizeOfAllocation);
    cudaMalloc((void **)&arr2D, sizeOfAllocation);
    cudaMalloc((void **)&arr3D, sizeOfAllocation);

    //populate arr1 and arr2 with random numbers
    for (int i = 0; i < ARRAY_SIZE; i++)
    {
        r = (rand() % 10) + 1;
        arr1[i] = r;
    }

    for (int j = 0; j < ARRAY_SIZE; j++)
    {
        r = (rand() % 10) + 1;
        arr2[j] = r;
    }

    //copies the arrays in host memory to the memory on the device
    cudaMemcpy(arr1D, arr1, sizeof(int) * ARRAY_SIZE, cudaMemcpyHostToDevice);
    cudaMemcpy(arr2D, arr2, sizeof(int) * ARRAY_SIZE, cudaMemcpyHostToDevice);

    //multiply arr1D and arr2D, store results in arr3D
    mulArrOnD<<< grid, threads >>>(ARRAY_SIZE, arr1D, arr2D, arr3D);

    //copy product array from device to host (dest, src, size, type of copy)
    cudaMemcpy(arr3, arr3D, sizeof(int) * ARRAY_SIZE, cudaMemcpyDeviceToHost);

    //display the contents of the product array
    for (int k = 0; k < ARRAY_SIZE; k++)
        printf("%d \n", arr3[k]);

    return 0;
}


It works, but the problem is that the last elements in the array aren’t multiplied, because I’m using (ARRAY_SIZE / number of threads) to determine the number of blocks. 16000 doesn’t divide evenly by 256, so there aren’t enough blocks to handle the entirety of the array.

Simply creating a grid with more blocks than I need crashes the program at runtime.

So basically, I just want to ensure that all of the elements in the array are multiplied and that the program doesn’t try to multiply elements which don’t exist.

Any ideas on how to handle this?

__global__ void mulArrOnD(const int arrSize, int *array1, int *array2, int *array3)
{
    int tx = blockIdx.x * 256 + threadIdx.x;
    if (tx < arrSize) array3[tx] = array1[tx] * array2[tx];
}


BTW, you don’t need the __syncthreads() calls, since you are not using shared memory.
Change the grid size to schedule enough blocks to cover the whole array.

Thanks, that got it to work. I’m now trying to implement a solution using shared memory.

This is the current kernel:

__global__ void mulArrOnD(const int arrSize, int *array1, int *array2, int *array3)
{
    int tx = blockIdx.x * 256 + threadIdx.x;

    __shared__ int ArrShared1[256];
    __shared__ int ArrShared2[256];

    ArrShared1[tx] = array1[tx];
    ArrShared2[tx] = array2[tx];

    if (tx < arrSize)
        array3[tx] = ArrShared1[tx] * ArrShared2[tx];
}



This now causes the program to crash, and I’m not sure why. Any ideas?

Also, this method of copying to shared memory seems very redundant. Each thread copies two elements from global memory to shared memory, which is no better than what was being done previously.

Is there a way to have one thread copy a block of 256 elements into shared memory, and have the other threads in the warp read from shared memory instead of each thread having to go to global memory?


The indices of ArrShared need to be threadIdx.x (otherwise you are indexing outside the range 0,…,255).
Shared memory will not give you extra speed for this kernel.

ArrShared1[tx] = array1[tx];

ArrShared2[tx] = array2[tx];

Has 2 errors:

  • You allocate 256 shared ints but index them with tx, which can be as large as ceil(ARRAY_SIZE/256) * 256 − 1, so you are accessing the shared arrays far beyond their bounds.

  • You are accessing array1 and array2 beyond their bounds, just like before.

if (tx < arrSize) {

  ArrShared1[threadIdx.x] = array1[tx];

  ArrShared2[threadIdx.x] = array2[tx];

}


But the fundamental thing here is that this will not get you any performance benefit. This algorithm is not suitable for speedup with shared memory, because each element is read from global memory exactly once either way. Take a look at the matrix-multiplication example (there are also slides online) to see how shared memory pays off when data is reused.
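Putting both fixes together, a corrected version of the shared-memory kernel might look like the sketch below. It indexes shared memory with threadIdx.x (0–255) and global memory with tx, and guards both accesses. Note it still won’t beat the plain kernel, for the reason above:

```cuda
__global__ void mulArrOnD(const int arrSize, int *array1, int *array2, int *array3)
{
    int tx = blockIdx.x * 256 + threadIdx.x;

    __shared__ int ArrShared1[256];
    __shared__ int ArrShared2[256];

    if (tx < arrSize)
    {
        // Shared memory is per-block: index it with threadIdx.x (0..255),
        // and index global memory with the global index tx.
        ArrShared1[threadIdx.x] = array1[tx];
        ArrShared2[threadIdx.x] = array2[tx];

        // No __syncthreads() needed here, since each thread only reads
        // back the elements it wrote itself.
        array3[tx] = ArrShared1[threadIdx.x] * ArrShared2[threadIdx.x];
    }
}
```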

(I need to learn to type faster :D )

Hi everybody!
Sorry for my English…

I’m a beginner in CUDA and I decided to take this example because it’s simple to understand.
I need some clarification about this:

const int ARRAY_SIZE = 16000;
dim3 grid(ARRAY_SIZE/256);

I don’t understand the limits very well. I mean, if I put const int ARRAY_SIZE = 700000, for example, not all of the values will be calculated.

So can someone explain to me:
what is the limit on the number of blocks per grid?
is 512 the limit on the number of threads per block?
and how can I address 700000 ints, for example, in one pass (if it’s possible)?



BKCO, as I understand it, if you have 256 threads per block and each thread calculates one value, then you would need ceil(700000 / 256) = 2735 blocks per grid.

The maximum number of blocks per grid is in the programming guide, but I believe it is an extraordinarily large number.

I’m not sure if this is the question you had or not.