Multiplying two arrays

I’m trying to learn CUDA for a school project, so I’ve written a little program to multiply the elements of two arrays together.

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <math.h>

const int ARRAY_SIZE = 16000;

__global__ void mulArrOnD(const int arrSize, int *array1, int *array2, int *array3)
{
    int tx = blockIdx.x * 256 + threadIdx.x;
    array3[tx] = array1[tx] * array2[tx];
}

int main()
{
    dim3 threads(256);
    dim3 grid(ARRAY_SIZE/256);

    int arr1[ARRAY_SIZE];
    int arr2[ARRAY_SIZE];
    int *arr3 = (int*) malloc(ARRAY_SIZE*sizeof(int));

    int *arr1D; //pointer to arr1 on device
    int *arr2D; //pointer to arr2 on device
    int *arr3D; //pointer to product array on device
    int r; //variable to store a random integer

    //size of memory to allocate on device
    int sizeOfAllocation = ARRAY_SIZE*sizeof(int);

    //allocates the memory for the arrays on the device;
    //cudaMalloc returns the device pointer through its first argument
    cudaMalloc((void **)&arr1D, sizeOfAllocation);
    cudaMalloc((void **)&arr2D, sizeOfAllocation);
    cudaMalloc((void **)&arr3D, sizeOfAllocation);

    //populate arr1 and arr2 with random numbers
    for (int i = 0; i < ARRAY_SIZE; i++)
    {
        r = (rand() % 10) + 1;
        arr1[i] = r;
    }

    for (int j = 0; j < ARRAY_SIZE; j++)
    {
        r = (rand() % 10) + 1;
        arr2[j] = r;
    }

    //copies the arrays in host memory to the memory on the device
    cudaMemcpy(arr1D, arr1, sizeof(int) * ARRAY_SIZE, cudaMemcpyHostToDevice);
    cudaMemcpy(arr2D, arr2, sizeof(int) * ARRAY_SIZE, cudaMemcpyHostToDevice);

    //multiply arr1D and arr2D, store results in arr3D
    mulArrOnD<<< grid, threads >>>(ARRAY_SIZE, arr1D, arr2D, arr3D);

    //copy product array from device to host (dest, src, size, type of copy)
    cudaMemcpy(arr3, arr3D, sizeof(int) * ARRAY_SIZE, cudaMemcpyDeviceToHost);

    //display the contents of the product array
    for (int k = 0; k < ARRAY_SIZE; k++)
        printf("%d \n", arr3[k]);

    return 0;
}


It works, but the problem is that the last elements in the array aren’t multiplied, because I’m using (ARRAY_SIZE / number of threads) to determine the number of blocks. 16000 doesn’t divide evenly by 256, so there aren’t enough blocks to handle the entirety of the array.

Simply creating a grid with more blocks than I need crashes the program at runtime.

So basically, I just want to ensure that all of the elements in the array are multiplied and that the program doesn’t try to multiply elements which don’t exist.

Any ideas on how to handle this?

__global__ void mulArrOnD(const int arrSize, int *array1, int *array2, int *array3)
{
    int tx = blockIdx.x * 256 + threadIdx.x;
    if (tx < arrSize) array3[tx] = array1[tx] * array2[tx];
}


BTW, you don’t need the __syncthreads() calls, since you are not using shared memory.
Change the grid size to schedule enough blocks to cover the whole array.

Thanks, that got it to work. I’m now trying to implement a solution using shared memory.

This is the current kernel:

__global__ void mulArrOnD(const int arrSize, int *array1, int *array2, int *array3)
{
    int tx = blockIdx.x * 256 + threadIdx.x;

    __shared__ int ArrShared1[256];
    __shared__ int ArrShared2[256];

    ArrShared1[tx] = array1[tx];
    ArrShared2[tx] = array2[tx];

    if (tx < arrSize)
        array3[tx] = ArrShared1[tx] * ArrShared2[tx];
}



This now causes the program to crash, and I’m not sure why. Any ideas?

Also, this method of copying to shared memory seems very redundant. Each thread copies two elements from global memory to shared memory, which is no better than what was being done previously.

Is there a way to have one thread copy a block of 256 elements into shared memory, and have the other threads in the warp read from shared memory instead of each thread having to go to global memory?


The indices of ArrShared need to be threadIdx.x (otherwise you are indexing outside the range 0,…,255).
Shared memory will not give you extra speed for this kernel.

ArrShared1[tx] = array1[tx];

ArrShared2[tx] = array2[tx];

Has 2 errors:

  • You allocate 256 shared ints but index them with tx, which can be as large as ceil(ARRAY_SIZE/256) * 256 − 1, so you are accessing the shared arrays far beyond their bounds.

  • You are accessing array1 and array2 beyond their bounds, just like before.

if (tx < arrSize) {

  ArrShared1[threadIdx.x] = array1[tx];

  ArrShared2[threadIdx.x] = array2[tx];

}


But the fundamental thing here is that this will not get you any performance benefit. This algorithm is not suitable for speedup with shared memory, because each element is read from global memory exactly once either way. Take a look at the matrix-multiplication example (there are also slides online) to see how shared memory pays off when data is reused.
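Putting both fixes together, a corrected version of the shared-memory kernel might look like the sketch below. It indexes shared memory with threadIdx.x (0–255) and global memory with tx, and guards both accesses. Note it still won’t beat the plain kernel, for the reason above:

```cuda
__global__ void mulArrOnD(const int arrSize, int *array1, int *array2, int *array3)
{
    int tx = blockIdx.x * 256 + threadIdx.x;

    __shared__ int ArrShared1[256];
    __shared__ int ArrShared2[256];

    if (tx < arrSize)
    {
        // Shared memory is per-block: index it with threadIdx.x (0..255),
        // and index global memory with the global index tx.
        ArrShared1[threadIdx.x] = array1[tx];
        ArrShared2[threadIdx.x] = array2[tx];

        // No __syncthreads() needed here, since each thread only reads
        // back the elements it wrote itself.
        array3[tx] = ArrShared1[threadIdx.x] * ArrShared2[threadIdx.x];
    }
}
```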

(I need to learn to type faster :D )

Hi everybody!
Sorry for my English…

I’m a beginner in CUDA and I decided to take this example because it’s simple to understand.
I need some clarification about this:

const int ARRAY_SIZE = 16000;
dim3 grid(ARRAY_SIZE/256);

I don’t understand the limits very well. I mean, if I put const int ARRAY_SIZE = 700000, for example, not all of the values will be calculated.

So can someone explain to me:
what is the limit on the number of blocks per grid?
is 512 the limit on the number of threads per block?
and how can I address 700000 ints, for example, in one pass (if it’s possible)?



BKCO, as I understand it, if you have 256 threads per block and each thread calculates one value, then you would need ceil(700000 / 256) = 2735 blocks per grid.

The maximum number of blocks per grid is in the programming guide, but I believe it is an extraordinarily large number.

I’m not sure if this is the question you had or not.