I am trying to use CUDA to speed up my program, but I am not sure how to use shared memory. I bought the book “Programming Massively Parallel Processors”, which has some samples, but the sample (matrix computation) is too simple, and I find it hard to apply to a real-world application.
In my case, I want to use a 1d array of float values (suppose it has m elements, with m about 1000) to interpolate a 2d array (suppose it has n * k elements, with n and k about 500). Calculating any one element of the 2d array needs two items of the 1d array, but which two items depends on some condition. I think it is a very simple program and a good candidate for CUDA.
It is easy to write a simple GPU implementation, but it is actually about 5x slower than the regular CPU code. So I started looking into reducing global memory accesses. The first step, of course, is to put the 1d array (about 4 KB in size) into the shared memory of each block, which is supposed to be much faster than leaving the array in global memory. But I am not sure how to put the 1d array into shared memory.
The sample code in the book looks like this:
__shared__ float Mds[TILE_WIDTH][TILE_WIDTH]; // TILE_WIDTH is a constant with value 2.
But when the sample accesses it, the index looks like it is greater than TILE_WIDTH.
Here is my simplified kernel code (my 2d array is still represented as a 1d array, stored row by row).
__global__ void MyKernel(float* array1d, float* array2d, int m, int n, int k, /* some other args */)
{
// In this version, both array1d and array2d are in global memory.
// The question is how to put array1d in shared memory.
int i = blockIdx.x;
int j = threadIdx.x;
int index = j * k + i;
int i0 = x; // x actually is calculated from i, j and other args.
int i1 = x + 1;
array2d[index] = (array1d[i0] + array1d[i1]) / 2.0f; // the real computation is more complicated and time-consuming, and uses the other args.
}
Here is how MyKernel is called:
dim3 dimBlock(n, 1);
dim3 dimGrid(k, 1);
MyKernel<<<dimGrid, dimBlock>>>(array1d, array2d, m, n, k, … other args… );
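To make the question concrete, here is a minimal sketch of what I imagine a shared-memory version of MyKernel could look like. It is only my guess, not something I know to be right. The names MyKernelShared, s1d and pickIndex are made up for this example, and pickIndex is just a placeholder for the real calculation of x. The sketch assumes the whole 1d array fits in one block's shared memory (about 4 KB here) and that n does not exceed the maximum number of threads per block.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical placeholder for the real computation of x from i, j and other args.
// It always returns a value in [0, m - 2], so i0 + 1 stays inside the 1d array.
__host__ __device__ inline int pickIndex(int i, int j, int m)
{
    return (i + j) % (m - 1);
}

__global__ void MyKernelShared(const float* array1d, float* array2d,
                               int m, int n, int k)
{
    // Size is supplied at launch time by the third <<<...>>> argument.
    extern __shared__ float s1d[];

    // Cooperative copy: thread t loads elements t, t + blockDim.x, ...
    // so the block covers all m elements even when m > blockDim.x.
    for (int t = threadIdx.x; t < m; t += blockDim.x)
        s1d[t] = array1d[t];
    __syncthreads();  // every element must be written before any thread reads

    int i = blockIdx.x;   // column, same mapping as in MyKernel
    int j = threadIdx.x;  // row
    if (i >= k || j >= n) return;

    int i0 = pickIndex(i, j, m);
    array2d[j * k + i] = (s1d[i0] + s1d[i0 + 1]) / 2.0f;
}

int main()
{
    const int m = 1000, n = 500, k = 500;
    const size_t bytes1d = m * sizeof(float);
    const size_t bytes2d = (size_t)n * k * sizeof(float);

    float* h1d = (float*)malloc(bytes1d);
    float* h2d = (float*)malloc(bytes2d);
    for (int t = 0; t < m; ++t) h1d[t] = 0.5f * t;

    float *d1d = nullptr, *d2d = nullptr;
    cudaMalloc(&d1d, bytes1d);
    cudaMalloc(&d2d, bytes2d);
    cudaMemcpy(d1d, h1d, bytes1d, cudaMemcpyHostToDevice);

    // k blocks of n threads, with m floats of dynamic shared memory per block.
    MyKernelShared<<<k, n, bytes1d>>>(d1d, d2d, m, n, k);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    cudaMemcpy(h2d, d2d, bytes2d, cudaMemcpyDeviceToHost);

    // Compare against the same formula evaluated on the CPU.
    int mismatches = 0;
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < k; ++i) {
            int i0 = pickIndex(i, j, m);
            float expected = (h1d[i0] + h1d[i0 + 1]) / 2.0f;
            if (h2d[j * k + i] != expected) ++mismatches;
        }
    printf("mismatches: %d\n", mismatches);

    cudaFree(d1d); cudaFree(d2d);
    free(h1d); free(h2d);
    return mismatches == 0 ? 0 : 1;
}
```

Since shared memory is private to each block, every one of the k blocks copies the whole 1d array again in this sketch, so I do not know whether it would actually beat reading array1d directly from global memory.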
Can anybody give me some advice? Thanks in advance.