As my first CUDA program, I thought I’d try a 1D heat equation solver and compare it to one I’d made in GLSL in the old-school GPGPU fashion.
I now have a working version which renders the heat distribution to screen every so often (often enough to appear animated). It was a quick, unoptimised initial attempt, and it worked. My plan was then to improve it by using shared and constant memory where applicable (assuming I knew what ‘applicable’ entailed).
My solution was as follows: store the temperature values in a linear block of global memory. Take each diagonal of the tri-diagonal update matrix and store it in its own array. To update/iterate, I call the kernel and it performs the necessary operation: x_new[i] = L[i] * x[i-1] + D[i] * x[i] + U[i] * x[i+1]; where L, D, U are the lower, main and upper diagonals respectively.
As you can see, each thread accesses its neighbours' values too, so I thought shared memory would be useful, so that one (rather than three) global memory reads of X occur per element. However, this did not change anything performance-wise.
My kernel looks like this:
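To make the update concrete, here is a minimal sketch of the plain global-memory version of that formula. It is not the original kernel: the names `HeatEqnNaive` and `heatStep`, the length parameter `n`, and the choice to hold the two end points fixed are all assumptions made for illustration.

```cuda
#include <cuda_runtime.h>
#include <vector>

// One time step: y = A * x, with the tri-diagonal A stored as three arrays.
// The end points are copied through unchanged (an assumed boundary treatment)
// so that x[i-1] and x[i+1] are never read out of range.
__global__ void HeatEqnNaive(const float* x, float* y,
                             const float* L, const float* D, const float* U,
                             int n)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i >= n) return;
    if (i == 0 || i == n - 1) {
        y[i] = x[i];  // boundary value carried over (assumption)
        return;
    }
    // Three reads of x from global memory per element.
    y[i] = L[i] * x[i - 1] + D[i] * x[i] + U[i] * x[i + 1];
}

// Hypothetical host helper: uploads the inputs, runs one step, downloads the result.
std::vector<float> heatStep(const std::vector<float>& x,
                            const std::vector<float>& L,
                            const std::vector<float>& D,
                            const std::vector<float>& U)
{
    const int n = static_cast<int>(x.size());
    std::vector<float> y(n);
    if (n == 0) return y;
    const size_t bytes = n * sizeof(float);

    float *dX, *dY, *dL, *dD, *dU;
    cudaMalloc(&dX, bytes);
    cudaMalloc(&dY, bytes);
    cudaMalloc(&dL, bytes);
    cudaMalloc(&dD, bytes);
    cudaMalloc(&dU, bytes);
    cudaMemcpy(dX, x.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dL, L.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dD, D.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dU, U.data(), bytes, cudaMemcpyHostToDevice);

    const int block = 256;                       // threads per block
    const int grid = (n + block - 1) / block;    // enough blocks to cover n
    HeatEqnNaive<<<grid, block>>>(dX, dY, dL, dD, dU, n);

    // This copy also waits for the kernel to finish.
    cudaMemcpy(y.data(), dY, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dX); cudaFree(dY); cudaFree(dL); cudaFree(dD); cudaFree(dU);
    return y;
}
```

Calling `heatStep` repeatedly with the previous output as the next input gives the iteration; a real solver would keep the buffers on the device between steps rather than copying every time.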
[codebox]
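__global__ static void HeatEqn2(float2* vbo, float* X, float* Y, float* L, float* D, float* U)
{
    // One block's slice of X plus one halo element on each side.
    __shared__ float line[BLOCK_SIZE+2];
    int idx = threadIdx.x + blockDim.x * blockIdx.x;
    // Each thread stages its own element, shifted by one to leave room for the left halo.
    line[threadIdx.x+1] = X[idx];
    // Two threads also fetch the halos. NOTE: in the first and last block these
    // reads fall outside X unless X is padded or the boundaries are handled separately.
    if (threadIdx.x==1)
        line[0] = X[idx-2];                     // left halo: element just before this block
    else if (threadIdx.x==0)
        line[blockDim.x+1] = X[idx+blockDim.x]; // right halo: element just after this block
    // Wait until the whole slice is staged before any thread reads its neighbours.
    __syncthreads();
    // OLD WAY --> float val = L[idx] * X[idx-1] + D[idx]*X[idx] + U[idx]*X[idx+1];
    float val = L[idx] * line[threadIdx.x] + D[idx]*line[threadIdx.x+1] + U[idx]*line[threadIdx.x+2];
    Y[idx] = val;
    vbo[idx].y = val*300; // scaled value goes to the y coordinate of the vertex buffer
}[/codebox]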
The shared block has two extra elements to account for the out-of-block accesses. vbo is the vertex buffer used for rendering. In the ‘unoptimised’ (yet equally fast) kernel, I had no shared line array and instead read directly from X.
Can anyone explain to me why this does not yield a significant improvement?
In addition to this, since L, D and U do not change, I thought I should place them in on-chip constant memory for faster access, but that made no difference either.
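For reference, placing the diagonals in constant memory can be sketched roughly as below. This is an illustration rather than the code actually used: the array names `cL`, `cD`, `cU`, the helper `uploadDiagonals`, the kernel name `HeatEqnConst`, the fixed size `N = 1024` (4 blocks of 256 threads) and the fixed end points are all assumptions.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Assumed problem size: 4 blocks of 256 threads.
constexpr int N = 1024;

// The three diagonals in constant memory: 3 * 4 KB, inside the 64 KB limit.
__constant__ float cL[N];
__constant__ float cD[N];
__constant__ float cU[N];

// Hypothetical helper: copies the host-side diagonals into the constant arrays.
// Returns false if a size is wrong or any copy fails.
bool uploadDiagonals(const std::vector<float>& L,
                     const std::vector<float>& D,
                     const std::vector<float>& U)
{
    const size_t n = static_cast<size_t>(N);
    if (L.size() != n || D.size() != n || U.size() != n) return false;
    const size_t bytes = n * sizeof(float);
    return cudaMemcpyToSymbol(cL, L.data(), bytes) == cudaSuccess
        && cudaMemcpyToSymbol(cD, D.data(), bytes) == cudaSuccess
        && cudaMemcpyToSymbol(cU, U.data(), bytes) == cudaSuccess;
}

// The tri-diagonal update, with the coefficients read from constant memory
// instead of being passed in as global-memory pointers.
__global__ void HeatEqnConst(const float* x, float* y)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i >= N) return;
    if (i == 0 || i == N - 1) {
        y[i] = x[i];  // fixed end points (assumption)
        return;
    }
    y[i] = cL[i] * x[i - 1] + cD[i] * x[i] + cU[i] * x[i + 1];
}
```

The diagonals only need to be uploaded once, before the first step; after that a launch such as `HeatEqnConst<<<4, 256>>>(dX, dY)` (with `dX` and `dY` being device buffers of `N` floats) reuses them on every iteration.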
My block size is 256x1x1 and, for now, the grid size is just 4x1 (1024 threads in total).
Thanks in advance