Optimising a Sparse Matrix Routine for a 3D Grid?

I’m looking to optimise an algorithm that performs a matrix-vector multiplication for a very sparse matrix with a distinct and consistent layout. As the underlying data is based on a 3D grid, the data is accessed using offsets into this grid. Here is the kernel that I wish to optimise:

__global__ void kernel(float* cdiag, float* ci, float* cj, float* ck,
                       float* cij, float* cik, float* cjk,
                       double* r, double* z)
{
	// Grid dimensions (200^3 cells).
	int nx = 200;
	int ny = 200;
	int nz = 200;

	// One thread per cell; linear index into the flattened grid.
	int ijk = (blockIdx.x * blockDim.x) + threadIdx.x;

	// Neighbour indices: +/-1 in i, +/-nx in j, +/-nx*ny in k,
	// plus the six "cross" neighbours used by the off-axis terms.
	int iMjk = ijk - 1;
	int iPjk = ijk + 1;
	int ijMk = ijk - nx;
	int ijPk = ijk + nx;
	int ijkM = ijk - nx * ny;
	int ijkP = ijk + nx * ny;
	int iPjMk = ijk - nx + 1;
	int iMjPk = ijk + nx - 1;
	int iMjkP = ijk + nx * ny - 1;
	int ijMkP = ijk + nx * ny - nx;
	int iPjkM = ijk - nx * ny + 1;
	int ijPkM = ijk - nx * ny + nx;

	// Accumulate the 13-point stencil.
	double result = 0;
	result += ci[ijk] * r[iPjk];
	result += cj[ijk] * r[ijPk];
	result += ck[ijk] * r[ijkP];
	result += cdiag[ijk] * r[ijk];
	result += ci[iMjk] * r[iMjk];
	result += cj[ijMk] * r[ijMk];
	result += ck[ijkM] * r[ijkM];
	result += cij[iPjk] * r[iPjMk];
	result += cij[ijPk] * r[iMjPk];
	result += cik[iPjk] * r[iPjkM];
	result += cik[ijkP] * r[iMjkP];
	result += cjk[ijPk] * r[ijPkM];
	result += cjk[ijkP] * r[ijMkP];

	z[ijk] = result;
}
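For completeness, the launch is just one thread per cell; a minimal host-side sketch (the d_* pointer names are placeholders for device allocations made elsewhere):

// Minimal launch sketch: one thread per grid cell.
// The d_* names are placeholders for device allocations made elsewhere.
int nx = 200, ny = 200, nz = 200;
int nCells = nx * ny * nz;                                         // 8,000,000 cells
int threadsPerBlock = 256;
int numBlocks = (nCells + threadsPerBlock - 1) / threadsPerBlock;  // 31250
kernel<<<numBlocks, threadsPerBlock>>>(d_cdiag, d_ci, d_cj, d_ck,
                                       d_cij, d_cik, d_cjk, d_r, d_z);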

My first thought was to launch the kernel in blocks of 256 threads (8 x 8 x 4), load all the r values into shared memory, and rely on the repeated accesses throughout the block to save significantly on global memory loads, which has worked fairly well. I have also managed to fit the ci, cj and ck values into shared memory, though with limited success (I believe the amount of shared memory used starts to limit occupancy at that point).
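To show what I mean, here is a heavily simplified sketch of the shared-memory tiling (only the 7-point part of the stencil, halo loads without boundary checks, and not my actual tuned code; the kernel and macro names are just for illustration):

#define BX 8
#define BY 8
#define BZ 4

__global__ void kernel_tiled(float* cdiag, float* ci, float* cj, float* ck,
                             double* r, double* z, int nx, int ny)
{
    // One (BX+2) x (BY+2) x (BZ+2) tile of r, including a one-cell halo.
    __shared__ double rs[BZ + 2][BY + 2][BX + 2];

    int i = blockIdx.x * BX + threadIdx.x;
    int j = blockIdx.y * BY + threadIdx.y;
    int k = blockIdx.z * BZ + threadIdx.z;
    int ijk = (k * ny + j) * nx + i;

    int li = threadIdx.x + 1;
    int lj = threadIdx.y + 1;
    int lk = threadIdx.z + 1;

    // Interior of the tile: one coalesced load per thread.
    rs[lk][lj][li] = r[ijk];

    // Halo faces: edge threads fetch one extra value each
    // (domain boundary checks omitted for brevity).
    if (threadIdx.x == 0)      rs[lk][lj][0]      = r[ijk - 1];
    if (threadIdx.x == BX - 1) rs[lk][lj][BX + 1] = r[ijk + 1];
    if (threadIdx.y == 0)      rs[lk][0][li]      = r[ijk - nx];
    if (threadIdx.y == BY - 1) rs[lk][BY + 1][li] = r[ijk + nx];
    if (threadIdx.z == 0)      rs[0][lj][li]      = r[ijk - nx * ny];
    if (threadIdx.z == BZ - 1) rs[BZ + 1][lj][li] = r[ijk + nx * ny];

    __syncthreads();

    // 7-point part of the stencil served from shared memory; the six
    // cross terms (cij, cik, cjk) are left out of this sketch.
    double result = cdiag[ijk]        * rs[lk][lj][li]
                  + ci[ijk]           * rs[lk][lj][li + 1]
                  + ci[ijk - 1]       * rs[lk][lj][li - 1]
                  + cj[ijk]           * rs[lk][lj + 1][li]
                  + cj[ijk - nx]      * rs[lk][lj - 1][li]
                  + ck[ijk]           * rs[lk + 1][lj][li]
                  + ck[ijk - nx * ny] * rs[lk - 1][lj][li];

    z[ijk] = result;
}

That would be launched with dim3(BX, BY, BZ) thread blocks and a dim3(nx/BX, ny/BY, nz/BZ) grid.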

I’d really like to reduce the uncoalesced accesses, but I clearly can’t fit all of this data neatly into shared memory in a coalesced form. Can anyone suggest my best options for improving the efficiency of this code? I could reduce the block size below 256 threads to give each thread more shared memory, but I suspect that would seriously hurt my ability to hide global memory latency.

I’m still relatively new to GPU programming so any advice on the direction to take at this point would be really useful. I’ve been experimenting and studying the cudaprof results extensively and am a bit out of ideas.

Thanks,

Dan

If the matrix is very sparse, and the structure very predictable (for example a diagonal banded matrix), wouldn’t it be better to just store the non-zeros? You would probably have a better chance of exploiting shared memory and getting the reads coalesced.

There are already some very high-performance sparse multiply routines available for the common sparse matrix formats (CSC, CSR, DIA) which you might be interested in looking at. Some more details can be found here.
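To give a flavour of DIA in particular: each non-zero band is described by its offset from the main diagonal plus a dense array of values, and the multiply is one loop per band. A rough CPU-side sketch of the idea (not the library routines themselves):

/* Rough sketch of a DIA (diagonal format) matrix-vector multiply:
 * nbands band offsets, and one dense value array of length n per band.
 * Entries that fall outside the matrix are simply stored as zeros. */
void dia_spmv(int n, int nbands, const int* offsets,
              const double* values,   /* nbands * n values, band-major */
              const double* x, double* y)
{
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int b = 0; b < nbands; b++) {
            int col = i + offsets[b];
            if (col >= 0 && col < n)
                sum += values[b * n + i] * x[col];
        }
        y[i] = sum;
    }
}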

These are all the non-zero values! I know there will always be 13 multiplications to perform for every element to be calculated (the remaining, nearly 8 million, zero entries in each row are simply not stored), and that is what is being represented here.

I have been through the research you linked to before. Compressed sparse row and other such formats are ideal for situations where the number of elements stored in each row varies, since the data structures adapt to a range of different sparse matrix layouts. However, my data is laid out so precisely that I know exactly which elements to multiply by which other elements, so surely I can exploit shared memory and coalesced reads better in this instance than generic sparse algorithms and data structures can?

I’m just a little unsure how to use this data in a coalesced fashion, as I clearly don’t have enough shared memory to hold the entire data set for each block.

I was thrown by the large stride values in your indexing calculations. What is the maximum number of non-zeros in a row of the matrix?

Most rows (90% or so) will have 13 non-zero values (the 13 multiplications shown in the kernel); the remainder will have fewer than this.
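Written in those DIA terms, the 13 values sit on fixed band offsets; taking the index arithmetic straight from my kernel (nx = ny = 200):

/* The 13 band offsets implied by the kernel's index arithmetic
 * (relative offset of each non-zero column from the diagonal). */
enum { NX = 200, NY = 200 };

static const int offsets[13] = {
    -NX * NY,        /* ijkM  */
    -NX * NY + 1,    /* iPjkM */
    -NX * NY + NX,   /* ijPkM */
    -NX,             /* ijMk  */
    -NX + 1,         /* iPjMk */
    -1,              /* iMjk  */
     0,              /* diagonal */
     1,              /* iPjk  */
     NX - 1,         /* iMjPk */
     NX,             /* ijPk  */
     NX * NY - NX,   /* ijMkP */
     NX * NY - 1,    /* iMjkP */
     NX * NY         /* ijkP  */
};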

Following on from that, I was thinking I could pad the individual dimensions of the grid up to a multiple of 16, so that the relative offsets computed from these strides stay coalesced?
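Roughly this (just a sketch of the arithmetic; the padded cells would carry zero coefficients so they contribute nothing):

#include <stdio.h>

/* Sketch of padding each grid dimension up to a multiple of 16 so that
 * the row stride and plane stride stay aligned to the half-warp size.
 * The padded cells would simply carry zero coefficients. */
static int round_up16(int n) { return ((n + 15) / 16) * 16; }

int main(void)
{
    int nxp = round_up16(200);   /* 208 */
    int nyp = round_up16(200);   /* 208 */
    int nzp = round_up16(200);   /* 208 */

    /* Offsets such as +/-nxp and +/-(nxp * nyp) are now multiples of 16,
     * so neighbour reads keep the alignment of the thread's own read. */
    printf("padded grid: %d x %d x %d, strides %d and %d\n",
           nxp, nyp, nzp, nxp, nxp * nyp);
    return 0;
}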

Another thought, since I’m limited by shared memory, was to interleave the loads from global memory into shared memory with the accumulation into z, so that the same shared memory can be reused for different values while keeping the reads coalesced?
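Something along these lines is what I’m picturing, recast with a 2D (8 x 8) block that produces one k-plane of output and cycles a single shared plane of r through k-1, k and k+1. Only a few terms are shown, halo loads and boundary checks are left out, and the names are placeholders; it is only meant to show the load / sync / accumulate / sync pattern:

#define BX 8
#define BY 8

__global__ void staged_sketch(float* cdiag, float* ck,
                              double* r, double* z, int nx, int ny)
{
    __shared__ double rplane[BY][BX];

    int i = blockIdx.x * BX + threadIdx.x;
    int j = blockIdx.y * BY + threadIdx.y;
    int k = blockIdx.z;
    int ijk = (k * ny + j) * nx + i;

    double result = 0.0;

    // Stage 1: plane k-1 in shared memory, accumulate the term below.
    rplane[threadIdx.y][threadIdx.x] = r[ijk - nx * ny];
    __syncthreads();
    result += ck[ijk - nx * ny] * rplane[threadIdx.y][threadIdx.x];
    __syncthreads();

    // Stage 2: reuse the same buffer for plane k, accumulate the diagonal.
    rplane[threadIdx.y][threadIdx.x] = r[ijk];
    __syncthreads();
    result += cdiag[ijk] * rplane[threadIdx.y][threadIdx.x];
    __syncthreads();

    // Stage 3: and again for plane k+1.
    rplane[threadIdx.y][threadIdx.x] = r[ijk + nx * ny];
    __syncthreads();
    result += ck[ijk] * rplane[threadIdx.y][threadIdx.x];

    // The in-plane and cross terms would be folded into the same stages,
    // reading neighbouring entries of rplane (which is where the halo
    // cells would come in and where the real cross-thread reuse happens).
    z[ijk] = result;
}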