Hi. I basically have the same problem as you. I have implemented an initial version of the kernel, but as I am new to CUDA I only get 1.3 GFlops. I am working to improve on that, but I do not know exactly how. I know I must have bank conflicts, but currently I do not know how to solve them. I am also thinking of using textures, at least for the vector, which should benefit from caching. I paste the kernel code below. I assume the order of the matrix is a multiple of 255. I chose this number because I use 256 threads, I want to use them all to read the row indices at the beginning for all the lines processed by the block, and the number of elements in the row index vector is the number of lines + 1.
I am probably not the best one to give advice, but regarding your code:

32 seems too small a number of threads.

You are not using shared memory to coalesce your reads; you read directly from the global arrays all the time.
I tried to coalesce my reads as much as I could … didn’t do much of a job yet …
Regards,
Serban
#include <stdio.h>
#define MAX_ELEMENTS_PER_LINE 27
__global__ void spmvx_v1_kernel(const float *d_vals, const int *d_colIndex, const int *d_rowIndex, const float *d_b, float *d_res)
{
    int iSet;

    /* Get various data */
    const int tx = threadIdx.x;
    const int bx = blockIdx.x;
    const int numThreads = blockDim.x;
    const int numLinesToProcess = numThreads - 1;
    const int numLinesProcessedOnce = numThreads / MAX_ELEMENTS_PER_LINE;
    const int numSets = numLinesToProcess / numLinesProcessedOnce;

    /* Allocate shared memory */
    __shared__ int rowIndex[256];  // Read for all lines of the block
    __shared__ int colIndex[256];  // Just for a set of 255/27 = 9 lines
    __shared__ float vals[256];    // Just for a set
    __shared__ float b[256];       // Just for a set

    /* Compute what to process */
    // First and last row processed by the block
    const int rowStart = numLinesToProcess * bx;
    const int rowEnd = rowStart + numLinesToProcess;

    /* Read the rowIndex data for all rows using all threads */
    rowIndex[tx] = d_rowIndex[rowStart + tx];
    __syncthreads();
    /* Now process the full sets of lines (the rest later) */
    for (iSet = 0; iSet < numSets; iSet++)
    {
        // Compute the locations of this set (row indices are 1-based)
        int setStart = rowIndex[iSet * numLinesProcessedOnce] - 1;
        int setEnd = rowIndex[(iSet + 1) * numLinesProcessedOnce] - 1;
        int setLength = setEnd - setStart;

        // Read the colIndex, vals and b
        if (tx < setLength)
        {
            colIndex[tx] = d_colIndex[setStart + tx] - 1;
            vals[tx] = d_vals[setStart + tx]; // values are addressed by nonzero position, not by column
            b[tx] = d_b[colIndex[tx]];        // gather the vector entries by column
        }
        __syncthreads();
        // Compute the product for all rows in the set and write it
        if (tx < numLinesProcessedOnce)
        {
            // Compute
            float sum = 0;
            int offset = iSet * numLinesProcessedOnce;
            for (int i = rowIndex[tx + offset] - 1; i < rowIndex[tx + offset + 1] - 1; i++)
                sum += vals[i - setStart] * b[i - setStart];
            // Write to global mem
            d_res[rowStart + iSet * numLinesProcessedOnce + tx] = sum;
        }
        __syncthreads(); // do not let the next set overwrite the shared arrays while they are still in use
    }
    /* Now do the same for what is left of the last set */
    {
        // Compute the locations of this set
        int setStart = rowIndex[numSets * numLinesProcessedOnce] - 1;
        int setEnd = rowIndex[numLinesToProcess] - 1;
        int setLength = setEnd - setStart;

        // Read the colIndex, vals and b
        if (tx < setLength)
        {
            colIndex[tx] = d_colIndex[setStart + tx] - 1;
            vals[tx] = d_vals[setStart + tx]; // nonzero position, not column
            b[tx] = d_b[colIndex[tx]];
        }
        __syncthreads();

        // Compute the product for all rows in the set and write it
        if (tx < numLinesToProcess - iSet * numLinesProcessedOnce)
        {
            // Compute
            float sum = 0;
            int offset = iSet * numLinesProcessedOnce;
            for (int i = rowIndex[tx + offset] - 1; i < rowIndex[tx + offset + 1] - 1; i++)
                sum += vals[i - setStart] * b[i - setStart];
            // Write to global mem
            d_res[rowStart + iSet * numLinesProcessedOnce + tx] = sum;
        }
    }
}