Uncoalesced access on matrix-by-vector multiplication


I am making an application which needs to multiply a (row) vector by a matrix (y = x*A), so I wrote this code:

__global__ void calc(float* mat, float* in, float* out, int tam) {

  int ix = blockIdx.x*blockDim.x + threadIdx.x;

  float ans = 0.f;
  int j = 0;

  // thread ix walks down column ix of mat (row-major, tam x tam)
  for (int i = ix; i < tam*tam; i += tam) {
    ans += mat[i]*in[j];
    j++;
  }

  out[ix] = ans;
}


It is correct; I’ve tested it against several matrices and vectors and it all works fine.

But when I run cudaprof on a small example (64x64), I get the following result: “gld coalesced = 128” and “gld uncoalesced = 2048”. I don’t know where these uncoalesced accesses come from.

Can someone help me?

Thank you,


PS.: The “in” is declared as constant

You will get uncoalesced access on 1.1 hardware any time threads in a half-warp don’t sequentially access contiguous 32/64/128 byte blocks of global memory which are aligned on 32/64/128 byte boundaries.

So (to my rather inexpert eye) your loads of in will never coalesce (every thread in the half-warp reads the same value at each loop iteration), and your loads of mat will only coalesce when the row stride is aligned to 64-byte boundaries.

Thank you for your answer, but from the NVIDIA CUDA Programming Guide:

That’s what I tried to do. It seems like I need to better manage my cache… any suggestion?

Thank you,


Re-read the complete chapter of the programming guide you quoted from, because you seem to have misunderstood it. Your arrays load from and store to global memory, which is uncached. Constant memory and the constant cache have no bearing on your problem.

You can probably coalesce the loads by having each thread perform staged loads into shared memory ahead of a synchronization barrier, then read from shared memory inside the loop. There is only 16k of shared memory per block, so you will have to think carefully about block sizes and (perhaps) about decomposing the operation into a smaller set of sub-calculations which can work inside the shared memory limit. There is a very good set of slides by Mark Harris from NVIDIA, from SC’08, which discuss memory coalescing strategies in detail. You might find it useful to study.
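A sketch of what that staging might look like (my own reworking of the posted kernel, untested; TILE is an assumed tile size, and I assume tam is a multiple of blockDim.x):

```cuda
#define TILE 128  /* floats of "in" staged per pass; tune to fit the 16k shared memory */

__global__ void calc_staged(float* mat, float* in, float* out, int tam) {
    __shared__ float in_s[TILE];
    int ix = blockIdx.x*blockDim.x + threadIdx.x;
    float ans = 0.f;

    for (int base = 0; base < tam; base += TILE) {
        /* Stage a tile of "in" into shared memory with coalesced loads:
           consecutive threads load consecutive elements. */
        for (int k = threadIdx.x; k < TILE && base + k < tam; k += blockDim.x)
            in_s[k] = in[base + k];
        __syncthreads();

        /* Consume the tile: reads of in_s hit shared memory, and the
           mat accesses are the same coalesced column walk as before. */
        for (int k = 0; k < TILE && base + k < tam; k++)
            ans += mat[(base + k)*tam + ix] * in_s[k];
        __syncthreads();
    }

    out[ix] = ans;
}
```

The two __syncthreads() calls are safe here because every thread in the block executes the same number of outer-loop iterations (the loop bounds depend only on tam).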