Hello everyone,
I am trying to write code that, given an array V containing N d-dimensional vectors and another d-dimensional vector x, computes the dot product of each vector in V with x. The array V is arranged so that V[0] through V[d - 1] store the components of the first vector, V[d] through V[2d - 1] those of the second, and so on. I have implemented this by launching N thread blocks, one per vector, since N is typically small in my target application (10000 at most). I may reconsider this design later, but for now I just want working code.
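To make the intended result concrete, here is a minimal CPU sketch of the computation described above. It assumes the layout just described (vector i occupies V[i * d] through V[i * d + d - 1]); the name dotpReference is hypothetical and not part of the original code. It is only meant as a baseline to compare the kernel output against, not as a definitive implementation.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical CPU reference (not part of the original code).
// V holds N vectors of dimension d stored back to back, so vector i
// occupies V[i * d] .. V[i * d + d - 1]. Element i of the returned
// vector is the dot product of vector i with x.
std::vector<float> dotpReference(const std::vector<float>& V,
                                 const std::vector<float>& x,
                                 std::size_t N, std::size_t d) {
    std::vector<float> res(N, 0.0f);
    for (std::size_t i = 0; i < N; ++i) {
        const float* v = V.data() + i * d;  // start of vector i
        float r = 0.0f;
        for (std::size_t j = 0; j < d; ++j)
            r += v[j] * x[j];
        res[i] = r;
    }
    return res;
}
```

Since the GPU may sum the products in a different order, the kernel results should be compared against this baseline with a small tolerance rather than for exact equality.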
However, I have run into a problem, and I suspect it may be because the different thread blocks are all accessing the same data, the vector x. My first attempt was this (very simple) kernel:
__global__ void dotp(float *res, const float *V, const float *x, uint d) {
    uint i = blockIdx.x;
    float r = 0.f;
    const float *v = &V[i * d];
    for (uint j = 0; j < d; ++j)
        r += v[j] * x[j];
    res[i] = r;
}
but it did not work (the reported results are incorrect). I then tried modifying the kernel available in the SDK:
#define ACCUM_N 1024
__global__ void dotpSDK(float *res, const float *V, const float *x, uint N, uint d) {
    /* Accumulators cache */
    __shared__ float accumResult[ACCUM_N];
    for (uint vec = blockIdx.x; vec < N; vec += gridDim.x) {
        uint vectorBase = __mul24(d, vec);
        for (uint iAccum = threadIdx.x; iAccum < ACCUM_N; iAccum += blockDim.x) {
            float sum = 0;
            for (uint pos = iAccum; pos < d; pos += ACCUM_N)
                sum += V[vectorBase + pos] * x[pos];
            accumResult[iAccum] = sum;
        }
        for (uint stride = ACCUM_N / 2; stride > 0; stride >>= 1) {
            __syncthreads();
            for (uint iAccum = threadIdx.x; iAccum < stride; iAccum += blockDim.x)
                accumResult[iAccum] += accumResult[stride + iAccum];
        }
        if (threadIdx.x == 0) res[vec] = accumResult[0];
    }
}
where the following lines have been changed (a simple change of variables which should not affect the overall results):
    for (uint pos = iAccum; pos < d; pos += ACCUM_N)
        sum += V[vectorBase + pos] * x[pos];
which also did not work. Is the fact that I am allowing multiple thread blocks to access the same (read-only) data what is preventing the code from working? After reading the programming guide, I was expecting only a performance penalty due to bank conflicts.
BTW, the entire source code can be found at Dotp.
Edit: the first function has been corrected.