What am I doing wrong? Thread blocks accessing the same data

I am trying to write some code that, given N d-dimensional vectors v_1, v_2, …, v_N and another d-dimensional vector x, computes the dot product of each v_i with x, for i = 1 to N. My first idea was to launch N thread blocks, so that the i-th thread block computes the dot product of v_i and x (launched as dotp<<<N, 1>>>):

__global__ void dotp(float *res, const float *V, const float *x, uint d) {
    uint i = blockIdx.x;          /* one block (one thread) per vector */
    float r = 0.f;
    const float *v = &V[i * d];   /* start of the i-th vector */
    for (uint j = 0; j < d; ++j)
        r += v[j] * x[j];
    res[i] = r;
}

where the array V is laid out so that V[0] through V[d - 1] store the components of the first vector, V[d] through V[2d - 1] those of the second, and so on. I am fairly sure this is not the best design but, for now, I just want a first version that works. However, the results computed by the above kernel are not correct. Is it because of the way x is accessed by all N thread blocks? I have also tried modifying the dot product example in the SDK (launched as dotpSDK<<<128, 256>>>):

#define ACCUM_N 1024

__global__ void dotpSDK(float *res, const float *V, const float *x, uint N, uint d) {
    /* Accumulators cache */
    __shared__ float accumResult[ACCUM_N];

    for (uint vec = blockIdx.x; vec < N; vec += gridDim.x) {
        uint vectorBase = __mul24(d, vec);

        /* Each accumulator sums a strided subset of the products */
        for (uint iAccum = threadIdx.x; iAccum < ACCUM_N; iAccum += blockDim.x) {
            float sum = 0;
            for (uint pos = iAccum; pos < d; pos += ACCUM_N)
                sum += V[vectorBase + pos] * x[pos];
            accumResult[iAccum] = sum;
        }

        /* Tree-like reduction of the accumulators in shared memory */
        for (uint stride = ACCUM_N / 2; stride > 0; stride >>= 1) {
            __syncthreads();
            for (uint iAccum = threadIdx.x; iAccum < stride; iAccum += blockDim.x)
                accumResult[iAccum] += accumResult[stride + iAccum];
        }

        if (threadIdx.x == 0) res[vec] = accumResult[0];
    }
}

so that the second operand is the vector x, by modifying (through a change of variables) only the following two lines:

    for (uint pos = iAccum; pos < d; pos += ACCUM_N)
        sum += V[vectorBase + pos] * x[pos];

but the results are still not correct. Note that this kernel also shares the vector x among all thread blocks, in a similar (but not identical) way to the first, which makes me believe this may be what is preventing the code from working. Am I right? The full source code can be found here.
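
To make "not correct" concrete, the output of either kernel can be compared against a CPU reference. The following is a minimal host-side sketch, assuming the layout of V described above (vector i occupies V[i * d] through V[i * d + d - 1]). The names dotp_reference and first_mismatch and the default tolerance of 1e-4 are illustrative choices, not part of the original code.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// CPU reference (hypothetical helper): res[i] = dot(v_i, x),
// where v_i is stored at V[i * d] .. V[i * d + d - 1] and d = x.size().
std::vector<float> dotp_reference(const std::vector<float>& V,
                                  const std::vector<float>& x,
                                  std::size_t N) {
    const std::size_t d = x.size();
    assert(V.size() == N * d);  // layout assumption: N rows of d floats each
    std::vector<float> res(N);
    for (std::size_t i = 0; i < N; ++i) {
        double acc = 0.0;  // accumulate in double to limit rounding error in the baseline
        for (std::size_t j = 0; j < d; ++j)
            acc += static_cast<double>(V[i * d + j]) * x[j];
        res[i] = static_cast<float>(acc);
    }
    return res;
}

// Returns the index of the first entry of `got` that differs from `expected`
// by more than rel_tol * max(1, |expected[i]|), or -1 if every entry matches.
// A size mismatch is reported as a mismatch at index 0.
long first_mismatch(const std::vector<float>& got,
                    const std::vector<float>& expected,
                    float rel_tol = 1e-4f) {
    if (got.size() != expected.size()) return 0;
    for (std::size_t i = 0; i < expected.size(); ++i) {
        const float scale = std::fmax(1.0f, std::fabs(expected[i]));
        // Written as !(... <= ...) so that NaN entries also count as mismatches.
        if (!(std::fabs(got[i] - expected[i]) <= rel_tol * scale))
            return static_cast<long>(i);
    }
    return -1;
}
```

After copying res back from the device (for example with cudaMemcpy) into a std::vector<float>, calling first_mismatch on it and on the output of dotp_reference shows whether all entries are wrong or only those from some index onward. Checking the value returned by cudaGetLastError() right after the launch also helps to tell a failed launch apart from a wrong result.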