I am trying to write some code that, given N d-dimensional vectors v_1, v_2, …, v_N, and another d-dimensional vector x, determines the dot products between v_i and x, from i=1 to N. My first idea was to create N thread blocks, so that the i-th thread block determines the dot product between v_i and x (called as dotp<<<N, 1>>>):

```
__global__ void dotp(float *res, const float *V, const float *x, uint d) {
uint i = blockIdx.x;
float r = 0.f;
const float *v = &V[i * d];
for (uint j = 0; j < d; ++j)
r += v[j] * x[j];
res[i] = r;
}
```

where the array V is arranged so that V[0] through V[d - 1] store the components of the first vector. I am pretty sure this is not the best design, but, right now, I am interested in coming up with an initial functional code. However, I have found that the results determined by the above kernel are not correct. Is it because of the way x is accessed by all N thread blocks? I have also tried modifying the dot product example in the SDK (called by dotpSDK<<<128, 256>>>):

```
#define ACCUM_N 1024
__global__ void dotpSDK(float *res, const float *V, const float *x, uint N, uint d) {
/* Accumulators cache */
__shared__ float accumResult[ACCUM_N];
for(uint vec = blockIdx.x; vec < N; vec += gridDim.x){
uint vectorBase = __mul24(d, vec);
for (uint iAccum = threadIdx.x; iAccum < ACCUM_N; iAccum += blockDim.x){
float sum = 0;
for (uint pos = iAccum; pos < d; pos += ACCUM_N)
sum += V[vectorBase + pos] * x[pos];
accumResult[iAccum] = sum;
}
for (uint stride = ACCUM_N / 2; stride > 0; stride >>= 1){
__syncthreads();
for (uint iAccum = threadIdx.x; iAccum < stride; iAccum += blockDim.x)
accumResult[iAccum] += accumResult[stride + iAccum];
}
if (threadIdx.x == 0) res[vec] = accumResult[0];
}
}
```

so that the second operand is the vector x, by modifying (through a change of variables) only the following two lines:

```
for (uint pos = iAccum; pos < d; pos += ACCUM_N)
sum += V[vectorBase + pos] * x[pos];
```

but the results are also not correct. Note that this kernel also shares the vector x among all thread blocks in a similar (but not identical) way to the first, which makes me believe this may be what is prevent this code from working. Am I right? The full source code can be found here.