Thread blocks accessing the same (read-only) data

Hello everyone,

I am trying to write some code that, given an array V containing N d-dimensional vectors and another d-dimensional vector (say x), computes the dot product of each vector in V with x. The array V is arranged so that V[0] through V[d - 1] store the components of the first vector, V[d] through V[2d - 1] those of the second, and so on. I have implemented this by launching N thread blocks, as N is typically small in my target application, reaching 10000 at most. I may reconsider this design later, but at the moment I am interested in getting working code.

However, I have come across a problem, and I believe that it may be because the different thread blocks are trying to access the same data, the vector x. My first attempt was this (very simple) code:

__global__ void dotp(float *res, const float *V, const float *x, uint d) {
    uint i = blockIdx.x;
    float r = 0.f;
    const float *v = &V[i * d];
    for (uint j = 0; j < d; ++j)
        r += v[j] * x[j];
    res[i] = r;
}

but it did not work (the results reported are incorrect). I have then tried to modify the kernel available in the SDK:

#define ACCUM_N 1024

__global__ void dotpSDK(float *res, const float *V, const float *x, uint N, uint d) {
    /* Accumulators cache */
    __shared__ float accumResult[ACCUM_N];

    for (uint vec = blockIdx.x; vec < N; vec += gridDim.x) {
        uint vectorBase = __mul24(d, vec);

        for (uint iAccum = threadIdx.x; iAccum < ACCUM_N; iAccum += blockDim.x) {
            float sum = 0;
            for (uint pos = iAccum; pos < d; pos += ACCUM_N)
                sum += V[vectorBase + pos] * x[pos];
            accumResult[iAccum] = sum;
        }

        for (uint stride = ACCUM_N / 2; stride > 0; stride >>= 1) {
            __syncthreads();
            for (uint iAccum = threadIdx.x; iAccum < stride; iAccum += blockDim.x)
                accumResult[iAccum] += accumResult[stride + iAccum];
        }

        if (threadIdx.x == 0) res[vec] = accumResult[0];
    }
}

where the following lines have been changed (a simple change of variables which should not affect the overall results):

for (uint pos = iAccum; pos < d; pos += ACCUM_N)
    sum += V[vectorBase + pos] * x[pos];

which also did not work. Is the fact that I am allowing multiple thread blocks to access the same (read-only) data what is preventing the code from working? After reading the programming guide, I was expecting only a performance penalty due to bank conflicts…

BTW, the entire source code can be found at Dotp.

Edit: the first function has been corrected

__global__ void dotp(float *res, const float *V, const float *x, uint d) {
    uint i = blockIdx.x;
    float r = 0.f;
    const float *v = &V[i * d];
    for (uint j = 0; j < d; ++j)
        r += v[j] * x[j];
    *res = r;
}

Only commenting on the first code you posted:

You are not using the threadIdx variable, meaning that all threads belonging to one block will access the same data and store their result in the same result location. So they will overwrite each other’s results… which anyway are identical ;)

Christian

Actually the code is called as:

dotp<<<N, 1>>>()

so that this problem does not occur.
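
For comparison, if the kernel were launched with more than one thread per block, it would have to use threadIdx.x to split the work, as Christian points out. Below is a minimal sketch of such a variant, not the code from this thread. It assumes a power-of-two block size, and the names dotpBlock, THREADS and runDotpBlock are made up for illustration:

```cuda
#include <vector>
#include <cuda_runtime.h>

#define THREADS 128  // assumed block size; must be a power of two

// One block per vector; the threads of a block split that vector's work.
// Must be launched with blockDim.x == THREADS.
__global__ void dotpBlock(float *res, const float *V, const float *x,
                          unsigned int d) {
    __shared__ float partial[THREADS];
    const float *v = &V[blockIdx.x * d];

    // Each thread accumulates a strided slice of the products.
    float sum = 0.f;
    for (unsigned int j = threadIdx.x; j < d; j += blockDim.x)
        sum += v[j] * x[j];
    partial[threadIdx.x] = sum;
    __syncthreads();

    // Tree reduction of the per-thread partial sums in shared memory.
    for (unsigned int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            partial[threadIdx.x] += partial[threadIdx.x + stride];
        __syncthreads();
    }

    // Only one thread per block writes the result for this vector.
    if (threadIdx.x == 0)
        res[blockIdx.x] = partial[0];
}

// Hypothetical host wrapper (error checking omitted to keep the sketch short).
std::vector<float> runDotpBlock(const std::vector<float> &V,
                                const std::vector<float> &x, unsigned int N) {
    const unsigned int d = static_cast<unsigned int>(x.size());
    std::vector<float> res(N, 0.f);
    float *dV = nullptr, *dx = nullptr, *dRes = nullptr;
    cudaMalloc(&dV, V.size() * sizeof(float));
    cudaMalloc(&dx, x.size() * sizeof(float));
    cudaMalloc(&dRes, N * sizeof(float));
    cudaMemcpy(dV, V.data(), V.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dx, x.data(), x.size() * sizeof(float), cudaMemcpyHostToDevice);
    dotpBlock<<<N, THREADS>>>(dRes, dV, dx, d);
    // This blocking copy on the default stream waits for the kernel to finish.
    cudaMemcpy(res.data(), dRes, N * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dV);
    cudaFree(dx);
    cudaFree(dRes);
    return res;
}
```

With dotp<<<N, 1>>> none of this is needed, since each block has a single thread and writes a single result.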

Edit: actually the first function is incorrect (I made a mistake when copying it from a larger code I was using). It should be:

__global__ void dotp(float *res, const float *V, const float *x, uint d) {
    uint i = blockIdx.x;
    float r = 0.f;
    const float *v = &V[i * d];
    for (uint j = 0; j < d; ++j)
        r += v[j] * x[j];
    res[i] = r;
}

which also does not produce the correct results. This has been corrected in the first post as well…
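
A note on narrowing this down: several blocks reading the same location in global memory is safe in itself, and bank conflicts concern shared memory rather than global memory reads, so the shared vector x is unlikely to be the cause on its own. It may be worth checking the host side (allocation sizes, copies, launch errors) against a CPU reference. Here is a minimal sketch of such a check, assuming a recent CUDA toolkit. The helper names ok, dotpCpu and dotpGpu are hypothetical and are not part of the original source:

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Kernel as posted above: one block (with one thread) per vector.
__global__ void dotp(float *res, const float *V, const float *x, unsigned int d) {
    unsigned int i = blockIdx.x;
    float r = 0.f;
    const float *v = &V[i * d];
    for (unsigned int j = 0; j < d; ++j)
        r += v[j] * x[j];
    res[i] = r;
}

// Hypothetical helper: print a message and return false on any CUDA error.
static bool ok(cudaError_t err, const char *what) {
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s: %s\n", what, cudaGetErrorString(err));
        return false;
    }
    return true;
}

// CPU reference using the same layout: vector i occupies V[i*d .. i*d + d - 1].
static std::vector<float> dotpCpu(const std::vector<float> &V,
                                  const std::vector<float> &x, unsigned int N) {
    const size_t d = x.size();
    std::vector<float> res(N, 0.f);
    for (size_t i = 0; i < N; ++i)
        for (size_t j = 0; j < d; ++j)
            res[i] += V[i * d + j] * x[j];
    return res;
}

// Runs the kernel and copies the results back; returns false on any CUDA error.
static bool dotpGpu(const std::vector<float> &V, const std::vector<float> &x,
                    unsigned int N, std::vector<float> &res) {
    const unsigned int d = static_cast<unsigned int>(x.size());
    const size_t bytesV = V.size() * sizeof(float);
    const size_t bytesX = x.size() * sizeof(float);
    const size_t bytesR = N * sizeof(float);
    float *dV = nullptr, *dx = nullptr, *dRes = nullptr;
    res.assign(N, 0.f);

    bool good = ok(cudaMalloc(&dV, bytesV), "cudaMalloc V") &&
                ok(cudaMalloc(&dx, bytesX), "cudaMalloc x") &&
                ok(cudaMalloc(&dRes, bytesR), "cudaMalloc res") &&
                ok(cudaMemcpy(dV, V.data(), bytesV, cudaMemcpyHostToDevice), "copy V") &&
                ok(cudaMemcpy(dx, x.data(), bytesX, cudaMemcpyHostToDevice), "copy x");
    if (good) {
        dotp<<<N, 1>>>(dRes, dV, dx, d);
        good = ok(cudaGetLastError(), "kernel launch") &&
               ok(cudaDeviceSynchronize(), "kernel execution") &&
               ok(cudaMemcpy(res.data(), dRes, bytesR, cudaMemcpyDeviceToHost), "copy res");
    }
    cudaFree(dV);   // cudaFree on a null pointer is a no-op
    cudaFree(dx);
    cudaFree(dRes);
    return good;
}
```

If dotpGpu reports an error, or its output differs from dotpCpu by more than rounding, the problem more likely lies in the host setup than in the blocks sharing x.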