Possible memory conflicts?

I am attempting to write a kernel that performs the following function on a 2D array:

For each row, divide all colums by column 0, sum the result, take the inverse of the result, and place the result in column 0 of the output. Next, do the same thing for column 1 and place the results in column 1 output. For example, if there is a matrix TxK, where K=3, then the calculation for out[0,0] would be:

out[0,0] = 1 / (in[0,0]/in[0,0] + in[0,1]/in[0,0] + in[0,2]/in[0,0])

… …

out[T,0] = 1 / (in[T,0]/in[T,0] + in[T,1]/in[T,0] + in[T,2]/in[T,0])

Currently, I have implemented the following kernel, which I loop over K times.

__global__ void updateUFCM(float *U, int nT, int nK) {

	int kk = blockIdx.x * blockDim.x + threadIdx.x;

	int tt = blockIdx.y * blockDim.y + threadIdx.y;

	if (tt < nT && kk < nK) {

		float sum = 0.f;

		float dk = tex2D(tDX, kk, tt);

		float di = 0.f;

		for (int ii=0; ii < nK; ii++) {

			di = tex2D(tDX, ii, tt);

			sum += ( dk/di );

		}

		float u = 1.0f / sum;

		U[tt * nK + kk] = u;

	}

}

Columns 1 and 2 are working out fine. However, column 0 (which should be all 1s in the case of the K=0 loop) has values that are not 1 in random places.

Could there be memory problems with threads trying to access the same memory address within a block? I am planning to reimplement the code so that the grid.x dimension will be set to the number of Ks. However, this will limit the K value to 512, which would be OK for now.

Is there anything that jumps out as being a really bad idea, other than the fact that it is a naive newbie approach?

I have checked the input matrix, and it appears fine.

Thanks.

It is probably the texture reads. The 0 column might well be “interpolating off the edge of the earth”, depending on how you have configured texture. Those (x,y) tuples for the 2D texture are supposed to be floats, and are notionally cell centred, so there could be some integer truncation issues which mean the reads are not return you are hoping they return, especially at the boundaries of the texture data.

Thanks for your reply. I wondered about that, and made a version that just reads from global memory. In the global memory version, I also split the kernel into two kernels. One to do the division, and one for the reduction. This is how I noticed the values != 1 in column 0.

Unfortunately, the problem was there with global memory as well.

I’m using CUDA 2.2 running on CentOS.

I noticed that changing the block size seems to affect where the bad values appear. I started with 16x16 blocks, which had a bad value in say row 8. When I used a 4x4 block just to try something close to 3, the bad values changed to different rows.

I am calling the kernel like this. The texture memory version doesn’t require a loop on the CPU. I was confusing that version with the global memory version. Sorry if that was confusing.

dim3 threads(16,16);

dim3 grid;

grid.x = static_cast<int>( ceil( static_cast<float>(self->nK) / threads.x));

grid.y = static_cast<int>( ceil( static_cast<float>(self->nT) / threads.y));

updateUFCM<<<grid, threads>>>(d_U, self->nT, self->nK);

I was wondering if the matrix width of 3 could be causing some problems.

Alright, time to fess up. I was using <= in a spot where the operator should have been <. So in effect, you were right avidday. It was reading off the end of the array. Thanks.