gst_incoherent

Using Nvidia’s visual profiler, all my global stores are incoherent. The outFrame function is allocated with cudaMallocPitch.

host code
dim3 dimGrid(10, 16, 1);
dim3 dimBlock(16, 16, 1);
cudaMallocPitch((void**)&dresult, &pitch, 3202, 256);
kernel code
// outFrame is a short2

globalpos = (threadIdx.y + BLOCKSIZE*blockIdx.y)pitch/sizeof(short2);
globalpos += (threadIdx.x + blockIdx.x
BLOCKSIZE);
outFrame[globalpos] = tmp;

Why are the writes incoherent?

When I comment out all the code in the kernel except the store, the stores are coalesced. Looking at the .ptx IL, I can see that two separate stores are done instead of one store with a short2 vector. So the question, geared mainly to Nvidia, could there be a reason for the compiler to use non-coalesced memory accesses or is this definitely a bug?

What types are globalpos and tmp? globalpos = *short2 and tmp = short2?

Is pitch in bytes or short2’s? (It’s in bytes, right?)

Odd, there have been two posts about this in the last couple days, one with uchar4 and one short2. I myself have never seen this behavior, but then I’ve only ever used float4s or floats.

As you say, it seems to be a compiler bug. Please post a minimal source code file that exhibits this behavior. Someone from NVIDIA will probably read this and submit it as a bug, and if not I will.

In the meantime, try this just before the write:

short2 write_tmp;

write_tmp.x = tmp.x;

write_tmp.y = tmp.y;

outFrame[globalpos] = write_tmp;

It might just fool the compiler into performing a single store. Then again, it might not. But it is worth a try.

Here’s the full source code for the kernel.

global void
Convolution(short2* inFrame, short2* outFrame, int pitch)
{
shared short2 shared[(BLOCKSIZE+2)(BLOCKSIZE+2)];
int sharedpos = 1 + threadIdx.x + (threadIdx.y + 1)18;
int globalpos = 16 + blockIdx.x
BLOCKSIZE + threadIdx.x;
globalpos += (1 + threadIdx.y + blockIdx.y
BLOCKSIZE)*DIFFROWWIDTH;
// read in the core block
shared[sharedpos] = inFrame[globalpos];
// read in the top row
if (threadIdx.y == 0)
{
shared[1 + threadIdx.x] = inFrame[globalpos-DIFFROWWIDTH];
}
// read in the bottom row
if (threadIdx.y == BLOCKSIZE - 1)
shared[(BLOCKSIZE+1)*18 + 1 + threadIdx.x] = inFrame[globalpos + DIFFROWWIDTH];

// read in the left column
if (threadIdx.x == 0)
shared[(1+threadIdx.y)*18] = inFrame[globalpos - 1];
// read the right column
if (threadIdx.x == BLOCKSIZE - 1)
shared[(1+threadIdx.y)*18 + 17] = inFrame[globalpos+1];
// read the four corners
if (threadIdx.x == 0 && threadIdx.y == 0)
shared[0] = inFrame[globalpos - (DIFFROWWIDTH + 1)];
if (threadIdx.x == BLOCKSIZE - 1 && threadIdx.y == 0)
shared[17] = inFrame[globalpos - (DIFFROWWIDTH - 1)];
if (threadIdx.x == 0 && threadIdx.y == BLOCKSIZE - 1)
shared[(BLOCKSIZE + 1)18] = inFrame[globalpos + (DIFFROWWIDTH - 1)];
if (threadIdx.x == BLOCKSIZE - 1 && threadIdx.y == BLOCKSIZE - 1)
shared[(BLOCKSIZE+2)
(BLOCKSIZE+2) - 1] = inFrame[globalpos + (DIFFROWWIDTH + 1)];
__syncthreads();

int result;
short2 tmp;
// calculate tmp.x
result = __mul24(13, shared[sharedpos].x);
result += __mul24(10, shared[sharedpos - 1].y);
result += __mul24(10, shared[sharedpos].y);
result += __mul24(10, shared[sharedpos - 18].x);
result += __mul24(10, shared[sharedpos + 18].x);
result += shared[sharedpos-18].y;
result += shared[sharedpos - 19].y;
result += shared[sharedpos + 17].y;
result += shared[sharedpos + 18].y;
result /= 57;
tmp.x = (short)result;
// calculate tmp.y
result = __mul24(13, shared[sharedpos].y);
result += __mul24(10, shared[sharedpos].x);
result += __mul24(10, shared[sharedpos+1].x);
result += __mul24(10, shared[sharedpos - 18].y);
result += __mul24(10, shared[sharedpos + 18].y);
result += shared[sharedpos-18].x;
result += shared[sharedpos - 17].x;
result += shared[sharedpos + 18].x;
result += shared[sharedpos + 19].x;
result /= 57;
tmp.y = (short)result;
// assign result to outFrame
globalpos = (threadIdx.y + blockDim.y*blockIdx.y)pitch/sizeof(short2);
globalpos += (threadIdx.x + blockIdx.x
BLOCKSIZE);
outFrame[globalpos] = tmp;
}

…and here is the link to ‘uchar 4’ uncoalesced store / optimisation bug:

http://forums.nvidia.com/index.php?showtopic=62961