gst_incoherent

chris22 · March 24, 2008, 2:17am

Using Nvidia’s visual profiler, all my global stores are incoherent. The outFrame function is allocated with cudaMallocPitch.

host code
dim3 dimGrid(10, 16, 1);
dim3 dimBlock(16, 16, 1);
cudaMallocPitch((void**)&dresult, &pitch, 3202, 256);
kernel code
// outFrame is a short2
globalpos = (threadIdx.y + BLOCKSIZE*blockIdx.y)pitch/sizeof(short2);
globalpos += (threadIdx.x + blockIdx.xBLOCKSIZE);
outFrame[globalpos] = tmp;

Why are the writes incoherent?

chris22 · March 24, 2008, 5:03am

When I comment out all the code in the kernel except the store, the stores are coalesced. Looking at the .ptx IL, I can see that two separate stores are done instead of one store with a short2 vector. So the question, geared mainly to Nvidia, could there be a reason for the compiler to use non-coalesced memory accesses or is this definitely a bug?

kristleifur · March 24, 2008, 11:59am

What types are globalpos and tmp? globalpos = *short2 and tmp = short2?

Is pitch in bytes or short2’s? (It’s in bytes, right?)

MisterAnderson42 · March 24, 2008, 1:19pm

Odd, there have been two posts about this in the last couple days, one with uchar4 and one short2. I myself have never seen this behavior, but then I’ve only ever used float4s or floats.

As you say, it seems to be a compiler bug. Please post a minimal source code file that exhibits this behavior. Someone from NVIDIA will probably read this and submit it as a bug, and if not I will.

In the meantime, try this just before the write:

short2 write_tmp;

write_tmp.x = tmp.x;

write_tmp.y = tmp.y;

outFrame[globalpos] = write_tmp;

It might just fool the compiler into performing a single store. Then again, it might not. But it is worth a try.

chris22 · March 24, 2008, 5:37pm

Here’s the full source code for the kernel.

global void
Convolution(short2* inFrame, short2* outFrame, int pitch)
{
shared short2 shared[(BLOCKSIZE+2)(BLOCKSIZE+2)];
int sharedpos = 1 + threadIdx.x + (threadIdx.y + 1)18;
int globalpos = 16 + blockIdx.xBLOCKSIZE + threadIdx.x;
globalpos += (1 + threadIdx.y + blockIdx.yBLOCKSIZE)*DIFFROWWIDTH;
// read in the core block
shared[sharedpos] = inFrame[globalpos];
// read in the top row
if (threadIdx.y == 0)
{
shared[1 + threadIdx.x] = inFrame[globalpos-DIFFROWWIDTH];
}
// read in the bottom row
if (threadIdx.y == BLOCKSIZE - 1)
shared[(BLOCKSIZE+1)*18 + 1 + threadIdx.x] = inFrame[globalpos + DIFFROWWIDTH];

// read in the left column
if (threadIdx.x == 0)
shared[(1+threadIdx.y)*18] = inFrame[globalpos - 1];
// read the right column
if (threadIdx.x == BLOCKSIZE - 1)
shared[(1+threadIdx.y)*18 + 17] = inFrame[globalpos+1];
// read the four corners
if (threadIdx.x == 0 && threadIdx.y == 0)
shared[0] = inFrame[globalpos - (DIFFROWWIDTH + 1)];
if (threadIdx.x == BLOCKSIZE - 1 && threadIdx.y == 0)
shared[17] = inFrame[globalpos - (DIFFROWWIDTH - 1)];
if (threadIdx.x == 0 && threadIdx.y == BLOCKSIZE - 1)
shared[(BLOCKSIZE + 1)18] = inFrame[globalpos + (DIFFROWWIDTH - 1)];
if (threadIdx.x == BLOCKSIZE - 1 && threadIdx.y == BLOCKSIZE - 1)
shared[(BLOCKSIZE+2)(BLOCKSIZE+2) - 1] = inFrame[globalpos + (DIFFROWWIDTH + 1)];
__syncthreads();

int result;
short2 tmp;
// calculate tmp.x
result = __mul24(13, shared[sharedpos].x);
result += __mul24(10, shared[sharedpos - 1].y);
result += __mul24(10, shared[sharedpos].y);
result += __mul24(10, shared[sharedpos - 18].x);
result += __mul24(10, shared[sharedpos + 18].x);
result += shared[sharedpos-18].y;
result += shared[sharedpos - 19].y;
result += shared[sharedpos + 17].y;
result += shared[sharedpos + 18].y;
result /= 57;
tmp.x = (short)result;
// calculate tmp.y
result = __mul24(13, shared[sharedpos].y);
result += __mul24(10, shared[sharedpos].x);
result += __mul24(10, shared[sharedpos+1].x);
result += __mul24(10, shared[sharedpos - 18].y);
result += __mul24(10, shared[sharedpos + 18].y);
result += shared[sharedpos-18].x;
result += shared[sharedpos - 17].x;
result += shared[sharedpos + 18].x;
result += shared[sharedpos + 19].x;
result /= 57;
tmp.y = (short)result;
// assign result to outFrame
globalpos = (threadIdx.y + blockDim.y*blockIdx.y)pitch/sizeof(short2);
globalpos += (threadIdx.x + blockIdx.xBLOCKSIZE);
outFrame[globalpos] = tmp;
}

m11 · March 24, 2008, 7:27pm

…and here is the link to ‘uchar 4’ uncoalesced store / optimisation bug:

[url=“The Official NVIDIA Forums | NVIDIA”]The Official NVIDIA Forums | NVIDIA

Topic		Replies	Views
Problem with coalesced memory access CUDA Programming and Performance	2	2788	June 23, 2008
Isn't that Coalesced?! writing to global memory in a coalesced way CUDA Programming and Performance	9	10232	June 28, 2009
cuda profiler error about coalesced store CUDA Programming and Performance	2	1110	January 6, 2010
Help with coalescing CUDA Programming and Performance	0	2726	March 12, 2008
read from global mem vs write to global mem CUDA Programming and Performance	13	6512	January 22, 2009
An example of coalesced memory access CUDA Programming and Performance	2	3685	June 28, 2010
Aligned structures problem CUDA Programming and Performance	1	1298	July 31, 2008
Need help on non-coalesced access CUDA Programming and Performance	0	1137	May 9, 2009
Uncoalesced reads; Coalesced writes Same access pattern; differenct coalesced I/O outcome? CUDA Programming and Performance	5	3265	December 12, 2011
Problems writing ConvolutionResult back to global memory Optimizing CUDA by coalescing CUDA Programming and Performance	2	2301	March 24, 2009

gst_incoherent

Related topics