Memory Coalescing

Hey all,

I’ve been trying to get a 3-D code to coalesce, but it gives me an error at line 23: “identifier “__eh_curr_region” is undefined”. My question is, why isn’t it enough for me to separate “x-, y-, and zBlock” from “x/y/z”? That should already ensure that no memory banks get out of order, but instead the code won’t even execute. Any ideas?

Thanks to anyone who takes a look at this.

Karl

[codebox]__global__ void FILENAME(float *VAR_device, float *ANS_device, int N)
{
    __shared__ float block[N*N*N];

    int xBlock = blockIdx.x*blockDim.x;
    int yBlock = blockIdx.y*blockDim.y;
    int zBlock = blockIdx.z*blockDim.z;

    int x = xBlock + threadIdx.x;
    int y = yBlock + threadIdx.y;
    int z = zBlock + threadIdx.z;

    block[z*N*N + y*N + x] = VAR_device[z*N*N + y*N + x];
    ANS_device[z*N*N + y*N + x] = block[z*N*N + y*N + x] + 1;
}[/codebox]

Grids are only two-dimensional. You can have a three-dimensional block of threads (maximum dimensions of x=512, y=512, and z=64, with a total of at most 512 threads per block), but grids are limited to 65535x65535.
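
Since the grid only has x and y, a common workaround is to fold the z dimension into one of the grid axes and recover it inside the kernel. A minimal sketch (the names are mine, not from your code):

[codebox]__global__ void fold_z(const float *in, float *out, int N, int D)
{
    // Assumes the grid was launched with gridDim.y = D * (N / blockDim.y),
    // i.e. all D z-slices are stacked along the grid's y axis.
    int blocksPerSlice = gridDim.y / D;
    int z = blockIdx.y / blocksPerSlice;
    int y = (blockIdx.y % blocksPerSlice)*blockDim.y + threadIdx.y;
    int x = blockIdx.x*blockDim.x + threadIdx.x;

    out[z*N*N + y*N + x] = in[z*N*N + y*N + x] + 1.0f;
}[/codebox]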

You can’t use the non-constant N value when declaring the array. Declare your shared array as extern and pass its size as the third kernel execution parameter: kernel<<<griddim, blockdim, N*N*N*sizeof(float)>>>(…)
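
In other words, something like this (a sketch assuming your original kernel shape):

[codebox]__global__ void FILENAME(float *VAR_device, float *ANS_device, int N)
{
    extern __shared__ float block[];   // size supplied at launch, not compiled in

    int x = blockIdx.x*blockDim.x + threadIdx.x;
    int y = blockIdx.y*blockDim.y + threadIdx.y;
    int z = threadIdx.z;               // the grid is 2-D, so z comes from the block

    block[z*N*N + y*N + x] = VAR_device[z*N*N + y*N + x];
    ANS_device[z*N*N + y*N + x] = block[z*N*N + y*N + x] + 1;
}

// launch: the third execution parameter is the shared allocation in bytes
// FILENAME<<<dimGrid, dimBlock, N*N*N*sizeof(float)>>>(VAR_device, ANS_device, N);[/codebox]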

N.

Thanks, guys; I believe I have the threads coalesced. That brings up two more questions, though:

First, the size of my 3-D matrix is limited to 8x8x8. However, I understand that there are 16,384 bytes of shared memory available per block; does this mean that I could use, for example, 1024 blocks in the grid and thus have 16.777 MB of shared memory to work with? (I suspect the answer is “no,” since having 16 MB of shared memory seems a little too convenient considering that there’s only 512 MB of total memory.) But if the answer is “yes,” how do I put 16,384 bytes of memory on each block and get that hypothetical 16.777 MB?

Second, I ran the 8x8x8 case both coalesced and uncoalesced, and both took 0.19 seconds for 5000 iterations. I assume they’re the same speed because a 512-element matrix is so small (plus the fact that it takes about a tenth of a second for CUDA to set everything up), but is there another reason?

It is true you get roughly 16 KB per block of shared memory (the kernel execution parameters also sit in shared memory, so you lose a handful of bytes to those), but that 16 KB is completely local to each block and, most importantly, only has the lifetime of the block it is associated with. So while you “virtually” have 1024 blocks with 16 KB of shared memory each, only the running block or blocks (a small fraction of the total scheduled) actually have data sitting in the silicon at any given moment. Any caching or staging of data in shared memory has to be local to a block and must fit into 16 KB. No way around that on any current hardware, I am afraid.
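
To make the scope concrete, here is the staging pattern that the 16 KB limit implies, as a minimal sketch (the tile size and names are illustrative):

[codebox]#define TS 16   // tile edge: 16*16 floats = 1 KB of the 16 KB budget

__global__ void stage_and_add(const float *in, float *out, int width)
{
    __shared__ float tile[TS][TS];      // exists only for this block's lifetime

    int x = blockIdx.x*TS + threadIdx.x;
    int y = blockIdx.y*TS + threadIdx.y;

    tile[threadIdx.y][threadIdx.x] = in[y*width + x];   // stage this block's piece
    __syncthreads();                    // visible to this block's threads only

    out[y*width + x] = tile[threadIdx.y][threadIdx.x] + 1.0f;
}[/codebox]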

Without knowing how you are doing the timing, it is impossible to say. I would be very surprised if the coalesced version weren’t measurably faster than the “naive” version. You should probably break out the profiler and have a look in more detail: it will confirm the kernel timings and show whether there are uncoalesced loads or stores.
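
For what it is worth, one way to time just the kernel, excluding the startup cost you mention, is with CUDA events (the launch line here is a placeholder for yours):

[codebox]cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
FILENAME<<<dimGrid, dimBlock>>>(VAR_device, ANS_device, N);   // your kernel
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);                  // wait for the kernel to finish

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);      // elapsed time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);[/codebox]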

I did some more work on the coalescing; I now have a 12x12x12 running at 600+ MFLOPS. That is, as they say, progress. So thanks a bunch.

Of course, scaling a concept up is never as simple as it seems :wacko:. The code below is my attempt to run a 12x12x16 matrix for 100,000 iterations. The only change is that its configuration is now “NxNxQ” rather than “NxNxN,” but running it produces garbled answers. Is it even possible to run a coalesced matrix with non-equal dimensions (i.e., not “NxNxN”), or am I just missing something?

[codebox]__global__ void FILENAME(float *VAR_device, float *ANS_device, int N, int D, int nIterations)
{
    const int P = 12;   // x and y extent of the dataset
    const int Q = 16;   // z extent of the dataset

    __shared__ float block[Q*P*P];

    int xBlock = blockIdx.x*blockDim.x;
    int yBlock = blockIdx.y*blockDim.y;
    int zBlock = blockIdx.z*blockDim.z;

    int x = xBlock + threadIdx.x;
    int y = yBlock + threadIdx.y;
    int z = zBlock + threadIdx.z;

    for (int k = 0; k < nIterations; k++)
    {
        block[z*D*N + y*N + x] = VAR_device[z*D*N + y*N + x];
        ANS_device[z*D*N + y*N + x] = block[z*D*N + y*N + x] + 1;

        // swap input and output pointers for the next iteration
        float *temp = ANS_device;
        ANS_device = VAR_device;
        VAR_device = temp;
    }
}

int main()
{
    float *ANS_device, *VAR_device;

    const int N = 12;            // x and y dimension
    const int D = 16;            // z dimension
    int dimA = D*N*N;            // total number of elements
    int nIterations = 100000;
    int a = 16;                  // block dimension in x
    int b = 12;                  // block dimension in y

    float VAR_host[D][N][N], ANS_host[D][N][N];

    cudaMalloc((void **)(&ANS_device), dimA*sizeof(float));
    cudaMalloc((void **)(&VAR_device), dimA*sizeof(float));

    for (int i = 0; i < N; i++)
    {
        for (int j = 0; j < N; j++)
        {
            for (int k = 0; k < N; k++)
            {
                VAR_host[i][j][k] = float(j)*float(i) + 1;
            }
        }
    }

    cudaMemcpy(ANS_device, VAR_host, dimA*sizeof(float), cudaMemcpyHostToDevice);

    dim3 dimGrid(N*N/b, D/a);
    dim3 dimBlock(a, b);

    FILENAME <<< dimGrid, dimBlock, N*N*D*sizeof(float) >>> (ANS_device, VAR_device, N, D, nIterations);

    cudaMemcpy(ANS_host, ANS_device, N*N*D*sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(VAR_device);
    cudaFree(ANS_device);

    return 0;
}[/codebox]

You can coalesce with any block dimensions you want, as long as the access to the “major” (fastest-varying) dimension of your storage resolves into the requisite 16-word-aligned, 16-word segments. My 3D finite difference kernels usually run with a 32x16 block and use coalesced loads into a 32x16x3, 5, or 7 shared memory area from a much larger column-major-ordered global memory array.
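
As a rough sketch of what that looks like (the sizes and names here are illustrative, and it assumes nx is a multiple of 16 so the rows stay aligned):

[codebox]#define BX 32   // threads in x
#define BY 16   // threads in y
#define K   3   // slab depth: 3, 5, or 7 in practice

__global__ void load_slab(const float *g_in, float *g_out, int nx, int ny, int z0)
{
    __shared__ float slab[K][BY][BX];   // 3*16*32*4 bytes = 6 KB, well under 16 KB

    int x = blockIdx.x*BX + threadIdx.x;
    int y = blockIdx.y*BY + threadIdx.y;

    // x is the fastest-varying index in global memory, so each half-warp of
    // 16 consecutive threadIdx.x values reads one aligned 16-word segment.
    for (int k = 0; k < K; k++)
        slab[k][threadIdx.y][threadIdx.x] = g_in[(z0 + k)*nx*ny + y*nx + x];

    __syncthreads();

    // ... the finite difference work on slab[][][] goes here; write one result
    g_out[(z0 + K/2)*nx*ny + y*nx + x] = slab[K/2][threadIdx.y][threadIdx.x];
}[/codebox]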

OK, what I have now is a 12x12x16 matrix that the kernel seems to be taking 8x8x16 chunks from. I say “seems to be” because the only things I changed from the code posted above are the “__shared__ float block[Q*P*P]” declaration (where Q=16 and P=8) and the size of my dataset and block/grid dimensions.

Furthermore, all but a couple of dataset sizes and grid/block dimensions fail. When the code works, I can input any dataset and algorithm I please; when it doesn’t, it simply returns my input data as the answer.

This brings me to what you said about being able to coalesce blocks of varying dimensions. How exactly is this implemented? Specifically, what code do you use to let your kernel run on 32x16xN segments, and how do you pull those pieces out of the dataset?