Memory Coalescing

Hey all,

I’ve been trying to get a 3-D code to coalesce, but it gives me an error at line 23: “identifier “__eh_curr_region” is undefined”. My question is, why isn’t it enough for me to separate “x-, y-, and zBlock” from “x/y/z”? That should already ensure that no memory banks get out of order, but instead the code won’t even execute. Any ideas?

Thanks to anyone who takes a look at this.

Karl

[codebox]__global__ void FILENAME(float *VAR_device, float *ANS_device, int N)
{
    __shared__ float block[N*N*N];

    int xBlock = blockIdx.x*blockDim.x;
    int yBlock = blockIdx.y*blockDim.y;
    int zBlock = blockIdx.z*blockDim.z;

    int x = xBlock + threadIdx.x;
    int y = yBlock + threadIdx.y;
    int z = zBlock + threadIdx.z;

    block[z*N*N + y*N + x] = VAR_device[z*N*N + y*N + x];
    ANS_device[z*N*N + y*N + x] = block[z*N*N + y*N + x] + 1;
}[/codebox]

Grids are only two-dimensional. You can have a three-dimensional block of threads (maximum dimensions of x=512, y=512, and z=64, with a total of at most 512 threads per block), but grids are limited to 65535x65535.
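
Since the grid only has x and y, a common workaround is to fold the z dimension into one of the grid axes and recover it inside the kernel. A minimal sketch (the names are mine, not from your code):

[codebox]__global__ void fold_z(const float *in, float *out, int N, int D)
{
    // Assumes the grid was launched with gridDim.y = D * (N / blockDim.y),
    // i.e. all D z-slices are stacked along the grid's y axis.
    int blocksPerSlice = gridDim.y / D;
    int z = blockIdx.y / blocksPerSlice;
    int y = (blockIdx.y % blocksPerSlice)*blockDim.y + threadIdx.y;
    int x = blockIdx.x*blockDim.x + threadIdx.x;

    out[z*N*N + y*N + x] = in[z*N*N + y*N + x] + 1.0f;
}[/codebox]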

You can’t use the non-constant N value when declaring the array. Declare your shared array as extern and pass its size as the third kernel execution parameter: kernel<<<griddim, blockdim, N*N*N*sizeof(float)>>>(…)
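
In other words, something like this (a sketch assuming your original kernel shape):

[codebox]__global__ void FILENAME(float *VAR_device, float *ANS_device, int N)
{
    extern __shared__ float block[];   // size supplied at launch, not compiled in

    int x = blockIdx.x*blockDim.x + threadIdx.x;
    int y = blockIdx.y*blockDim.y + threadIdx.y;
    int z = threadIdx.z;               // the grid is 2-D, so z comes from the block

    block[z*N*N + y*N + x] = VAR_device[z*N*N + y*N + x];
    ANS_device[z*N*N + y*N + x] = block[z*N*N + y*N + x] + 1;
}

// launch: the third execution parameter is the shared allocation in bytes
// FILENAME<<<dimGrid, dimBlock, N*N*N*sizeof(float)>>>(VAR_device, ANS_device, N);[/codebox]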

N.

Thanks, guys; I believe I have the threads coalesced. That brings up two more questions, though:

First, the size of my 3-D matrix is limited to 8x8x8. However, I understand that there are 16,384 bytes of shared memory available per block; does this mean that I could use, for example, 1024 blocks in the grid and thus have 16.777 MB of shared memory to work with? (I suspect the answer is “no,” since having 16 MB of shared memory seems a little too convenient considering that there’s only 512 MB of total memory.) But if the answer is “yes,” how do I put 16,384 bytes of memory on each block and get that hypothetical 16.777 MB?

Second, I ran the 8x8x8 case both coalesced and uncoalesced, and both took 0.19 seconds for 5000 iterations. I assume they’re the same speed because a 512-element matrix is so small (plus the fact that it takes about a tenth of a second for CUDA to set everything up), but is there another reason?

It is true you get roughly 16 KB per block of shared memory (the kernel execution parameters also sit in shared memory, so you lose a handful of bytes to those), but that 16 KB is completely local to each block and, most importantly, only has the lifetime of the block it is associated with. So while you “virtually” have 1024 blocks with 16 KB of shared memory each, only the running block or blocks (a small fraction of the total scheduled) actually have data sitting in the silicon at any given moment. Any caching or staging of data in shared memory has to be local to a block and must fit into 16 KB. No way around that on any current hardware, I am afraid.
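
To make the scope concrete, here is the staging pattern that the 16 KB limit implies, as a minimal sketch (the tile size and names are illustrative):

[codebox]#define TS 16   // tile edge: 16*16 floats = 1 KB of the 16 KB budget

__global__ void stage_and_add(const float *in, float *out, int width)
{
    __shared__ float tile[TS][TS];      // exists only for this block's lifetime

    int x = blockIdx.x*TS + threadIdx.x;
    int y = blockIdx.y*TS + threadIdx.y;

    tile[threadIdx.y][threadIdx.x] = in[y*width + x];   // stage this block's piece
    __syncthreads();                    // visible to this block's threads only

    out[y*width + x] = tile[threadIdx.y][threadIdx.x] + 1.0f;
}[/codebox]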

Without knowing how you are doing the timing, it is impossible to say. I would be very surprised if the coalesced version weren’t measurably faster than the “naive” version. You should probably break out the profiler and have a look in more detail: it will confirm the kernel timings and show whether there are uncoalesced loads or stores.
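
For what it is worth, one way to time just the kernel, excluding the startup cost you mention, is with CUDA events (the launch line here is a placeholder for yours):

[codebox]cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);
FILENAME<<<dimGrid, dimBlock>>>(VAR_device, ANS_device, N);   // your kernel
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);                  // wait for the kernel to finish

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);      // elapsed time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);[/codebox]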

I did some more work on the coalescing; I now have a 12x12x12 running at 600+ MFLOPS. That is, as they say, progress. So thanks a bunch.

Of course, scaling a concept up is never as simple as it seems :wacko:. The code below is my attempt to run a 12x12x16 matrix for 100,000 iterations. The only change is that its configuration is now “NxNxQ” rather than “NxNxN,” but running it produces garbled answers. Is it even possible to run a coalesced matrix with non-equal dimensions (i.e., not “NxNxN”), or am I just missing something?

[codebox]__global__ void FILENAME(float *VAR_device, float *ANS_device, int N, int D, int nIterations)
{
    const int P = 12;   // x and y extent of the dataset
    const int Q = 16;   // z extent of the dataset

    __shared__ float block[Q*P*P];

    int xBlock = blockIdx.x*blockDim.x;
    int yBlock = blockIdx.y*blockDim.y;
    int zBlock = blockIdx.z*blockDim.z;

    int x = xBlock + threadIdx.x;
    int y = yBlock + threadIdx.y;
    int z = zBlock + threadIdx.z;

    for (int k = 0; k < nIterations; k++)
    {
        block[z*D*N + y*N + x] = VAR_device[z*D*N + y*N + x];
        ANS_device[z*D*N + y*N + x] = block[z*D*N + y*N + x] + 1;

        // swap input and output pointers for the next iteration
        float *temp = ANS_device;
        ANS_device = VAR_device;
        VAR_device = temp;
    }
}

int main()
{
    float *ANS_device, *VAR_device;

    const int N = 12;            // x and y dimension
    const int D = 16;            // z dimension
    int dimA = D*N*N;            // total number of elements
    int nIterations = 100000;
    int a = 16;                  // block dimension in x
    int b = 12;                  // block dimension in y

    float VAR_host[D][N][N], ANS_host[D][N][N];

    cudaMalloc((void **)(&ANS_device), dimA*sizeof(float));
    cudaMalloc((void **)(&VAR_device), dimA*sizeof(float));

    for (int i = 0; i < N; i++)
    {
        for (int j = 0; j < N; j++)
        {
            for (int k = 0; k < N; k++)
            {
                VAR_host[i][j][k] = float(j)*float(i) + 1;
            }
        }
    }

    cudaMemcpy(ANS_device, VAR_host, dimA*sizeof(float), cudaMemcpyHostToDevice);

    dim3 dimGrid(N*N/b, D/a);
    dim3 dimBlock(a, b);

    FILENAME <<< dimGrid, dimBlock, N*N*D*sizeof(float) >>> (ANS_device, VAR_device, N, D, nIterations);

    cudaMemcpy(ANS_host, ANS_device, N*N*D*sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(VAR_device);
    cudaFree(ANS_device);

    return 0;
}[/codebox]

You can coalesce with any block dimensions you want, as long as the access to the “major” (fastest-varying) dimension of your storage resolves into the requisite 16-word-aligned, 16-word segments. My 3D finite difference kernels usually run with a 32x16 block and use coalesced loads into a 32x16x3, 5, or 7 shared memory area from a much larger column-major-ordered global memory array.
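
As a rough sketch of what that looks like (the sizes and names here are illustrative, and it assumes nx is a multiple of 16 so the rows stay aligned):

[codebox]#define BX 32   // threads in x
#define BY 16   // threads in y
#define K   3   // slab depth: 3, 5, or 7 in practice

__global__ void load_slab(const float *g_in, float *g_out, int nx, int ny, int z0)
{
    __shared__ float slab[K][BY][BX];   // 3*16*32*4 bytes = 6 KB, well under 16 KB

    int x = blockIdx.x*BX + threadIdx.x;
    int y = blockIdx.y*BY + threadIdx.y;

    // x is the fastest-varying index in global memory, so each half-warp of
    // 16 consecutive threadIdx.x values reads one aligned 16-word segment.
    for (int k = 0; k < K; k++)
        slab[k][threadIdx.y][threadIdx.x] = g_in[(z0 + k)*nx*ny + y*nx + x];

    __syncthreads();

    // ... the finite difference work on slab[][][] goes here; write one result
    g_out[(z0 + K/2)*nx*ny + y*nx + x] = slab[K/2][threadIdx.y][threadIdx.x];
}[/codebox]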

OK, what I have now is a 12x12x16 matrix that the kernel seems to be taking 8x8x16 chunks from. I say “seems to be” because the only things I changed from the code posted above are the “__shared__ float block[Q*P*P]” declaration (where Q=16 and P=8) and the size of my dataset and block/grid dimensions.

Furthermore, all but a couple of dataset sizes and grid/block dimensions fail. When the code works, I can input any dataset and algorithm I please; when it doesn’t, it simply returns my input data as the answer.

This brings me to what you said about being able to coalesce blocks of varying dimensions. How exactly is this implemented? Specifically, what code do you use to let your kernel run on 32x16xN segments, and how do you pull those pieces out of the dataset?