Global memory access problem: can't figure out how to do it correctly

Hi everyone!

I’ve been banging my head against this for about a week now, so I’m counting on your help :)

Basically, I have an array of cuFloatComplex elements that I copy from host to device, do some computation on, and then copy back to the host.

This is how I call the kernel:

// setup execution parameters
int block_size = 512;
dim3 dimBlock(block_size, 1, 1);
// each block of 512 threads handles 1024 array elements, hence N/1024 blocks
int n_blocks = N/1024 + (N%1024 == 0 ? 0:1);
dim3 dimGrid(n_blocks, 1, 1);
int blockGrpSize = 2;

// execute the kernel
mykernel<<< dimGrid, dimBlock >>>(a_d, N, blockGrpSize);

[...]

/**
 * a - the array of size N - in global memory
 * N - number of elements
 * blockGrpSize - size of block group
 */
__global__ void mykernel(cuFloatComplex *a, int N, int blockGrpSize)
{
	// general purpose variables
	int x0_idx, x1_idx;
	// array stored in the block's shared memory
	__shared__ cuFloatComplex aS[1024];

	// computing the block group's left member
	// (x0_idx and x1_idx are not used yet in this first version of the kernel)
	x0_idx = (blockIdx.x >= ((blockIdx.x / blockGrpSize)*blockGrpSize + blockGrpSize/2)) ? (blockIdx.x - blockGrpSize/2) : blockIdx.x;
	// computing the block group's right member
	x1_idx = x0_idx + blockGrpSize/2;

	// assigning values to the array elements in global memory in two batches of 512
	// remember that blockDim.x = 512
	a[blockIdx.x*1024 + threadIdx.x].x = blockIdx.x * 2;
	a[blockIdx.x*1024 + threadIdx.x + blockDim.x].x = blockIdx.x * 2 + 1;

	// copy elements from global memory to shared memory
	aS[threadIdx.x] = a[blockIdx.x*1024 + threadIdx.x]; // 1st half of the block
	aS[threadIdx.x + blockDim.x] = a[blockIdx.x*1024 + threadIdx.x + blockDim.x]; // 2nd half of the block

	// copy elements back to global memory
	a[blockIdx.x*1024 + threadIdx.x] = aS[threadIdx.x];
	a[blockIdx.x*1024 + threadIdx.x + blockDim.x] = aS[threadIdx.x + blockDim.x];

	return;
}

Each block has 512 threads, and each block owns 1024 corresponding elements of the global memory array, which I copy into shared memory, do some computations on, and then copy back.
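For completeness, the surrounding host code is just the standard allocate / copy / launch / copy-back pattern, roughly like this (sketch only, error checking omitted, and a_h is just my name for the host array here):

// rough sketch of the host side (no error checking)
cuFloatComplex *a_h, *a_d;
size_t size = N * sizeof(cuFloatComplex);

a_h = (cuFloatComplex *) malloc(size);
cudaMalloc((void **) &a_d, size);

// copy the input array to the device
cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);

// kernel launch as shown above
mykernel<<< dimGrid, dimBlock >>>(a_d, N, blockGrpSize);

// copy the result back to the host and print it
cudaMemcpy(a_h, a_d, size, cudaMemcpyDeviceToHost);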

If I run the above example for N=4096 (thus 4 blocks), I basically initialize the elements of the global memory array, copy them to shared memory, do nothing to them, and then copy them back to global memory.

If I print out the resulting array, I will see the value

0 for elements with index 0 … 511 (1st half of block 0),

1 for elements with index 512 … 1023 (2nd half of block 0),

2 for elements with index 1024 … 1535 (1st half of block 1),

3 for elements with index 1536 … 2047 (2nd half of block 1),

4 for elements with index 2048 … 2559 (1st half of block 2),

5 for elements with index 2560 … 3071 (2nd half of block 2),

6 for elements with index 3072 … 3583 (1st half of block 3),

7 for elements with index 3584 … 4095 (2nd half of block 3).

All good so far, except that I need to combine the elements from these blocks. For this I use the 3rd kernel parameter, blockGrpSize.

For example:

if blockGrpSize==2 – I make the following pairs: (block0, block1), (block2, block3)

if blockGrpSize==4 – I make the following pairs: (block0, block2), (block1, block3).

Each pair's left and right block indexes are computed in the following lines:

// computing the block group's left member
x0_idx = (blockIdx.x >= ((blockIdx.x / blockGrpSize)*blockGrpSize + blockGrpSize/2)) ? (blockIdx.x - blockGrpSize/2) : blockIdx.x;
// computing the block group's right member
x1_idx = x0_idx + blockGrpSize/2;

By combining the elements of these blocks I mean replacing the right block's first-half elements with the left block's second-half elements and vice versa.

For example, if the elements in the left block are [0 0 1 1] and the ones in the right block are [2 2 3 3], by combining the blocks I would obtain [0 0 2 2] in the left block and [1 1 3 3] in the right block.
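To make the intended result concrete, here is a small CPU reference of the pairing and half-swap (a hypothetical helper for illustration only, not part of my CUDA code; it uses one value per block half, just like the [0 0 1 1] example above):

// CPU reference for the pairing and half-swap (illustration only)
// in[2*b] / in[2*b+1] stand for block b's 1st / 2nd half (one value per half)
void swap_halves_reference(const float *in, float *out, int nBlocks, int blockGrpSize)
{
    for (int b = 0; b < nBlocks; b++) {
        // left and right members of block b's group, same formula as in the kernel
        int x0 = (b >= (b / blockGrpSize) * blockGrpSize + blockGrpSize / 2)
                     ? (b - blockGrpSize / 2) : b;
        int x1 = x0 + blockGrpSize / 2;

        if (b == x0) {
            // left block: keep own 1st half, take the right block's 1st half
            out[2 * b]     = in[2 * x0];
            out[2 * b + 1] = in[2 * x1];
        } else {
            // right block: take the left block's 2nd half, keep own 2nd half
            out[2 * b]     = in[2 * x0 + 1];
            out[2 * b + 1] = in[2 * x1 + 1];
        }
    }
}

For in = {0,1,2,3,4,5,6,7} and blockGrpSize=2 this produces {0,2,1,3,4,6,5,7}, and for blockGrpSize=4 it produces {0,4,2,6,1,5,3,7}, which are exactly the outputs I list below. Note that the reference writes into a separate output array so that no half is overwritten before it is read.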

__global__ void mykernel(cuFloatComplex *a, int N, int blockGrpSize)
{
	// general purpose variables
	int x0_idx, x1_idx;
	// array stored in the block's shared memory
	__shared__ cuFloatComplex aS[1024];

	// computing the block group's left member
	x0_idx = (blockIdx.x >= ((blockIdx.x / blockGrpSize)*blockGrpSize + blockGrpSize/2)) ? (blockIdx.x - blockGrpSize/2) : blockIdx.x;
	// computing the block group's right member
	x1_idx = x0_idx + blockGrpSize/2;

	// assigning values to the array elements in global memory in two batches of 512
	// remember that blockDim.x = 512
	a[blockIdx.x*1024 + threadIdx.x].x = blockIdx.x * 2;
	a[blockIdx.x*1024 + threadIdx.x + blockDim.x].x = blockIdx.x * 2 + 1;

	if (blockIdx.x == x0_idx) {
		// code for the left block
		// 1st half of the left block
		aS[threadIdx.x] = a[x0_idx*1024 + threadIdx.x];
		// 1st half of the right block
		aS[threadIdx.x + blockDim.x] = a[x1_idx*1024 + threadIdx.x];
	} else {
		// code for the right block
		// 2nd half of the left block
		aS[threadIdx.x] = a[x0_idx*1024 + threadIdx.x + blockDim.x];
		// 2nd half of the right block
		aS[threadIdx.x + blockDim.x] = a[x1_idx*1024 + threadIdx.x + blockDim.x];
	}
	__syncthreads();

	// copy elements back to global memory
	a[blockIdx.x*1024 + threadIdx.x] = aS[threadIdx.x];
	a[blockIdx.x*1024 + threadIdx.x + blockDim.x] = aS[threadIdx.x + blockDim.x];

	return;
}

For blockGrpSize=2 I get the correct result, i.e.:

  • the block pairs are (0,1) and (2,3), so by switching the left block’s 2nd half elements with the right block’s 1st half elements I get

0 for elements with index 0 … 511 (1st half of block 0),

2 for elements with index 512 … 1023 (2nd half of block 0),

1 for elements with index 1024 … 1535 (1st half of block 1),

3 for elements with index 1536 … 2047 (2nd half of block 1),

4 for elements with index 2048 … 2559 (1st half of block 2),

6 for elements with index 2560 … 3071 (2nd half of block 2),

5 for elements with index 3072 … 3583 (1st half of block 3),

7 for elements with index 3584 … 4095 (2nd half of block 3).

For blockGrpSize=4 I should get the following result:

  • the block pairs are (0,2) and (1,3), so by switching the left block’s 2nd half elements with the right block’s 1st half elements I should get

0 for elements with index 0 … 511 (1st half of block 0),

4 for elements with index 512 … 1023 (2nd half of block 0),

2 for elements with index 1024 … 1535 (1st half of block 1),

6 for elements with index 1536 … 2047 (2nd half of block 1),

1 for elements with index 2048 … 2559 (1st half of block 2),

5 for elements with index 2560 … 3071 (2nd half of block 2),

3 for elements with index 3072 … 3583 (1st half of block 3),

7 for elements with index 3584 … 4095 (2nd half of block 3).

But instead I get

0 for elements with index 0 … 511 (1st half of block 0),

0 for elements with index 512 … 1023 (2nd half of block 0),

2 for elements with index 1024 … 1535 (1st half of block 1),

0 for elements with index 1536 … 2047 (2nd half of block 1),

0 for elements with index 2048 … 2559 (1st half of block 2),

5 for elements with index 2560 … 3071 (2nd half of block 2),

0 for elements with index 3072 … 3583 (1st half of block 3),

7 for elements with index 3584 … 4095 (2nd half of block 3).

My question is: what am I doing wrong when I copy the elements from global to shared memory? It seems to work fine for blockGrpSize=2, and I have also checked that each pair's block indexes (x0_idx and x1_idx) are computed correctly. My guess is that it's a synchronization problem?

PS: Here is the output of deviceQuery, in case you need it:

There is 1 device supporting CUDA

Device 0: "GeForce 8500 GT"
  CUDA Driver Version:                           3.20
  CUDA Runtime Version:                          3.20
  CUDA Capability Major/Minor version number:    1.1
  Total amount of global memory:                 536150016 bytes
  Multiprocessors x Cores/MP = Cores:            2 (MP) x 8 (Cores/MP) = 16 (Cores)
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 8192
  Warp size:                                     32
  Maximum number of threads per block:           512
  Maximum sizes of each dimension of a block:    512 x 512 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             256 bytes
  Clock rate:                                    0.92 GHz
  Concurrent copy and execution:                 Yes
  Run time limit on kernels:                     Yes
  Integrated:                                    No
  Support host page-locked memory mapping:       Yes
  Compute mode:                                  Default (multiple host threads can use this device simultaneously)
  Concurrent kernel execution:                   No
  Device has ECC support enabled:                No
  Device is using TCC driver mode:               No

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 3.20, CUDA Runtime Version = 3.20, NumDevs = 1, Device = GeForce 8500 GT

PASSED

Thank you for taking the time to read my post! Any answers will be much appreciated.

Andrew.

I think the problem may be here

This writes to global memory

a[blockIdx.x*1024 + threadIdx.x + blockDim.x].x = blockIdx.x * 2 + 1;

and you immediately read back from global memory here

// 1st half of the left block
aS[threadIdx.x] = a[x0_idx*1024 + threadIdx.x];
// 1st half of the right block
aS[threadIdx.x + blockDim.x] = a[x1_idx*1024 + threadIdx.x];

....

But if you have more blocks than your GPU can fit on its multiprocessors at once, then the second chunk of code can at times try to read data before it has been written, because the block that writes that data may not even have started running on an MP yet.

Imagine what will happen if there is a large number of blocks, say 20000.

I suspect those first lines of code are just setting up data for debugging; if so, do that in a separate kernel.
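Something along these lines, just a rough sketch of what I mean by moving the initialization into its own launch (the name initkernel is made up):

// rough sketch: the debug initialization runs as its own kernel launch,
// so its writes are already in global memory before mykernel starts
__global__ void initkernel(cuFloatComplex *a)
{
	a[blockIdx.x*1024 + threadIdx.x].x              = blockIdx.x * 2;
	a[blockIdx.x*1024 + threadIdx.x + blockDim.x].x = blockIdx.x * 2 + 1;
}

// host side: kernels launched on the same stream run one after the other
initkernel<<< dimGrid, dimBlock >>>(a_d);
mykernel<<< dimGrid, dimBlock >>>(a_d, N, blockGrpSize);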

Thanks for your answer,

I figured out that it was indeed an inter-block synchronization issue. The solution I found was to have each block only read/write the global memory assigned to it within one kernel launch, and then launch another kernel that reads the global memory written by the other blocks in the previous launch.
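For anyone hitting the same thing, here is a rough sketch of one way to structure it along these lines (the initialization runs as its own launch, and I give the swap kernel a separate output buffer, called a_out in the kernel and a_d2 on the host, just for this sketch, so that within one launch no block writes memory that another block still has to read):

// sketch: the swap runs in a second launch, after the initialization from the
// previous launch has finished, so every value it reads is already in global
// memory; each block writes only its own 1024-element slice of the output buffer
__global__ void swapkernel(const cuFloatComplex *a_in, cuFloatComplex *a_out, int blockGrpSize)
{
	// same pair computation as before
	int x0_idx = (blockIdx.x >= ((blockIdx.x / blockGrpSize)*blockGrpSize + blockGrpSize/2))
	                 ? (blockIdx.x - blockGrpSize/2) : blockIdx.x;
	int x1_idx = x0_idx + blockGrpSize/2;

	if (blockIdx.x == x0_idx) {
		// left block: keep own 1st half, take the right block's 1st half
		a_out[blockIdx.x*1024 + threadIdx.x]              = a_in[x0_idx*1024 + threadIdx.x];
		a_out[blockIdx.x*1024 + threadIdx.x + blockDim.x] = a_in[x1_idx*1024 + threadIdx.x];
	} else {
		// right block: take the left block's 2nd half, keep own 2nd half
		a_out[blockIdx.x*1024 + threadIdx.x]              = a_in[x0_idx*1024 + threadIdx.x + blockDim.x];
		a_out[blockIdx.x*1024 + threadIdx.x + blockDim.x] = a_in[x1_idx*1024 + threadIdx.x + blockDim.x];
	}
}

// host side: the kernel launch boundary acts as the grid-wide synchronization
// (initkernel is the separate initialization kernel suggested above)
initkernel<<< dimGrid, dimBlock >>>(a_d);                      // each block writes only its own slice of a_d
swapkernel<<< dimGrid, dimBlock >>>(a_d, a_d2, blockGrpSize);  // a_d2 is a second device buffer

The shared-memory staging isn't needed for a plain copy like this, so I left it out of the sketch.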