Global memory access problem: can't figure out how to do it correctly

Hi everyone!

I’ve been banging my head against this for about a week now, so I’m counting on your help :)

Basically, I have an array of cuFloatComplex elements that I copy from host to device, do some computation on, and then copy back to the host.

This is how I call the kernel:

// setup execution parameters
int block_size = 512;
dim3 dimBlock(block_size, 1, 1);
// each block of 512 threads handles 1024 array elements, hence N/1024 blocks
int n_blocks = N/1024 + (N%1024 == 0 ? 0:1);
dim3 dimGrid(n_blocks, 1, 1);
int blockGrpSize = 2;

// execute the kernel
mykernel<<< dimGrid, dimBlock >>>(a_d, N, blockGrpSize);

[...]

/**
 * a - the array of size N - in global memory
 * N - number of elements
 * blockGrpSize - size of block group
 */
__global__ void mykernel(cuFloatComplex *a, int N, int blockGrpSize)
{
	// general purpose variables
	int x0_idx, x1_idx;
	// array stored in the block's shared memory
	__shared__ cuFloatComplex aS[1024];

	// computing the block group's left member
	// (x0_idx and x1_idx are not used yet in this first version of the kernel)
	x0_idx = (blockIdx.x >= ((blockIdx.x / blockGrpSize)*blockGrpSize + blockGrpSize/2)) ? (blockIdx.x - blockGrpSize/2) : blockIdx.x;
	// computing the block group's right member
	x1_idx = x0_idx + blockGrpSize/2;

	// assigning values to the array elements in global memory in two batches of 512
	// remember that blockDim.x = 512
	a[blockIdx.x*1024 + threadIdx.x].x = blockIdx.x * 2;
	a[blockIdx.x*1024 + threadIdx.x + blockDim.x].x = blockIdx.x * 2 + 1;

	// copy elements from global memory to shared memory
	aS[threadIdx.x] = a[blockIdx.x*1024 + threadIdx.x]; // 1st half of the block
	aS[threadIdx.x + blockDim.x] = a[blockIdx.x*1024 + threadIdx.x + blockDim.x]; // 2nd half of the block

	// copy elements back to global memory
	a[blockIdx.x*1024 + threadIdx.x] = aS[threadIdx.x];
	a[blockIdx.x*1024 + threadIdx.x + blockDim.x] = aS[threadIdx.x + blockDim.x];

	return;
}

Each block has 512 threads, and each block owns 1024 corresponding elements of the global memory array, which I copy into shared memory, do some computations on, and then copy back.
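For completeness, the surrounding host code is just the standard allocate / copy / launch / copy-back pattern, roughly like this (sketch only, error checking omitted, and a_h is just my name for the host array here):

// rough sketch of the host side (no error checking)
cuFloatComplex *a_h, *a_d;
size_t size = N * sizeof(cuFloatComplex);

a_h = (cuFloatComplex *) malloc(size);
cudaMalloc((void **) &a_d, size);

// copy the input array to the device
cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);

// kernel launch as shown above
mykernel<<< dimGrid, dimBlock >>>(a_d, N, blockGrpSize);

// copy the result back to the host and print it
cudaMemcpy(a_h, a_d, size, cudaMemcpyDeviceToHost);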

If I run the above example for N=4096 (thus 4 blocks), I basically initialize the elements of the global memory array, copy them to shared memory, do nothing to them, and then copy them back to global memory.

If I print out the resulting array, I will see the value

0 for elements with index 0 … 511 (1st half of block 0),

1 for elements with index 512 … 1023 (2nd half of block 0),

2 for elements with index 1024 … 1535 (1st half of block 1),

3 for elements with index 1536 … 2047 (2nd half of block 1),

4 for elements with index 2048 … 2559 (1st half of block 2),

5 for elements with index 2560 … 3071 (2nd half of block 2),

6 for elements with index 3072 … 3583 (1st half of block 3),

7 for elements with index 3584 … 4095 (2nd half of block 3).

All good so far, except that I need to combine the elements from these blocks. For this I use the 3rd kernel parameter, blockGrpSize.

For example:

if blockGrpSize==2 – I make the following pairs: (block0, block1), (block2, block3)

if blockGrpSize==4 – I make the following pairs: (block0, block2), (block1, block3).

Each pair's left and right block indexes are computed in the following lines:

// computing the block group's left member
x0_idx = (blockIdx.x >= ((blockIdx.x / blockGrpSize)*blockGrpSize + blockGrpSize/2)) ? (blockIdx.x - blockGrpSize/2) : blockIdx.x;
// computing the block group's right member
x1_idx = x0_idx + blockGrpSize/2;

By combining the elements of these blocks I mean replacing the right block's first-half elements with the left block's second-half elements and vice versa.

For example, if the elements in the left block are [0 0 1 1] and the ones in the right block are [2 2 3 3], by combining the blocks I would obtain [0 0 2 2] in the left block and [1 1 3 3] in the right block.
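To make the intended result concrete, here is a small CPU reference of the pairing and half-swap (a hypothetical helper for illustration only, not part of my CUDA code; it uses one value per block half, just like the [0 0 1 1] example above):

// CPU reference for the pairing and half-swap (illustration only)
// in[2*b] / in[2*b+1] stand for block b's 1st / 2nd half (one value per half)
void swap_halves_reference(const float *in, float *out, int nBlocks, int blockGrpSize)
{
    for (int b = 0; b < nBlocks; b++) {
        // left and right members of block b's group, same formula as in the kernel
        int x0 = (b >= (b / blockGrpSize) * blockGrpSize + blockGrpSize / 2)
                     ? (b - blockGrpSize / 2) : b;
        int x1 = x0 + blockGrpSize / 2;

        if (b == x0) {
            // left block: keep own 1st half, take the right block's 1st half
            out[2 * b]     = in[2 * x0];
            out[2 * b + 1] = in[2 * x1];
        } else {
            // right block: take the left block's 2nd half, keep own 2nd half
            out[2 * b]     = in[2 * x0 + 1];
            out[2 * b + 1] = in[2 * x1 + 1];
        }
    }
}

For in = {0,1,2,3,4,5,6,7} and blockGrpSize=2 this produces {0,2,1,3,4,6,5,7}, and for blockGrpSize=4 it produces {0,4,2,6,1,5,3,7}, which are exactly the outputs I list below. Note that the reference writes into a separate output array so that no half is overwritten before it is read.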

__global__ void mykernel(cuFloatComplex *a, int N, int blockGrpSize)
{
	// general purpose variables
	int x0_idx, x1_idx;
	// array stored in the block's shared memory
	__shared__ cuFloatComplex aS[1024];

	// computing the block group's left member
	x0_idx = (blockIdx.x >= ((blockIdx.x / blockGrpSize)*blockGrpSize + blockGrpSize/2)) ? (blockIdx.x - blockGrpSize/2) : blockIdx.x;
	// computing the block group's right member
	x1_idx = x0_idx + blockGrpSize/2;

	// assigning values to the array elements in global memory in two batches of 512
	// remember that blockDim.x = 512
	a[blockIdx.x*1024 + threadIdx.x].x = blockIdx.x * 2;
	a[blockIdx.x*1024 + threadIdx.x + blockDim.x].x = blockIdx.x * 2 + 1;

	if (blockIdx.x == x0_idx) {
		// code for the left block
		// 1st half of the left block
		aS[threadIdx.x] = a[x0_idx*1024 + threadIdx.x];
		// 1st half of the right block
		aS[threadIdx.x + blockDim.x] = a[x1_idx*1024 + threadIdx.x];
	} else {
		// code for the right block
		// 2nd half of the left block
		aS[threadIdx.x] = a[x0_idx*1024 + threadIdx.x + blockDim.x];
		// 2nd half of the right block
		aS[threadIdx.x + blockDim.x] = a[x1_idx*1024 + threadIdx.x + blockDim.x];
	}
	__syncthreads();

	// copy elements back to global memory
	a[blockIdx.x*1024 + threadIdx.x] = aS[threadIdx.x];
	a[blockIdx.x*1024 + threadIdx.x + blockDim.x] = aS[threadIdx.x + blockDim.x];

	return;
}

For blockGrpSize=2 I get the correct result, i.e.:

  • the block pairs are (0,1) and (2,3), so by switching the left block’s 2nd half elements with the right block’s 1st half elements I get

0 for elements with index 0 … 511 (1st half of block 0),

2 for elements with index 512 … 1023 (2nd half of block 0),

1 for elements with index 1024 … 1535 (1st half of block 1),

3 for elements with index 1536 … 2047 (2nd half of block 1),

4 for elements with index 2048 … 2559 (1st half of block 2),

6 for elements with index 2560 … 3071 (2nd half of block 2),

5 for elements with index 3072 … 3583 (1st half of block 3),

7 for elements with index 3584 … 4095 (2nd half of block 3).

For blockGrpSize=4 I should get the following result:

  • the block pairs are (0,2) and (1,3), so by switching the left block’s 2nd half elements with the right block’s 1st half elements I should get

0 for elements with index 0 … 511 (1st half of block 0),

4 for elements with index 512 … 1023 (2nd half of block 0),

2 for elements with index 1024 … 1535 (1st half of block 1),

6 for elements with index 1536 … 2047 (2nd half of block 1),

1 for elements with index 2048 … 2559 (1st half of block 2),

5 for elements with index 2560 … 3071 (2nd half of block 2),

3 for elements with index 3072 … 3583 (1st half of block 3),

7 for elements with index 3584 … 4095 (2nd half of block 3).

But instead I get

0 for elements with index 0 … 511 (1st half of block 0),

0 for elements with index 512 … 1023 (2nd half of block 0),

2 for elements with index 1024 … 1535 (1st half of block 1),

0 for elements with index 1536 … 2047 (2nd half of block 1),

0 for elements with index 2048 … 2559 (1st half of block 2),

5 for elements with index 2560 … 3071 (2nd half of block 2),

0 for elements with index 3072 … 3583 (1st half of block 3),

7 for elements with index 3584 … 4095 (2nd half of block 3).

My question is: what am I doing wrong when I copy the elements from global to shared memory? It seems to work fine for blockGrpSize=2, and I have also checked that each pair's block indexes (x0_idx and x1_idx) are computed correctly. My guess is that it's a synchronization problem?

PS: Here is the output of deviceQuery, in case you need it:

There is 1 device supporting CUDA

Device 0: "GeForce 8500 GT"
  CUDA Driver Version:                           3.20
  CUDA Runtime Version:                          3.20
  CUDA Capability Major/Minor version number:    1.1
  Total amount of global memory:                 536150016 bytes
  Multiprocessors x Cores/MP = Cores:            2 (MP) x 8 (Cores/MP) = 16 (Cores)
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 8192
  Warp size:                                     32
  Maximum number of threads per block:           512
  Maximum sizes of each dimension of a block:    512 x 512 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             256 bytes
  Clock rate:                                    0.92 GHz
  Concurrent copy and execution:                 Yes
  Run time limit on kernels:                     Yes
  Integrated:                                    No
  Support host page-locked memory mapping:       Yes
  Compute mode:                                  Default (multiple host threads can use this device simultaneously)
  Concurrent kernel execution:                   No
  Device has ECC support enabled:                No
  Device is using TCC driver mode:               No

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 3.20, CUDA Runtime Version = 3.20, NumDevs = 1, Device = GeForce 8500 GT

PASSED

Thank you for taking the time to read my post! Any answers will be much appreciated.

Andrew.

I think the problem may be here

This writes to global memory

a[blockIdx.x*1024 + threadIdx.x + blockDim.x].x = blockIdx.x * 2 + 1;

and you immediately read back from global memory here

// 1st half of the left block
aS[threadIdx.x] = a[x0_idx*1024 + threadIdx.x];
// 1st half of the right block
aS[threadIdx.x + blockDim.x] = a[x1_idx*1024 + threadIdx.x];

....

But if you have more blocks than your GPU can fit on its multiprocessors at once, then the second chunk of code can at times try to read data before it has been written, because the block that writes that data may not even have started running on an MP yet.

Imagine what will happen if there is a large number of blocks, say 20000.

I suspect those first lines of code are just setting up data for debugging; if so, do that in a separate kernel.
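Something along these lines, just a rough sketch of what I mean by moving the initialization into its own launch (the name initkernel is made up):

// rough sketch: the debug initialization runs as its own kernel launch,
// so its writes are already in global memory before mykernel starts
__global__ void initkernel(cuFloatComplex *a)
{
	a[blockIdx.x*1024 + threadIdx.x].x              = blockIdx.x * 2;
	a[blockIdx.x*1024 + threadIdx.x + blockDim.x].x = blockIdx.x * 2 + 1;
}

// host side: kernels launched on the same stream run one after the other
initkernel<<< dimGrid, dimBlock >>>(a_d);
mykernel<<< dimGrid, dimBlock >>>(a_d, N, blockGrpSize);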

Thanks for your answer,

I figured out that it was indeed an inter-block synchronization issue. The solution I found was to have each block only read/write the global memory assigned to it within one kernel launch, and then launch another kernel that reads the global memory written by the other blocks in the previous launch.
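For anyone hitting the same thing, here is a rough sketch of one way to structure it along these lines (the initialization runs as its own launch, and I give the swap kernel a separate output buffer, called a_out in the kernel and a_d2 on the host, just for this sketch, so that within one launch no block writes memory that another block still has to read):

// sketch: the swap runs in a second launch, after the initialization from the
// previous launch has finished, so every value it reads is already in global
// memory; each block writes only its own 1024-element slice of the output buffer
__global__ void swapkernel(const cuFloatComplex *a_in, cuFloatComplex *a_out, int blockGrpSize)
{
	// same pair computation as before
	int x0_idx = (blockIdx.x >= ((blockIdx.x / blockGrpSize)*blockGrpSize + blockGrpSize/2))
	                 ? (blockIdx.x - blockGrpSize/2) : blockIdx.x;
	int x1_idx = x0_idx + blockGrpSize/2;

	if (blockIdx.x == x0_idx) {
		// left block: keep own 1st half, take the right block's 1st half
		a_out[blockIdx.x*1024 + threadIdx.x]              = a_in[x0_idx*1024 + threadIdx.x];
		a_out[blockIdx.x*1024 + threadIdx.x + blockDim.x] = a_in[x1_idx*1024 + threadIdx.x];
	} else {
		// right block: take the left block's 2nd half, keep own 2nd half
		a_out[blockIdx.x*1024 + threadIdx.x]              = a_in[x0_idx*1024 + threadIdx.x + blockDim.x];
		a_out[blockIdx.x*1024 + threadIdx.x + blockDim.x] = a_in[x1_idx*1024 + threadIdx.x + blockDim.x];
	}
}

// host side: the kernel launch boundary acts as the grid-wide synchronization
// (initkernel is the separate initialization kernel suggested above)
initkernel<<< dimGrid, dimBlock >>>(a_d);                      // each block writes only its own slice of a_d
swapkernel<<< dimGrid, dimBlock >>>(a_d, a_d2, blockGrpSize);  // a_d2 is a second device buffer

The shared-memory staging isn't needed for a plain copy like this, so I left it out of the sketch.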