Hi everyone!

It’s been about a week now that I’ve been banging my head against these issues, so I’m counting on your help :)

So, I basically have an array of complex floats (cuFloatComplex) that I copy from host to device, do some computation on the array elements, then copy back to the host.

This is how I call the kernel:

```
// setup execution parameters
int block_size = 512;
dim3 dimBlock(block_size, 1, 1);
// each block processes 1024 elements (2 per thread), hence N/1024 blocks
int n_blocks = N/1024 + (N%1024 == 0 ? 0 : 1);
dim3 dimGrid(n_blocks, 1, 1);
int blockGrpSize = 2;
// execute the kernel
mykernel<<< dimGrid, dimBlock >>>(a_d, N, blockGrpSize);
[...]
/**
 * a - the array of size N - in global memory
 * N - number of elements
 * blockGrpSize - size of block group
 */
__global__ void mykernel(cuFloatComplex *a, int N, int blockGrpSize)
{
    // general purpose variables
    int x0_idx, x1_idx;
    // array stored in the block's shared memory
    __shared__ cuFloatComplex aS[1024];
    // computing the block group's left member (not used in this first version)
    x0_idx = (blockIdx.x >= ((blockIdx.x / blockGrpSize)*blockGrpSize + blockGrpSize/2)) ? (blockIdx.x - blockGrpSize/2) : blockIdx.x;
    // computing the block group's right member (not used in this first version)
    x1_idx = x0_idx + blockGrpSize/2;
    // assigning values to the array elements in global memory, in batches of 512
    // remember that blockDim.x = 512
    a[blockIdx.x*1024 + threadIdx.x].x = blockIdx.x * 2;
    a[blockIdx.x*1024 + threadIdx.x + blockDim.x].x = blockIdx.x * 2 + 1;
    // copy elements from global memory to shared memory
    aS[threadIdx.x] = a[blockIdx.x*1024 + threadIdx.x];                           // 1st half of the block
    aS[threadIdx.x + blockDim.x] = a[blockIdx.x*1024 + threadIdx.x + blockDim.x]; // 2nd half of the block
    // copy elements back to global memory
    a[blockIdx.x*1024 + threadIdx.x] = aS[threadIdx.x];
    a[blockIdx.x*1024 + threadIdx.x + blockDim.x] = aS[threadIdx.x + blockDim.x];
}
```
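
For completeness, the host-side code I elided with [...] above is just the usual allocate / copy / launch / copy-back sequence. Here is a minimal sketch (the host array name a_h is illustrative; the kernel launch shown above goes in the middle):

```
#include <cuComplex.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const int N = 4096;
    cuFloatComplex *a_h, *a_d;

    // allocate the array on the host and on the device
    a_h = (cuFloatComplex *)malloc(N * sizeof(cuFloatComplex));
    cudaMalloc((void **)&a_d, N * sizeof(cuFloatComplex));

    // copy host -> device (the kernel overwrites the values anyway in this test)
    cudaMemcpy(a_d, a_h, N * sizeof(cuFloatComplex), cudaMemcpyHostToDevice);

    // ... kernel launch exactly as shown above ...

    // copy device -> host and print the real parts
    cudaMemcpy(a_h, a_d, N * sizeof(cuFloatComplex), cudaMemcpyDeviceToHost);
    for (int i = 0; i < N; i++)
        printf("a[%d] = %g\n", i, a_h[i].x);

    cudaFree(a_d);
    free(a_h);
    return 0;
}
```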

Each kernel block has 512 threads, and for each block there are 1024 corresponding elements in the device’s global memory array, which I copy into shared memory, do some computations on, then copy back.

If I run the above example with N=4096 (thus 4 blocks), I basically initialize the elements in the global memory array, copy them to shared memory, do nothing to them, then copy them back to global memory.

If I print out the resulting array, I see the value

0 for elements with index 0 … 511 (1st half of block 0),

1 for elements with index 512 … 1023 (2nd half of block 0),

2 for elements with index 1024 … 1535 (1st half of block 1),

3 for elements with index 1536 … 2047 (2nd half of block 1),

4 for elements with index 2048 … 2559 (1st half of block 2),

5 for elements with index 2560 … 3071 (2nd half of block 2),

6 for elements with index 3072 … 3583 (1st half of block 3),

7 for elements with index 3584 … 4095 (2nd half of block 3).
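
In other words, after this first kernel, element i should hold i/512 (integer division) in its real part. Here is a quick host-side check, assuming the result has been copied back into a_h as in the sketch above:

```
#include <cuComplex.h>
#include <stdio.h>

// returns 1 if every element i holds floor(i/512) in its real part
int check_pattern(const cuFloatComplex *a_h, int N)
{
    for (int i = 0; i < N; i++) {
        if (a_h[i].x != (float)(i / 512)) {
            printf("mismatch at %d: got %g, expected %d\n", i, a_h[i].x, i / 512);
            return 0;
        }
    }
    return 1;
}
```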

All good so far, except that I need to combine the elements from these blocks. For this I use the 3rd parameter, blockGrpSize.

For example:

if blockGrpSize==2 – I make the following pairs: (block0, block1), (block2, block3)

if blockGrpSize==4 – I make the following pairs: (block0, block2), (block1, block3).

The indices of each pair’s left and right block members are computed in the following lines:

```
// computing block group's left member
x0_idx = (blockIdx.x >= ((blockIdx.x / blockGrpSize)*blockGrpSize + blockGrpSize/2)) ? (blockIdx.x - blockGrpSize/2) : blockIdx.x;
// computing block group's right member
x1_idx = x0_idx + blockGrpSize/2;
```
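
These expressions can be sanity-checked on the host; the following plain-C loop (with b standing in for blockIdx.x) reproduces the pairs listed above for both group sizes:

```
#include <stdio.h>

int main(void)
{
    for (int grp = 2; grp <= 4; grp *= 2) {
        printf("blockGrpSize = %d:\n", grp);
        for (int b = 0; b < 4; b++) {
            // same expressions as in the kernel, with b playing blockIdx.x
            int x0 = (b >= ((b / grp) * grp + grp / 2)) ? (b - grp / 2) : b;
            int x1 = x0 + grp / 2;
            printf("  block %d -> pair (%d, %d)\n", b, x0, x1);
        }
    }
    return 0;
}
```

This prints pairs (0,1), (0,1), (2,3), (2,3) for blockGrpSize=2 and (0,2), (1,3), (0,2), (1,3) for blockGrpSize=4, so the index arithmetic itself checks out.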

By combining the elements in these blocks I mean replacing the right block’s first-half elements with the left block’s second-half elements and vice versa.

For example, if the elements in the left block are [0 0 1 1] and the ones in the right block are [2 2 3 3], by combining the blocks I would obtain [0 0 2 2] in the left block and [1 1 3 3] in the right block.
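
To make the intended operation completely explicit, here is the same “combine” step as a plain CPU sketch (bare floats instead of cuFloatComplex for brevity; half is the number of elements in a half-block):

```
// swap the left block's 2nd half with the right block's 1st half
void combine(float *left, float *right, int half)
{
    for (int i = 0; i < half; i++) {
        float tmp = left[half + i];  // left block, 2nd half
        left[half + i] = right[i];   // receives the right block's 1st half
        right[i] = tmp;              // receives the left block's 2nd half
    }
}
```

And this is the kernel that is supposed to do the same thing on the GPU: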

```
__global__ void mykernel(cuFloatComplex *a, int N, int blockGrpSize)
{
    // general purpose variables
    int x0_idx, x1_idx;
    // array stored in the block's shared memory
    __shared__ cuFloatComplex aS[1024];
    // computing the block group's left member
    x0_idx = (blockIdx.x >= ((blockIdx.x / blockGrpSize)*blockGrpSize + blockGrpSize/2)) ? (blockIdx.x - blockGrpSize/2) : blockIdx.x;
    // computing the block group's right member
    x1_idx = x0_idx + blockGrpSize/2;
    // assigning values to the array elements in global memory, in batches of 512
    // remember that blockDim.x = 512
    a[blockIdx.x*1024 + threadIdx.x].x = blockIdx.x * 2;
    a[blockIdx.x*1024 + threadIdx.x + blockDim.x].x = blockIdx.x * 2 + 1;
    if (blockIdx.x == x0_idx) {
        // code for the left block
        // 1st half of the left block
        aS[threadIdx.x] = a[x0_idx*1024 + threadIdx.x];
        // 1st half of the right block
        aS[threadIdx.x + blockDim.x] = a[x1_idx*1024 + threadIdx.x];
    } else {
        // code for the right block
        // 2nd half of the left block
        aS[threadIdx.x] = a[x0_idx*1024 + threadIdx.x + blockDim.x];
        // 2nd half of the right block
        aS[threadIdx.x + blockDim.x] = a[x1_idx*1024 + threadIdx.x + blockDim.x];
    }
    __syncthreads();
    // copy elements back to global memory
    a[blockIdx.x*1024 + threadIdx.x] = aS[threadIdx.x];
    a[blockIdx.x*1024 + threadIdx.x + blockDim.x] = aS[threadIdx.x + blockDim.x];
}
```

For blockGrpSize=2 I get the correct result, i.e.:

- the block pairs are (0,1) and (2,3), so by swapping the left block’s 2nd-half elements with the right block’s 1st-half elements I get

0 for elements with index 0 … 511 (1st half of block 0),

2 for elements with index 512 … 1023 (2nd half of block 0),

1 for elements with index 1024 … 1535 (1st half of block 1),

3 for elements with index 1536 … 2047 (2nd half of block 1),

4 for elements with index 2048 … 2559 (1st half of block 2),

6 for elements with index 2560 … 3071 (2nd half of block 2),

5 for elements with index 3072 … 3583 (1st half of block 3),

7 for elements with index 3584 … 4095 (2nd half of block 3).

For blockGrpSize=4 I should get the following result:

- the block pairs are (0,2) and (1,3), so by swapping the left block’s 2nd-half elements with the right block’s 1st-half elements I should get

0 for elements with index 0 … 511 (1st half of block 0),

4 for elements with index 512 … 1023 (2nd half of block 0),

2 for elements with index 1024 … 1535 (1st half of block 1),

6 for elements with index 1536 … 2047 (2nd half of block 1),

1 for elements with index 2048 … 2559 (1st half of block 2),

5 for elements with index 2560 … 3071 (2nd half of block 2),

3 for elements with index 3072 … 3583 (1st half of block 3),

7 for elements with index 3584 … 4095 (2nd half of block 3).

But instead I get

0 for elements with index 0 … 511 (1st half of block 0),

**0** for elements with index 512 … 1023 (2nd half of block 0),

2 for elements with index 1024 … 1535 (1st half of block 1),

**0** for elements with index 1536 … 2047 (2nd half of block 1),

**0** for elements with index 2048 … 2559 (1st half of block 2),

5 for elements with index 2560 … 3071 (2nd half of block 2),

**0** for elements with index 3072 … 3583 (1st half of block 3),

7 for elements with index 3584 … 4095 (2nd half of block 3).

My question now is: what am I doing wrong when I copy the elements from global to shared memory? It seems to work fine for blockGrpSize=2, and I have also checked that the pairs’ block indices (x0_idx and x1_idx) are computed correctly. My guess is that it’s a synchronization problem?

PS: Here is the output of deviceQuery, in case it helps:

```
There is 1 device supporting CUDA
Device 0: "GeForce 8500 GT"
CUDA Driver Version: 3.20
CUDA Runtime Version: 3.20
CUDA Capability Major/Minor version number: 1.1
Total amount of global memory: 536150016 bytes
Multiprocessors x Cores/MP = Cores: 2 (MP) x 8 (Cores/MP) = 16 (Cores)
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 16384 bytes
Total number of registers available per block: 8192
Warp size: 32
Maximum number of threads per block: 512
Maximum sizes of each dimension of a block: 512 x 512 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 256 bytes
Clock rate: 0.92 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: Yes
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Default (multiple host threads can use this device simultaneously)
Concurrent kernel execution: No
Device has ECC support enabled: No
Device is using TCC driver mode: No
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 3.20, CUDA Runtime Version = 3.20, NumDevs = 1, Device = GeForce 8500 GT
PASSED
```

Thank you for taking the time to read my post! Any answers will be much appreciated.

Andrew.