Dynamic Allocation of Shared Memory

I would like to dynamically allocate arrays in shared memory, but I am having difficulty doing so. I have followed the advice found here: http://www.drdobbs.com/high-performance-computing/208801731;jsessionid=EV5BCGNX20BR3QE1GHRSKH4ATMY32JVN?pgno=2

However, when I implement this method for two arrays, some of the data is lost.

In my CUDA code I launch the kernel with a block size of 6 and a grid size of 4 (both 1D):

__global__ void run_sim(float *a, float *b, int buf_sz){
    const int g_idx = (blockIdx.x*blockDim.x+threadIdx.x+buf_sz);
    const int l_idx = threadIdx.x+buf_sz;

    extern __shared__ int s_data[];
    int * aT;
    aT = &s_data[0];
    aT[l_idx] = g_idx;
    a[g_idx] = aT[l_idx];
}
This code works as I expect, returning:
[0 0 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 0 0]
(buf_sz=2)

Now I try to create two arrays by modifying the code as follows:
__global__ void run_sim(float *a, float *b, int buf_sz){
    const int g_idx = (blockIdx.x*blockDim.x+threadIdx.x+buf_sz);
    const int l_idx = threadIdx.x+buf_sz;

    extern __shared__ int s_data[];
    int * aT;
    aT = &s_data[0];
    aT[l_idx] = g_idx;
    a[g_idx] = aT[l_idx];

    int * bT;
    bT = &s_data[300]; //or desired size of a
    bT[l_idx] = g_idx;
    b[g_idx] = bT[l_idx];
}

This returns b as expected; however, a is now stored in global memory as:
[ 2 3 0 0 0 0 0 0 10 11 0 0 0 0 0 0 18 19 0 0 0 0 0 0]
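
For reference, here is a minimal, self-contained sketch of the two-array layout I am aiming for. It is an illustration under assumptions, not a diagnosis of the output above: the kernel name two_arrays, the n_per_array parameter, and all of the host code are hypothetical, a and b are int rather than float to keep the example simple, and the launch is assumed to pass the dynamic shared-memory size in bytes as the third <<<...>>> argument, large enough to hold both sub-arrays.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: carves one dynamic shared buffer into two int arrays.
// n_per_array (an assumed parameter) is the length of each sub-array.
__global__ void two_arrays(int *a, int *b, int buf_sz, int n_per_array)
{
    extern __shared__ int s_data[];   // size comes from the 3rd launch argument
    int *aT = s_data;                 // elements [0, n_per_array)
    int *bT = s_data + n_per_array;   // elements [n_per_array, 2 * n_per_array)

    const int g_idx = blockIdx.x * blockDim.x + threadIdx.x + buf_sz;
    const int l_idx = threadIdx.x + buf_sz;

    aT[l_idx] = g_idx;
    bT[l_idx] = -g_idx;               // a different value, to tell b apart from a

    // Each thread reads back only the element it wrote itself,
    // so no __syncthreads() is needed in this particular sketch.
    a[g_idx] = aT[l_idx];
    b[g_idx] = bT[l_idx];
}

int main()
{
    constexpr int block = 6, grid = 4, buf_sz = 2;
    constexpr int n = grid * block + 2 * buf_sz;   // 28 global elements
    constexpr int n_per_array = block + buf_sz;    // largest l_idx is block - 1 + buf_sz
    const size_t shmem_bytes = 2 * n_per_array * sizeof(int);

    int *d_a = nullptr, *d_b = nullptr;
    cudaMalloc(&d_a, n * sizeof(int));
    cudaMalloc(&d_b, n * sizeof(int));
    cudaMemset(d_a, 0, n * sizeof(int));
    cudaMemset(d_b, 0, n * sizeof(int));

    // The third launch parameter is the dynamic shared memory size in bytes.
    two_arrays<<<grid, block, shmem_bytes>>>(d_a, d_b, buf_sz, n_per_array);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    int h_a[n], h_b[n];
    cudaMemcpy(h_a, d_a, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(h_b, d_b, n * sizeof(int), cudaMemcpyDeviceToHost);

    for (int i = 0; i < n; ++i) std::printf("%d ", h_a[i]);
    std::printf("\n");
    for (int i = 0; i < n; ++i) std::printf("%d ", h_b[i]);
    std::printf("\n");

    cudaFree(d_a);
    cudaFree(d_b);
    return 0;
}
```

The sketch derives the offset of the second array from the same n_per_array value that sizes the launch, instead of a fixed index such as 300, so the byte count requested at launch and the indices used in the kernel cannot drift apart. It should build with something like nvcc sketch.cu -o sketch (the file name is arbitrary).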

The questions are:

  1. Does anyone know of a better way to dynamically allocate arrays in shared memory?
  2. Can anyone diagnose the above problem?

Thanks,

G