Complex extern __shared__ memory best practices

I would like to keep several pieces of data in shared memory whose sizes are only known at kernel launch time: two arrays and a few other variables. The two arrays have different element types, as do the other variables. For example, in an ideal world, I would be able to have something like:

extern __shared__ float d_shared_originalXTimeSeriesArray[];

extern __shared__ double d_shared_doubleScratchArray[];

extern __shared__ double d_shared_Xmean;

extern __shared__ unsigned int d_timeSeriesLength;

But it appears that this is not doable.

The only way I know of doing this now is to have one shared array and cast parts of it to other values. For example, I declare:

extern __shared__ char shared_array[];

And during the kernel call, I determine the size of the whole shared_array using something like:

unsigned int time_series_length = 119;

unsigned int array_size = 0;

array_size += sizeof(unsigned int); // for d_timeSeriesLength (must be at a known offset, so 0 seems good).

array_size += sizeof(double); // for d_shared_Xmean

array_size += time_series_length * sizeof(float); // for d_shared_originalXTimeSeriesArray[]

array_size += time_series_length * sizeof(double); // for d_shared_doubleScratchArray[]

myKernel <<< nBlocks, blockSize, array_size >>>(foo);

Inside myKernel, I think I must then calculate the offsets (and use up registers) like so:

unsigned int d_timeSeriesLength_Offset = 0;

unsigned int d_shared_Xmean_Offset = d_timeSeriesLength_Offset + sizeof(unsigned int);

unsigned int d_shared_originalXTimeSeriesArray_Offset = d_shared_Xmean_Offset + sizeof(double); // previous offset plus the size of d_shared_Xmean

unsigned int d_shared_doubleScratchArray_Offset = d_shared_originalXTimeSeriesArray_Offset + *(unsigned int *)&shared_array[d_timeSeriesLength_Offset] * sizeof(float); // previous offset plus the size of the float array

After which I can access/store the values in this cumbersome manner:

*(unsigned int *)&shared_array[d_timeSeriesLength_Offset]; // the time series length.

*(double *)&shared_array[d_shared_Xmean_Offset]; // the time series mean (the byte offset must be a multiple of sizeof(double) for this access to be valid).

((float *)&shared_array[d_shared_originalXTimeSeriesArray_Offset])[i]; // the i-th value of the time series.

And since the time series are rather short (to date < 512 time points in our case), this can all fit into shared memory no problem. But I must assume this is NOT the best way to do this. Is there a better way? I tried to read about extern shared struct usage, but it looks dubious at best.

If there is nothing, I might resort to macros to clean up the code and eliminate the register usage.

Any suggestions?

For dynamically allocated shared memory, there really isn’t much choice but to index into the single allocation that the language supports per block. You can still use separate, statically declared variables for anything whose size is known at compile time, so only the arrays that must be sized at runtime require indexing or pointer calculation. Because shared memory uses 32-bit banks, you would probably be better off declaring the dynamically allocated block as a 32-bit type rather than char. The compiler is more likely to do a good job for you if the element type you choose is 32 bits wide.

Have a look at appendix B.2.3 of the Programming Guide. It presents a nicer way to write the necessary pointer arithmetic.
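To make the two suggestions above concrete, here is a minimal sketch of how they might be combined. It is not taken from the Programming Guide. The names meanKernel, runMeanExample, dynShared, scratch, series and xMean are made up for illustration, and the series length is passed as a kernel argument instead of being stored in shared memory. The sketch declares the dynamic block as double rather than as a 32-bit type, so that the base address is aligned for the widest element stored in it, and places the double array first so the float array that follows stays aligned as well.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// The single dynamic allocation per block. Declared as double (an assumption
// of this sketch) so its base is aligned for the widest type stored in it.
extern __shared__ double dynShared[];

// Hypothetical kernel: one block computes the mean of one short time series.
__global__ void meanKernel(const float *in, unsigned int n, double *out)
{
    // A value of fixed size can be an ordinary static shared variable.
    __shared__ double xMean;

    // Carve the two runtime-sized arrays out of the one block with typed
    // pointers, widest element type first so both pointers stay aligned.
    double *scratch = dynShared;                               // n doubles
    float  *series  = reinterpret_cast<float *>(&scratch[n]);  // n floats

    for (unsigned int i = threadIdx.x; i < n; i += blockDim.x) {
        series[i]  = in[i];
        scratch[i] = static_cast<double>(series[i]);
    }
    __syncthreads();

    if (threadIdx.x == 0) {
        double sum = 0.0;
        for (unsigned int i = 0; i < n; ++i) sum += scratch[i];
        xMean = sum / n;
    }
    __syncthreads();

    // Every thread can now read the shared mean, here to center the data.
    for (unsigned int i = threadIdx.x; i < n; i += blockDim.x)
        scratch[i] -= xMean;

    if (threadIdx.x == 0) *out = xMean;
}

// Hypothetical host helper: runs the kernel on the series 1, 2, ..., n
// and returns the mean computed on the device.
double runMeanExample(unsigned int n)
{
    std::vector<float> h(n);
    for (unsigned int i = 0; i < n; ++i) h[i] = static_cast<float>(i + 1);

    float  *dIn  = nullptr;
    double *dOut = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dOut, sizeof(double));
    cudaMemcpy(dIn, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // The size of the dynamic block must match the layout used in the kernel.
    size_t sharedBytes = n * sizeof(double) + n * sizeof(float);
    meanKernel<<<1, 128, sharedBytes>>>(dIn, n, dOut);
    assert(cudaGetLastError() == cudaSuccess);

    double mean = 0.0;
    cudaMemcpy(&mean, dOut, sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return mean;
}
```

Calling runMeanExample(119) launches one block with 119 * (8 + 4) = 1428 bytes of dynamic shared memory. The kernel needs no byte-offset bookkeeping, because each typed pointer is derived from the end of the previous array. If the arrays were laid out in a different order, padding might be needed to keep the double array on an 8-byte boundary.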