I would like to store several pieces of data in shared memory whose sizes are only known at kernel launch time: two arrays and a few scalar variables. The two arrays have different types, as do the scalars. In an ideal world, I would be able to write something like:
extern __shared__ float d_shared_originalXTimeSeriesArray[];
extern __shared__ double d_shared_doubleScratchArray[];
extern __shared__ double d_shared_Xmean;
extern __shared__ unsigned int d_timeSeriesLength;
But it appears that this is not doable.
The only way I know of doing this now is to have one shared array and reinterpret parts of it as other types. For example, I declare:
extern __shared__ char shared_array[];
And at kernel launch, on the host, I determine the size of the whole shared_array using something like:
unsigned int time_series_length = 119;
unsigned int array_size = 0;
array_size += sizeof(unsigned int); // for d_timeSeriesLength (must be at a known offset, so 0 seems good).
array_size += sizeof(double); // for d_shared_Xmean
array_size += time_series_length * sizeof(float); // for d_shared_originalXTimeSeriesArray[]
array_size += time_series_length * sizeof(double); // for d_shared_doubleScratchArray[]
myKernel <<< nBlocks, blockSize, array_size >>>(foo);
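One caveat with summing the sizes back to back: the double that follows the 4-byte length would start at byte offset 4, and CUDA expects each access to be aligned to the size of its type. A minimal host-side sketch of a padded layout follows. The names `align_up`, `SharedLayout` and `make_layout` are hypothetical helpers, not part of the CUDA API, and the sketch assumes 4-byte `unsigned int` and `float` and 8-byte `double`.

```cpp
#include <cassert>
#include <cstddef>

// Round `offset` up to the next multiple of `alignment` (must be a power of two).
inline std::size_t align_up(std::size_t offset, std::size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (offset + alignment - 1) & ~(alignment - 1);
}

// Hypothetical record of where each item lives inside the shared pool (in bytes).
struct SharedLayout {
    std::size_t length_offset;   // d_timeSeriesLength
    std::size_t xmean_offset;    // d_shared_Xmean
    std::size_t series_offset;   // d_shared_originalXTimeSeriesArray[]
    std::size_t scratch_offset;  // d_shared_doubleScratchArray[]
    std::size_t total_bytes;     // value for the third <<< >>> launch argument
};

// Same item order as in the question, with padding before each item
// so that it starts on a multiple of its own element size.
inline SharedLayout make_layout(unsigned int time_series_length)
{
    SharedLayout layout{};
    std::size_t cursor = 0;

    layout.length_offset = cursor;
    cursor += sizeof(unsigned int);

    cursor = align_up(cursor, sizeof(double));
    layout.xmean_offset = cursor;
    cursor += sizeof(double);

    cursor = align_up(cursor, sizeof(float));
    layout.series_offset = cursor;
    cursor += time_series_length * sizeof(float);

    cursor = align_up(cursor, sizeof(double));
    layout.scratch_offset = cursor;
    cursor += time_series_length * sizeof(double);

    layout.total_bytes = cursor;
    return layout;
}
```

Under these assumptions the launch would pass `make_layout(time_series_length).total_bytes` as the dynamic shared memory size. Ordering the items from largest to smallest element size instead would avoid most of the padding.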
Inside myKernel, I think I must then calculate the offsets (and use up registers) like so:
unsigned int d_timeSeriesLength_Offset = 0;
unsigned int d_shared_Xmean_Offset = d_timeSeriesLength_Offset + sizeof(unsigned int);
unsigned int d_shared_originalXTimeSeriesArray_Offset = d_shared_Xmean_Offset + sizeof(double);
unsigned int d_shared_doubleScratchArray_Offset = d_shared_originalXTimeSeriesArray_Offset + *(unsigned int *)&shared_array[d_timeSeriesLength_Offset] * sizeof(float);
After which I can access/store the values in this cumbersome manner:
*(unsigned int *)&shared_array[d_timeSeriesLength_Offset]; // the time series length.
*(double *)&shared_array[d_shared_Xmean_Offset]; // the time series mean.
((float *)&shared_array[d_shared_originalXTimeSeriesArray_Offset])[i]; // the i-th value of the time series.
Since the time series are rather short (fewer than 512 time points so far in our case), all of this fits into shared memory without a problem. But I must assume this is NOT the best way to do this. Is there a better way? I tried to read about extern shared struct usage, but it looks dubious at best.
If there is nothing better, I might resort to macros to clean up the code and eliminate the register usage.
Any suggestions?
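For reference, one pattern often used for this is to declare a single untyped pool and carve typed pointers out of it once at the top of the kernel, placing the items with the largest element size first so that no padding is needed. The following is only a sketch under assumptions: `shared_pool`, `meanKernel` and `meanOnDevice` are hypothetical names, the series length is passed as an ordinary kernel argument instead of being stored in shared memory, the sum is done serially by thread 0 for brevity, and error checking is omitted.

```cuda
#include <cuda_runtime.h>

// One untyped pool; its size is the third <<< >>> launch argument.
extern __shared__ __align__(16) unsigned char shared_pool[];

// Hypothetical kernel: each block stages one series in shared memory
// and writes its mean to means[blockIdx.x].
__global__ void meanKernel(const float *series, double *means, unsigned int n)
{
    // Doubles first, floats last, so every pointer is naturally aligned.
    double *scratch = reinterpret_cast<double *>(shared_pool); // n doubles
    double *xmean   = scratch + n;                             // 1 double
    float  *x       = reinterpret_cast<float *>(xmean + 1);    // n floats

    // Cooperative copy of this block's series into shared memory.
    for (unsigned int i = threadIdx.x; i < n; i += blockDim.x) {
        x[i] = series[blockIdx.x * n + i];
        scratch[i] = x[i];
    }
    __syncthreads();

    // Serial sum by one thread; a parallel reduction would replace this.
    if (threadIdx.x == 0) {
        double sum = 0.0;
        for (unsigned int i = 0; i < n; ++i) sum += scratch[i];
        *xmean = sum / n;
        means[blockIdx.x] = *xmean;
    }
}

// Hypothetical host wrapper for a single series (no error checking).
double meanOnDevice(const float *h_series, unsigned int n)
{
    float *d_series = nullptr;
    double *d_mean = nullptr;
    double h_mean = 0.0;
    cudaMalloc(&d_series, n * sizeof(float));
    cudaMalloc(&d_mean, sizeof(double));
    cudaMemcpy(d_series, h_series, n * sizeof(float), cudaMemcpyHostToDevice);

    // Must match the carving order in the kernel: n + 1 doubles, then n floats.
    size_t shared_bytes = (n + 1) * sizeof(double) + n * sizeof(float);
    meanKernel<<<1, 128, shared_bytes>>>(d_series, d_mean, n);

    cudaMemcpy(&h_mean, d_mean, sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(d_series);
    cudaFree(d_mean);
    return h_mean;
}
```

The typed pointers are computed once per thread, after which accesses read as ordinary array indexing (`x[i]`, `scratch[i]`, `*xmean`) with no casts at the point of use.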