What are the unexpected differences between these styles of chopping up shared memory?
// Block 1: kernel invoked with 128*sizeof(unsigned int) of dynamic shared memory
extern __shared__ unsigned int shared[];
unsigned int *shared_data = &shared[0];
unsigned int *shared_rank = &shared[64]; // Offset could be passed in too.

// Block 2
__shared__ unsigned int shared[128];
unsigned int *shared_data = &shared[0];
unsigned int *shared_rank = &shared[64];

// Block 3
__shared__ unsigned int shared_data[64];
__shared__ unsigned int shared_rank[64];

....
sort_rank( shared_rank, shared_data, thread );
....
I’ve watered down my program to the above pseudo code.
I’d like to avoid static shared memory allocation and size things at run time (as in “Block 1”). However, every variant except “Block 3” produces some kind of error that hangs the device. sort_rank runs a fixed-length loop based on blockDim. I’ve tried adding bounds checking on array accesses (i.e. array[min(63, (unsigned)i)]), without luck. All three variants work in the emulator.
Thanks for any advice.
Edit: Looks like this was partially addressed here (I should have scrolled down…). However, the solution in the second block of code in that post doesn’t compile; the compiler complains that shared pointers can’t have initializers.
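For reference, here is a minimal sketch of the run-time-sized layout that “Block 1” is aiming for. It is not my real program: fill_rank is a hypothetical stand-in for sort_rank, and partition_demo, run_partition_demo and the rank_offset parameter are names made up for this example. It assumes a single block of 64 threads. The two pointers are plain local variables without the __shared__ qualifier, which as far as I can tell is what lets them carry initializers.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for sort_rank: each thread fills one slot per region.
__device__ void fill_rank(unsigned int *rank, unsigned int *data, unsigned int thread)
{
    data[thread] = 2u * thread;
    rank[thread] = thread;
}

__global__ void partition_demo(unsigned int *out, unsigned int rank_offset)
{
    // One unsized extern array; its byte size is the third launch parameter.
    extern __shared__ unsigned int shared[];

    // Ordinary local pointers (no __shared__ qualifier) into the two regions.
    unsigned int *shared_data = &shared[0];
    unsigned int *shared_rank = &shared[rank_offset];

    unsigned int thread = threadIdx.x;
    fill_rank(shared_rank, shared_data, thread);
    __syncthreads(); // make every thread's writes visible to the block

    // Read a neighbour's slots to show the data really is shared.
    unsigned int next = (thread + 1) % blockDim.x;
    out[thread] = shared_data[next] + shared_rank[next];
}

// Host helper: launches one block and checks the results.
bool run_partition_demo()
{
    const unsigned int n = 64;
    unsigned int h_out[n];
    unsigned int *d_out = nullptr;
    if (cudaMalloc(&d_out, n * sizeof(unsigned int)) != cudaSuccess) return false;

    // 2 * n unsigned ints of dynamic shared memory: data region, then rank region.
    partition_demo<<<1, n, 2 * n * sizeof(unsigned int)>>>(d_out, n);

    bool ok = cudaGetLastError() == cudaSuccess
           && cudaDeviceSynchronize() == cudaSuccess
           && cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(d_out);

    for (unsigned int i = 0; ok && i < n; ++i) {
        unsigned int next = (i + 1) % n;
        ok = (h_out[i] == 3u * next); // 2*next from data plus next from rank
    }
    return ok;
}

int main()
{
    bool ok = run_partition_demo();
    printf("%s\n", ok ? "partition OK" : "partition FAILED");
    return ok ? 0 : 1;
}
```

Passing the offset as a kernel argument keeps the split between the two regions adjustable at launch time, as long as the shared memory size in the launch configuration covers both regions.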