top = stack_sdata[threadIdx.x * max_stack_depth + stack_depth - 1];
Say, together with the stack of floats I need a stack of numbers whose maximum value is <= 255, so one byte per stack item is enough. I can declare another stack of bytes like this: extern __shared__ unsigned char stack_sdata_char[]; However, the question arises: will this slow things down or not?
I mean, is it harmless to use one-byte numbers instead of 4- or 8-byte numbers in shared memory?
Reading single bytes from shared memory can cause bank conflicts, because up to four threads will want to read from the same 32-bit bank at the same time. On compute capability 1.x hardware these accesses are serialized into 4 separate reads (on 2.x and later, threads reading different bytes of the same 32-bit word are served together without a conflict). That said, shared memory is so fast that even 4-way bank conflicts might not be a noticeable performance hit in your application. If saving shared memory space is important, I'd say try the char stack and see.
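To make the "four threads, one bank" arithmetic concrete, here is a rough sketch that maps byte offsets to banks. It assumes 4-byte-wide banks, with 16 banks on compute capability 1.x and 32 on later devices; bank_of_byte and max_threads_per_bank are hypothetical helper names made up for this illustration, not CUDA API calls.

```cuda
#include <cstdio>

// Assumption: banks are 4 bytes wide, so consecutive 32-bit words map to
// consecutive banks. num_banks would be 16 on compute capability 1.x
// and 32 on later devices.
__host__ __device__ inline int bank_of_byte(int byte_offset, int num_banks)
{
    return (byte_offset / 4) % num_banks;
}

// Hypothetical helper: thread t touches the byte at offset t * stride_bytes.
// Returns the largest number of threads that land in the same bank.
// Note: this counts threads per bank only; whether sharing a bank actually
// costs extra cycles depends on the hardware generation.
// Supports num_banks <= 32.
int max_threads_per_bank(int stride_bytes, int threads, int num_banks)
{
    int hits[32] = {0};
    int worst = 0;
    for (int t = 0; t < threads; ++t) {
        int b = bank_of_byte(t * stride_bytes, num_banks);
        if (++hits[b] > worst) worst = hits[b];
    }
    return worst;
}
```

With 16 threads and 16 banks, a stride of 1 byte (neighbouring threads reading neighbouring chars) puts 4 threads in each bank, which is the 4-way case described above. With a stride of 4 bytes every thread gets its own bank. For the per-thread stacks in the question the stride is max_stack_depth bytes, so the outcome depends on that value.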
Oh, but you should be aware that CUDA handles the dynamic shared memory (which you are doing with the extern keyword) in a peculiar way. Since there is only one shared memory parameter in the kernel call, CUDA doesn’t know how much to assign to each dynamic array. (This isn’t a problem for the statically sized shared arrays, which CUDA handles separately.) If you want two dynamic shared arrays, you have to do something like this:
extern __shared__ char smem[]; // element type is irrelevant here, but it must be an unsized array, not a pointer
float *stack_sdata = (float *) smem;
char *stack_cdata = (char *) (smem + sizeof(float) * sdata_size);
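Here sdata_size is the number of elements in the float array, which you would have to pass in as a parameter to the kernel. The value for Ns in the host kernel call will then be the total size of both arrays in bytes: sizeof(float) * sdata_size plus the number of bytes in the char array.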
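Putting the pieces together, a minimal sketch of the two-stack setup might look like the following. The kernel name, the cdata_size variable, the pushed values, and the run_two_stack_demo wrapper are all made up for illustration; only the carve-up of a single extern __shared__ buffer and the Ns computation follow the description above.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread keeps a float stack and a byte stack,
// both carved out of the single dynamic shared memory buffer.
__global__ void two_stack_kernel(float *out_f, unsigned char *out_c,
                                 int max_stack_depth, int sdata_size)
{
    extern __shared__ unsigned char smem[];
    // Floats go first so they start at the (suitably aligned) buffer base.
    float *stack_sdata = (float *)smem;
    unsigned char *stack_cdata = smem + sizeof(float) * sdata_size;

    int base = threadIdx.x * max_stack_depth;  // this thread's stack slot
    int stack_depth = 0;

    // Push two items onto each stack (arbitrary demo values).
    stack_sdata[base + stack_depth] = 1.5f * threadIdx.x;
    stack_cdata[base + stack_depth] = (unsigned char)(threadIdx.x & 0xFF);
    ++stack_depth;
    stack_sdata[base + stack_depth] = 2.0f * threadIdx.x;
    stack_cdata[base + stack_depth] = (unsigned char)((threadIdx.x + 1) & 0xFF);
    ++stack_depth;

    // Read the top of each stack, as in the original snippet.
    out_f[threadIdx.x] = stack_sdata[base + stack_depth - 1];
    out_c[threadIdx.x] = stack_cdata[base + stack_depth - 1];
}

// Hypothetical host wrapper: returns 0 if the results match expectations.
int run_two_stack_demo()
{
    const int threads = 64, max_stack_depth = 8;
    const int sdata_size = threads * max_stack_depth;  // float elements
    const int cdata_size = threads * max_stack_depth;  // byte elements
    // Ns: total dynamic shared memory in bytes for both arrays.
    const size_t Ns = sizeof(float) * sdata_size
                    + sizeof(unsigned char) * cdata_size;

    float *d_f = 0;
    unsigned char *d_c = 0;
    if (cudaMalloc(&d_f, threads * sizeof(float)) != cudaSuccess) return 1;
    if (cudaMalloc(&d_c, threads) != cudaSuccess) { cudaFree(d_f); return 1; }

    two_stack_kernel<<<1, threads, Ns>>>(d_f, d_c, max_stack_depth, sdata_size);

    float h_f[threads];
    unsigned char h_c[threads];
    cudaError_t err = cudaMemcpy(h_f, d_f, sizeof(h_f), cudaMemcpyDeviceToHost);
    if (err == cudaSuccess)
        err = cudaMemcpy(h_c, d_c, sizeof(h_c), cudaMemcpyDeviceToHost);
    cudaFree(d_f);
    cudaFree(d_c);
    if (err != cudaSuccess) return 1;

    for (int t = 0; t < threads; ++t)
        if (h_f[t] != 2.0f * t || h_c[t] != (unsigned char)(t + 1)) return 2;
    return 0;
}
```

Calling run_two_stack_demo() from main and checking for a zero return is enough to try it out. Putting the float array first keeps it 4-byte aligned; if the byte array came first, its size would need to be rounded up to a multiple of 4 before the floats.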
You need the manual offsets because CUDA specifies that all shared memory arrays whose size is not defined (the extern __shared__ ones) start at the same address, which I believe is simply the start of the dynamically allocated shared memory. So declaring two such arrays would just give you two equivalent pointers.
I suggest reading the CUDA Programming Guide; there are more details there.