My block size varies at runtime: depending on which compute capability the GPU has, the program chooses 128 or 256 threads per block.
However, I have trouble using the same kernel for the two cases, because the size of the statically declared shared memory arrays must be a compile-time constant.
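For illustration, this is roughly what I mean (kernel and array names here are just placeholders):

```cuda
// Shared array size must be a compile-time constant,
// so it is hardcoded to match one particular block size.
__global__ void myKernel128(const float *in, float *out)
{
    __shared__ float buf[128];   // only valid when blockDim.x == 128

    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + tid;

    buf[tid] = in[idx];
    __syncthreads();

    // ... per-block work on buf ...

    out[idx] = buf[tid];
}
```

A 256-thread launch would need an identical kernel with `buf[256]`, which is the duplication I want to avoid.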
I'm aware that one solution would be to copy-paste my kernel and maintain separate 128 and 256 versions, but is there a better way to size those shared arrays at runtime, so my code stays in a single kernel?
`__CUDA_ARCH__` doesn't help here, I guess, because it fixes the values at compile time, and the program would then be stuck with that block size forever.
Thanks in advance.