My block size varies at runtime: depending on which compute capability the GPU has, the program chooses 128 or 256 threads per block.
However, I have problems using the same kernel for the two cases, because the size of the shared memory arrays must be hardcoded.
I am aware that one solution would be to copy-paste my kernel and make separate 128 and 256 versions, but is there a better way to size those shared arrays dynamically at runtime, so my code is maintained in only one kernel?
__CUDA_ARCH__ is not useful here, I guess, because it sets the values at compile time and the program would keep those block size values forever.
CUDA allows dynamic allocation of shared memory, so you can define the size of a shared memory allocation at run time: declare the array as extern __shared__ and pass the size in bytes as the third parameter of the launch configuration. A minimal sketch of that approach (the kernel and variable names here are made up for illustration):
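__global__ void tagBlocks(const int *L, int *activated)
{
    // size is supplied by the third <<<...>>> launch parameter,
    // not hardcoded at compile time
    extern __shared__ int active[];

    unsigned int tid = threadIdx.x;
    active[tid] = ( L[blockIdx.x * blockDim.x + tid] == 1 );
    __syncthreads();

    if (tid == 0) {
        int any = 0;
        for (unsigned int i = 0; i < blockDim.x; i++)
            any = (any || active[i]);
        activated[blockIdx.x] = any;
    }
}

// host code: choose the block size at run time and size shared memory to match
unsigned int threadsPerBlock = useBigBlocks ? 256 : 128;
tagBlocks<<<numBlocks, threadsPerBlock, threadsPerBlock * sizeof(int)>>>(L, activated);

Note that only one extern __shared__ array can be declared per kernel; if you need several shared arrays, you have to partition the single allocation by hand.

You could also use templating, which is a rather clean way of having one kernel code compiled into different versions. Take this templated kernel as an example: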
template <typename Real, unsigned int rows>
__global__ void fimTag(const Real *phi0, const Real *phi1, int *L, int *activated,
                       const unsigned int m, const unsigned int n, const unsigned int lda)
{
    // shared memory size is fixed per template instantiation
    const unsigned int bsize = rows * 16;
    volatile __shared__ int active[bsize];

    volatile unsigned int tidx = 1 + threadIdx.x + blockIdx.x * blockDim.x;
    volatile unsigned int tidy = 1 + threadIdx.y + blockIdx.y * blockDim.y;
    volatile unsigned int gid  = tidx * lda + tidy;
    volatile unsigned int tid  = threadIdx.x + blockDim.x * threadIdx.y;
    volatile unsigned int bid  = blockIdx.x + gridDim.x * blockIdx.y;

    active[tid] = ( L[gid] == 1 );
    __syncthreads();

    // warp-synchronous reduction: the first warp ORs together all flags
    if (tid < 32) {
        for (int i = tid + 32; i < bsize; i += 32) {
            active[tid] = (active[tid] || active[i]);
        }
        int aval = __any( active[tid] > 0 );
        if (tid == 0) { activated[bid] = aval; }
    }
}
Using templating, it becomes possible to compile versions with different shared memory sizes, all from the same kernel source.
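On the host side you then select the instantiation that matches the block size chosen at run time, along these lines (a sketch; the grid setup is a placeholder, and the 16-wide block shape with rows = blockDim.y follows the kernel above):

// host code: rows = 8 gives bsize = 128, rows = 16 gives bsize = 256
dim3 block(16, threadsPerBlock / 16);
dim3 grid(gridX, gridY);   // grid dimensions as required by your problem size
if (threadsPerBlock == 256)
    fimTag<float, 16><<<grid, block>>>(phi0, phi1, L, activated, m, n, lda);
else
    fimTag<float,  8><<<grid, block>>>(phi0, phi1, L, activated, m, n, lda);

Because template arguments must be compile-time constants, both instantiations are compiled into the binary; the runtime branch merely picks which one to launch.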
I'm interested in dynamic shared memory allocation at runtime.
So I tried the template approach, and I managed to put a conditional in host code: depending on the architecture, I call the kernel with the corresponding template argument (128 or 256).
Actually __CUDA_ARCH__ is helpful in your case, if used in combination with nvcc's --generate-code option. You can pass multiple arch= and code= arguments to it, which results in multiple nvopencc invocations for the different architectures. The binary suitable for the present GPU is then automatically selected at runtime.
Using conditional compilation depending on __CUDA_ARCH__, you can then adjust the size of shared memory arrays and any other implementation details depending on architecture.
The only complication in your case is that you also want to vary the block size depending on architecture, which however is set from host code. While different solutions to this problem are possible, I would set a variable in device code (where __CUDA_ARCH__ is available) to the desired block size, and copy it to host code at runtime using cudaMemcpyFromSymbol(). This keeps all decisions about architecture local to one file, and prevents problems if the runtime ever decides to load a different binary than you expect.
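A minimal sketch of that idea (the compute capability threshold and variable names are just examples):

// device code: __CUDA_ARCH__ is defined here, so the value can be set
// per architecture at compile time
#if __CUDA_ARCH__ >= 200
__device__ int d_blockSize = 256;
#else
__device__ int d_blockSize = 128;
#endif

// host code: query which value the binary loaded for this GPU was built with
int threadsPerBlock;
cudaMemcpyFromSymbol(&threadsPerBlock, d_blockSize, sizeof(int));

Compile with one --generate-code option per target architecture, e.g. nvcc --generate-code arch=compute_13,code=sm_13 --generate-code arch=compute_20,code=sm_20, and the runtime will load the matching image.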
I get your idea; I discarded it because I have specific offsets that are input-data dependent, most of the time not a multiple of 16, and I assumed that would affect performance.
You know what, at the beginning I wanted to play with __CUDA_ARCH__, but for some reason I was having problems with the variable being undefined (even in device code), so I moved on to this.
__CUDA_ARCH__ is undefined in host code (as the same host code is used for different devices). Testing whether it is defined actually is the recommended way to distinguish between host and device code.
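For example, a function compiled for both sides can branch on whether the macro is defined (the return values here are arbitrary):

// __CUDA_ARCH__ is only defined while the device compilation trajectory
// runs, so this returns 128 on the host and 256 on the device
__host__ __device__ unsigned int defaultBlockSize()
{
#ifdef __CUDA_ARCH__
    return 256;
#else
    return 128;
#endif
}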