My block size varies at runtime: depending on which compute capability the GPU has, the program chooses 128 or 256 threads per block.
However, I have problems using the same kernel for the two cases, because the size of the shared memory arrays must be hardcoded.
I am aware that one solution would be to copy-paste my kernel and make separate 128 and 256 versions, but is there a better way to size those shared arrays dynamically at runtime, so my code is maintained in only one kernel?
__CUDA_ARCH__ is not useful here, I guess, because it sets the values at compile time and the program would keep those block size values forever.
CUDA allows dynamic allocation of shared memory, so you can define the size of a shared memory allocation at run time: declare the array as extern __shared__ and pass the size in bytes as the third parameter of the launch configuration. A minimal sketch of that approach (the kernel and variable names here are made up for illustration):
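__global__ void tagBlocks(const int *L, int *activated)
{
    // size is supplied by the third <<<...>>> launch parameter,
    // not hardcoded at compile time
    extern __shared__ int active[];

    unsigned int tid = threadIdx.x;
    active[tid] = ( L[blockIdx.x * blockDim.x + tid] == 1 );
    __syncthreads();

    if (tid == 0) {
        int any = 0;
        for (unsigned int i = 0; i < blockDim.x; i++)
            any = (any || active[i]);
        activated[blockIdx.x] = any;
    }
}

// host code: choose the block size at run time and size shared memory to match
unsigned int threadsPerBlock = useBigBlocks ? 256 : 128;
tagBlocks<<<numBlocks, threadsPerBlock, threadsPerBlock * sizeof(int)>>>(L, activated);

Note that only one extern __shared__ array can be declared per kernel; if you need several shared arrays, you have to partition the single allocation by hand.

You could also use templating, which is a rather clean way of having one kernel code compiled into different versions. Take this templated kernel as an example: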
template <typename Real, unsigned int rows>
__global__ void fimTag(const Real *phi0, const Real *phi1, int *L, int *activated,
                       const unsigned int m, const unsigned int n, const unsigned int lda)
{
    // shared memory size is fixed per template instantiation
    const unsigned int bsize = rows * 16;
    volatile __shared__ int active[bsize];

    volatile unsigned int tidx = 1 + threadIdx.x + blockIdx.x * blockDim.x;
    volatile unsigned int tidy = 1 + threadIdx.y + blockIdx.y * blockDim.y;
    volatile unsigned int gid  = tidx * lda + tidy;
    volatile unsigned int tid  = threadIdx.x + blockDim.x * threadIdx.y;
    volatile unsigned int bid  = blockIdx.x + gridDim.x * blockIdx.y;

    active[tid] = ( L[gid] == 1 );
    __syncthreads();

    // warp-synchronous reduction: the first warp ORs together all flags
    if (tid < 32) {
        for (int i = tid + 32; i < bsize; i += 32) {
            active[tid] = (active[tid] || active[i]);
        }
        int aval = __any( active[tid] > 0 );
        if (tid == 0) { activated[bid] = aval; }
    }
}
Using templating, it becomes possible to compile versions with different shared memory sizes, all from the same kernel source.
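On the host side you then select the instantiation that matches the block size chosen at run time, along these lines (a sketch; the grid setup is a placeholder, and the 16-wide block shape with rows = blockDim.y follows the kernel above):

// host code: rows = 8 gives bsize = 128, rows = 16 gives bsize = 256
dim3 block(16, threadsPerBlock / 16);
dim3 grid(gridX, gridY);   // grid dimensions as required by your problem size
if (threadsPerBlock == 256)
    fimTag<float, 16><<<grid, block>>>(phi0, phi1, L, activated, m, n, lda);
else
    fimTag<float,  8><<<grid, block>>>(phi0, phi1, L, activated, m, n, lda);

Because template arguments must be compile-time constants, both instantiations are compiled into the binary; the runtime branch merely picks which one to launch.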
I'm interested in dynamic shared memory allocation at runtime.
So I tried the template approach, and I managed to put a conditional in host code: depending on the architecture, I call the kernel with the corresponding template argument (128 or 256).
Actually __CUDA_ARCH__ is helpful in your case, if used in combination with nvcc's --generate-code option. You can pass multiple arch= and code= arguments to it, which results in multiple nvopencc invocations for the different architectures. The binary suitable for the present GPU is then automatically selected at runtime.
Using conditional compilation depending on __CUDA_ARCH__, you can then adjust the size of shared memory arrays and any other implementation details depending on architecture.
The only complication in your case is that you also want to vary the block size depending on architecture, which however is set from host code. While different solutions to this problem are possible, I would set a variable in device code (where __CUDA_ARCH__ is available) to the desired block size, and copy it to host code at runtime using cudaMemcpyFromSymbol(). This keeps all decisions about architecture local to one file, and prevents problems if the runtime ever decides to load a different binary than you expect.
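A minimal sketch of that idea (the compute capability threshold and variable names are just examples):

// device code: __CUDA_ARCH__ is defined here, so the value can be set
// per architecture at compile time
#if __CUDA_ARCH__ >= 200
__device__ int d_blockSize = 256;
#else
__device__ int d_blockSize = 128;
#endif

// host code: query which value the binary loaded for this GPU was built with
int threadsPerBlock;
cudaMemcpyFromSymbol(&threadsPerBlock, d_blockSize, sizeof(int));

Compile with one --generate-code option per target architecture, e.g. nvcc --generate-code arch=compute_13,code=sm_13 --generate-code arch=compute_20,code=sm_20, and the runtime will load the matching image.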
I get your idea; I discarded it because I have specific offsets that are input-data dependent, most of the time not a multiple of 16, and I assumed that would affect performance.
You know what, at the beginning I wanted to play with __CUDA_ARCH__, but for some reason I was having problems with the variable being undefined (even in device code), so I moved on to this.
__CUDA_ARCH__ is undefined in host code (as the same host code is used for different devices). Testing whether it is defined actually is the recommended way to distinguish between host and device code.
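For example, a function compiled for both sides can branch on whether the macro is defined (the return values here are arbitrary):

// __CUDA_ARCH__ is only defined while the device compilation trajectory
// runs, so this returns 128 on the host and 256 on the device
__host__ __device__ unsigned int defaultBlockSize()
{
#ifdef __CUDA_ARCH__
    return 256;
#else
    return 128;
#endif
}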