How to create arrays at runtime in shared memory?

Elan · December 24, 2011, 10:41pm

Hello!

I have a task in which a large number of threads run, each performing a small matrix multiplication. All the small matrices have already been loaded into global memory. I want to improve performance by having each thread load its small matrices into shared memory and then compute the product. The problem is that I do not know the sizes of the matrices at compile time, so I cannot declare variables such as __shared__ double mat1[xsize][ysize]. On a PC I would use dynamic allocation, but I do not know whether that is possible in shared memory. If calling malloc in a kernel allocates only in global memory, that does not help either.

Is there a way to declare arrays at runtime in a kernel? Is there any other way to solve this problem?

Thank you,
Elan.

Jimmy_Pettersson · December 25, 2011, 4:42am

You can pass the desired shared memory size (in bytes) as the third execution configuration parameter when you invoke the kernel; see section B.16, Execution Configuration, in the programming guide.

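If I remember correctly, you do something like this:

extern __shared__ float smemData[];  // size is set at launch time

__global__ void yourKernel( --- )
{
    // globalPtr is assumed to be one of the kernel arguments
    smemData[threadIdx.x] = globalPtr[threadIdx.x + blockIdx.x * blockDim.x];
    ...
}

// invocation: smemSize is the shared memory size in bytes per block
yourKernel<<< gridDim, blockDim, smemSize >>>( ---- );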

Overall the documentation seems a bit scarce on the subject…
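
For reference, a minimal self-contained sketch of this pattern might look like the following. The kernel name reverseInBlock and the host helper runReverse are made up for illustration, and error checking is left out to keep it short. The point is only that the extern __shared__ array has no size in the source and gets one from the third launch parameter.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical example kernel: each block stages its chunk of `in` in
// dynamically sized shared memory and writes the chunk back reversed.
__global__ void reverseInBlock(const float* in, float* out, int n)
{
    extern __shared__ float smem[];  // size comes from the launch configuration

    int base  = blockIdx.x * blockDim.x;
    int tid   = threadIdx.x;
    int count = min((int)blockDim.x, n - base);  // valid elements in this block

    if (tid < count) smem[tid] = in[base + tid];
    __syncthreads();  // every thread reaches this, including idle ones
    if (tid < count) out[base + tid] = smem[count - 1 - tid];
}

// Hypothetical host helper: copies data over, launches, copies the result back.
std::vector<float> runReverse(const std::vector<float>& hIn, int blockSize)
{
    int n = static_cast<int>(hIn.size());
    std::vector<float> hOut(n);
    float *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn,  n * sizeof(float));
    cudaMalloc(&dOut, n * sizeof(float));
    cudaMemcpy(dIn, hIn.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    int    grid      = (n + blockSize - 1) / blockSize;
    size_t smemBytes = blockSize * sizeof(float);  // decided at runtime
    reverseInBlock<<<grid, blockSize, smemBytes>>>(dIn, dOut, n);

    cudaMemcpy(hOut.data(), dOut, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return hOut;
}
```

Calling runReverse with six values and a block size of 4 launches two blocks, each with 16 bytes of shared memory; the second block only uses the first two slots.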

Elan · December 26, 2011, 4:13pm

This method dynamically allocates the same amount of memory for each thread. I have to give each thread differently sized matrices, whose upper and lower size bounds I do not know yet.

But thank you very much for the reply and the reference. It is a good starting point.
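
One possible way to get differently sized per-thread arrays out of a single dynamic allocation, sketched under two assumptions: the host knows every size by launch time, and the total fits into one block's shared memory (typically 48 KB by default). The block gets one pool sized to the sum of all slices, and each thread finds its own slice through an offsets table of prefix sums. The names perThreadSlices and sliceSums are hypothetical, float is used instead of double for brevity, and the "work" is just a sum standing in for the real computation.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel, single block: thread t owns pool[offsets[t] .. offsets[t+1]).
__global__ void perThreadSlices(const float* data, const int* offsets, float* sums)
{
    extern __shared__ float pool[];  // total size set at launch time

    int t     = threadIdx.x;
    int begin = offsets[t];
    int len   = offsets[t + 1] - begin;
    float* mine = pool + begin;      // this thread's private slice

    for (int i = 0; i < len; ++i) mine[i] = data[begin + i];

    // No __syncthreads() needed: each thread touches only its own slice.
    float s = 0.0f;
    for (int i = 0; i < len; ++i) s += mine[i];
    sums[t] = s;
}

// Hypothetical host helper: flattens the inputs, builds offsets, launches one block.
std::vector<float> sliceSums(const std::vector<std::vector<float>>& mats)
{
    int numThreads = static_cast<int>(mats.size());
    std::vector<int> offsets(numThreads + 1, 0);
    std::vector<float> flat;
    for (int t = 0; t < numThreads; ++t) {
        flat.insert(flat.end(), mats[t].begin(), mats[t].end());
        offsets[t + 1] = static_cast<int>(flat.size());
    }

    float *dData = nullptr, *dSums = nullptr;
    int* dOffsets = nullptr;
    cudaMalloc(&dData, flat.size() * sizeof(float));
    cudaMalloc(&dOffsets, offsets.size() * sizeof(int));
    cudaMalloc(&dSums, numThreads * sizeof(float));
    cudaMemcpy(dData, flat.data(), flat.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dOffsets, offsets.data(), offsets.size() * sizeof(int), cudaMemcpyHostToDevice);

    size_t smemBytes = flat.size() * sizeof(float);  // sum of all slice sizes
    perThreadSlices<<<1, numThreads, smemBytes>>>(dData, dOffsets, dSums);

    std::vector<float> sums(numThreads);
    cudaMemcpy(sums.data(), dSums, numThreads * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dData);
    cudaFree(dOffsets);
    cudaFree(dSums);
    return sums;
}
```

This only moves the sizing decision to the host; if the sizes are truly unknown until the kernel is running, this sketch does not apply.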

tera · December 26, 2011, 5:29pm

You probably want to do a tiled matrix multiplication for optimal use of shared memory. In that case, you can use a constant tile size even if the matrix sizes are all different.

You also probably want to assign (at least) a block per matrix, not a single thread.
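
As a rough illustration of the constant tile size idea, here is a minimal sketch of a tiled multiplication for one matrix pair of arbitrary dimensions. The tile size of 16, the kernel name tiledMatMul, and the host helper matMul are assumptions made for this example; matrices are taken to be row-major floats, and error checking is omitted. Handling a whole batch of differently sized matrices (for example, one block per matrix) is not shown.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int TILE = 16;  // fixed at compile time, independent of matrix sizes

// C (M x N) = A (M x K) * B (K x N), all row-major.
__global__ void tiledMatMul(const float* A, const float* B, float* C,
                            int M, int K, int N)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < (K + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        // Out-of-range entries are padded with zeros, so any size works.
        As[threadIdx.y][threadIdx.x] = (row < M && aCol < K) ? A[row * K + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (bRow < K && col < N) ? B[bRow * N + col] : 0.0f;
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // finish with this tile before it is overwritten
    }
    if (row < M && col < N) C[row * N + col] = acc;
}

// Hypothetical host helper that multiplies two host matrices on the GPU.
std::vector<float> matMul(const std::vector<float>& A, const std::vector<float>& B,
                          int M, int K, int N)
{
    std::vector<float> C(M * N);
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, A.size() * sizeof(float));
    cudaMalloc(&dB, B.size() * sizeof(float));
    cudaMalloc(&dC, C.size() * sizeof(float));
    cudaMemcpy(dA, A.data(), A.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), B.size() * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((N + TILE - 1) / TILE, (M + TILE - 1) / TILE);
    tiledMatMul<<<grid, block>>>(dA, dB, dC, M, K, N);

    cudaMemcpy(C.data(), dC, C.size() * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return C;
}
```

Because the tiles are statically sized, no dynamic shared memory is needed here at all; the matrix dimensions only affect the grid size and the number of loop iterations.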
