Dynamic Shared Memory allocation of more than one array

akavo · June 19, 2011, 4:11am

Hello, I have three questions whose answers I couldn’t find in the documentation or on the forum.

1. I read about how I could pass the size of a dynamically allocated array stored in shared memory as a template parameter. However, I need to dynamically allocate two arrays in shared memory with different lengths. How do I go about doing this?
2. My second question deals with simple structs and whether their member variables are stored in registers or in local memory. I have a struct with a small number of primitive member fields which are used frequently by the kernel. How can I make sure that these fields are stored in registers rather than in slow local memory? Does the “register” keyword do anything? I’ve heard that modern compilers ignore it even in regular C++.
3. Finally, one of my kernels needs a small array of numbers as an argument. I pass the array through constant memory, but my kernel is invoked frequently and the copy to constant memory has noticeable overhead. I could also copy the array to global memory, but I do not expect the performance to be any different. Is there any way I can pass the contents of the array “by value”, avoiding a memcpy? I can’t put the values in a struct because I don’t know the size of the array a priori.

I would really appreciate your knowledge on these issues.
Thanks

hyqneuron · June 19, 2011, 5:09am

1. Allocate shared memory with the combined size of the two arrays. Pass the size of the first array to the kernel as a parameter so that the kernel knows the offset at which the second array starts.
2. In the documentation that I’ve read, I’ve never seen any mention of the register keyword. I would expect the compiler to be smart enough to keep frequently used struct members in registers. You can confirm the behaviour of the compiler using cuobjdump -sass file. If the behaviour is not what you want, you can declare the things you want to keep in registers as separate variables and do the load and store explicitly before and after the computation.
3. You’re saying that every time the kernel is invoked the argument array would be different? What you are thinking of is not impossible, but it would look quite weird to me. Kernel arguments must have a fixed size known at compile time, so to do what you want you would need an argument space as large as the maximal size of the array you want to pass in.

__global__ void kernel(int arg1, int arg2, int arg3, int arg4, int arg5, int arg6, int arg7, int arg8 /* , ... */)
{
}

Wouldn’t this be very weird? Copying to global memory looks like the more straightforward solution, and its overhead can be completely hidden as long as the second kernel launch’s arguments do not depend on the result of the first kernel launch and the kernel is not extremely short.
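For reference, the "argument space as large as the maximal size" idea can also be written as a fixed-capacity struct passed by value instead of a long list of scalars. The following is only a minimal sketch: the names `SmallArray`, `MAX_N`, `makeSmallArray`, and `scaleBySum` are hypothetical and not from this thread, and the bound of 32 elements is an assumption. Kernel parameter space is limited (256 bytes on compute capability 1.x, 4 KB on 2.x), so this only suits genuinely small arrays.

```cuda
#include <cuda_runtime.h>

// Hypothetical upper bound on the array length; it must be fixed at compile time.
constexpr int MAX_N = 32;

// Fixed-capacity wrapper: only the first `n` entries of `v` are meaningful.
struct SmallArray {
    int   n;
    float v[MAX_N];
};

// Guard against outgrowing the 4 KB kernel parameter space of compute capability 2.x.
static_assert(sizeof(SmallArray) <= 4096, "SmallArray exceeds the kernel parameter space");

// Host helper: pack a runtime-sized array into the fixed-capacity struct.
inline SmallArray makeSmallArray(const float *src, int n)
{
    SmallArray a = {};
    a.n = (n < MAX_N) ? n : MAX_N;   // truncates silently; a real program should report this
    for (int i = 0; i < a.n; ++i) a.v[i] = src[i];
    return a;
}

// The struct travels with the launch itself, so no separate cudaMemcpy is issued for it.
__global__ void scaleBySum(float *out, int len, SmallArray coeffs)
{
    float sum = 0.0f;
    for (int i = 0; i < coeffs.n; ++i) sum += coeffs.v[i];

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < len) out[idx] *= sum;
}

// Launch (assuming d_out, len, h_coeffs and n already exist):
//   scaleBySum<<<grid, block>>>(d_out, len, makeSmallArray(h_coeffs, n));
```

The whole struct is copied on every launch regardless of `n`, so the capacity is worth keeping as small as the application allows.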

akavo · June 19, 2011, 6:07pm

1. What if the two arrays had different types?
2. I’d rather not trust the compiler. If anyone knows more about how structs are stored in memory, it would be helpful. I could use plain variables without the struct, but that would be a somewhat ugly solution.
3. That’s what I thought. I was hoping for some sort of trick to pass arrays as parameters, because I have a short kernel which is executed very often and the overhead is noticeable relative to the time the kernel takes to execute.

hyqneuron · June 20, 2011, 1:50pm

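// size1 and size2 are the sizes of the two arrays in bytes.
__global__ void kernel(int size1, int size2)
{
	extern __shared__ Type1 Array1[];
	// Array2 starts size1 bytes after Array1. size1 must be a multiple of
	// sizeof(Type1) and must leave Array2 correctly aligned for Type2.
	Type2 *Array2 = (Type2*)(Array1 + size1 / sizeof(Type1));
}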
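When the two types have different alignment requirements, one way to make the layout explicit is to declare a single raw byte buffer and round the second array's offset up. This is a minimal sketch under stated assumptions: `float` and `double` stand in for the two element types, and the names `alignUp`, `secondArrayOffset`, `sharedBytes`, and `twoArrays` are hypothetical.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Round `bytes` up to the next multiple of `alignment` (which must be a power of two).
__host__ __device__ inline size_t alignUp(size_t bytes, size_t alignment)
{
    return (bytes + alignment - 1) & ~(alignment - 1);
}

// Byte offset of the double array that follows n1 floats.
__host__ __device__ inline size_t secondArrayOffset(size_t n1)
{
    return alignUp(n1 * sizeof(float), alignof(double));
}

// Total dynamic shared memory needed for n1 floats followed by n2 doubles.
inline size_t sharedBytes(size_t n1, size_t n2)
{
    return secondArrayOffset(n1) + n2 * sizeof(double);
}

__global__ void twoArrays(int n1, int n2)
{
    // One untyped buffer; its size comes from the third launch parameter.
    extern __shared__ __align__(16) unsigned char smem[];

    float  *a = reinterpret_cast<float *>(smem);
    double *b = reinterpret_cast<double *>(smem + secondArrayOffset(n1));

    // Initialize both arrays cooperatively across the block.
    for (int i = threadIdx.x; i < n1; i += blockDim.x) a[i] = 0.0f;
    for (int i = threadIdx.x; i < n2; i += blockDim.x) b[i] = 0.0;
    __syncthreads();
    // ... use a[0..n1) and b[0..n2) here ...
}

// Launch (n1 and n2 are element counts here, not byte sizes):
//   twoArrays<<<grid, block, sharedBytes(n1, n2)>>>(n1, n2);
```

Computing the offset with the same helper on the host and the device keeps the allocation size and the in-kernel layout from drifting apart.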

The compiler rarely resorts to local memory for scalar fields. If you pass the struct(s) in from global memory, they will at least be cached in L1. If you declare an array of structs inside your kernel and it cannot fit into the available registers, or it is indexed with values the compiler cannot resolve at compile time, the array will be placed in local memory, which on compute capability 2.x devices is cached in L1 and L2 like global memory. If the array is small enough to fit into a thread’s registers and every index is known at compile time, the entire array can stay in registers. Where global or local memory is used, the compiler must still load the relevant parts of the struct into registers before it can do any non-atomic operation on them. How long the loaded parts stay in registers depends on your register pressure as well as on how frequently they are referenced.
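The explicit load and store suggested earlier in the thread might look like the following minimal sketch. The struct `Particle`, its fields, the helper `step`, and the kernel `advance` are all hypothetical. Compiling with `nvcc -Xptxas -v` reports the per-kernel register count and any local memory (lmem) use, which complements the `cuobjdump -sass` check.

```cuda
#include <cuda_runtime.h>

// Hypothetical small struct of primitive fields.
struct Particle {
    float x;
    float v;
};

// One update step on plain scalars; callable from host and device.
__host__ __device__ inline void step(float &x, float &v, float dt)
{
    v -= x * dt;
    x += v * dt;
}

__global__ void advance(Particle *p, int count, int steps, float dt)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= count) return;

    // Load the fields once into scalar locals, which the compiler can
    // usually keep in registers for the duration of the loop.
    float x = p[idx].x;
    float v = p[idx].v;

    for (int s = 0; s < steps; ++s) step(x, v, dt);

    // Store the results back to global memory once at the end.
    p[idx].x = x;
    p[idx].v = v;
}
```

Whether this changes anything compared with using the struct members directly depends on the compiler version and on register pressure, so it is worth comparing the reported register and lmem numbers for both variants.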
If I were you, I would try restructuring the algorithm so that a single kernel launch runs long enough to amortize the launch overhead.

If you’re on Windows Vista/7, you may want to move to Linux or switch to the TCC driver, since the default WDDM driver adds to the launch overhead.

akavo · June 20, 2011, 5:01pm

Yes, I’d rather do more work in the kernel, but the nature of the program won’t allow that.

Thank you, you’ve been very helpful.
