Dynamic Shared Memory allocation of more than one array

Hello, I have three questions whose answers I couldn’t find in the documentation or on the forum.

  1. I read about how I could pass the size of a dynamically allocated array stored in shared memory as a template parameter.
    However, I need to dynamically allocate two arrays of different lengths in shared memory. How do I go about doing this?

  2. My second question deals with simple structs and whether their member variables are stored in registers or in local memory. I have a struct with a small number of primitive member fields which are used frequently by the kernel. How can I make sure that these fields are stored in registers rather than in slow local memory? Does the “register” keyword do anything? I’ve heard that modern compilers ignore it even in regular C++.

  3. Finally, one of my kernels needs a small array of numbers as an argument. I pass the array through constant memory but my kernel is invoked frequently and the copy to constant memory has quite some overhead. I could also copy the array to global memory but I do not expect the performance to be any different.

Is there any way I can pass the contents of the array “by value”, avoiding a memory copy? I can’t put the values in a struct because I don’t know the size of the array a priori.

I would really appreciate your knowledge on these issues.

  1. Allocate shared memory with the combined size of the two arrays. Pass the size of the first array to the kernel as a parameter so the kernel would know the offset of the starting address of the second array.

  2. In the documentation that I’ve read, I’ve never seen any mention of the register keyword. I would expect the compiler to be smart enough to keep frequently used struct members in registers. You can confirm the behaviour of the compiler using cuobjdump -sass file. If the behaviour is not what you want, you can declare the things you want to keep in registers as local variables and do the load and store explicitly before and after the computation.

  3. You’re saying that every time the kernel is invoked the argument array would be different? What you have in mind is not impossible, but it would appear quite weird to me. Kernel arguments must have a fixed size known at compile time, so to do what you want you’ll probably need an argument space as large as the maximal size of the array you want to pass in.

__global__ void kernel(int arg1, int arg2, int arg3, int arg4, int arg5, int arg6, int arg7, int arg8...)



Wouldn’t this be very weird? Copying to global memory looks like a rather straightforward solution and the overhead can be completely hidden if the second kernel launch’s arguments do not depend on the result of the first kernel launch, and if the kernel is not extremely short.
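For reference, the "argument space as large as the maximal size" idea does not need one parameter per element: a struct that wraps a fixed-size array can be passed by value. The following is only a sketch under assumptions that are not from the thread: `MAX_COEFFS`, `CoeffArgs`, `evalPoly` and `runEvalPoly` are hypothetical names, the upper bound of 16 is arbitrary, and the polynomial evaluation is just a stand-in for whatever the real kernel does with its small array. The total size of kernel arguments is limited (4 KB historically), so this only works for small arrays.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

#define MAX_COEFFS 16  // hypothetical compile-time upper bound on the array length

// Fixed-size wrapper: the whole struct travels with the kernel launch as an argument.
struct CoeffArgs {
    int   count;               // number of valid entries in values
    float values[MAX_COEFFS];  // entries beyond count are ignored
};

// Stand-in workload: evaluates sum_k values[k] * x^k per element (Horner's rule).
__global__ void evalPoly(CoeffArgs args, const float *x, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float acc = 0.0f;
    for (int k = args.count - 1; k >= 0; --k)
        acc = acc * x[i] + args.values[k];
    out[i] = acc;
}

// Hypothetical host helper: evaluates the polynomial at a single point on the GPU.
float runEvalPoly(const std::vector<float> &coeffs, float x)
{
    assert(coeffs.size() <= MAX_COEFFS);
    CoeffArgs args = {};
    args.count = (int)coeffs.size();
    for (int k = 0; k < args.count; ++k) args.values[k] = coeffs[k];

    float *dX, *dOut, result = 0.0f;
    cudaMalloc(&dX, sizeof(float));
    cudaMalloc(&dOut, sizeof(float));
    cudaMemcpy(dX, &x, sizeof(float), cudaMemcpyHostToDevice);

    // No separate copy for the coefficients: args is passed by value.
    evalPoly<<<1, 32>>>(args, dX, dOut, 1);

    cudaError_t err = cudaMemcpy(&result, dOut, sizeof(float), cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);
    (void)err;
    cudaFree(dX);
    cudaFree(dOut);
    return result;
}
```

The copies of `x` and `out` exist only to make the sketch checkable; the point is that the small array itself needs no `cudaMemcpy`. Whether this is actually faster than a copy to constant or global memory would have to be measured.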

  1. What if the two arrays had different types?

  2. I’d rather not trust the compiler. If anyone knows more about how structs are stored in memory it would be helpful. I could use the variables without the struct but it would be sort of an ugly solution.

  3. That’s what I thought. I was hoping for some sort of trick to pass arrays as parameters, because I have a short kernel which is executed very often and the overhead is noticeable relative to the time it takes the kernel to execute.

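__global__ void kernel(int size1, int size2)
{
	// size1 and size2 are sizes in bytes, as the division by sizeof(Type1) implies.
	// size1 must be a multiple of sizeof(Type1) and must leave Array2 aligned for Type2.
	extern __shared__ Type1 Array1[];

	// Array2 starts size1 bytes after the start of Array1.
	Type2 *Array2 = (Type2*)(Array1 + size1 / sizeof(Type1));
}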
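A slightly fuller version of the same idea, including the launch, might look like the sketch below. It is only an illustration: the choice of `int` and `double` as the two element types, the helper names `alignUp`, `mixedArrays` and `runMixedArrays`, and the summing workload are all assumptions made for the example. It uses element counts rather than byte sizes and rounds the offset of the second array up so that it is correctly aligned.

```cuda
#include <cassert>
#include <cstddef>
#include <cuda_runtime.h>

// Rounds offset up to the next multiple of alignment.
__host__ __device__ inline size_t alignUp(size_t offset, size_t alignment)
{
    return (offset + alignment - 1) / alignment * alignment;
}

// n1 ints followed by n2 doubles, both carved out of one dynamic shared buffer.
__global__ void mixedArrays(int n1, int n2, double *out)
{
    extern __shared__ __align__(8) unsigned char smem[];
    int    *a = reinterpret_cast<int *>(smem);
    double *b = reinterpret_cast<double *>(smem + alignUp(n1 * sizeof(int), sizeof(double)));

    for (int i = threadIdx.x; i < n1; i += blockDim.x) a[i] = i;
    for (int i = threadIdx.x; i < n2; i += blockDim.x) b[i] = 0.5 * i;
    __syncthreads();

    if (threadIdx.x == 0) {          // one thread sums both arrays
        double sum = 0.0;
        for (int i = 0; i < n1; ++i) sum += a[i];
        for (int i = 0; i < n2; ++i) sum += b[i];
        out[0] = sum;
    }
}

// Hypothetical host helper: launches one block and returns the sum.
double runMixedArrays(int n1, int n2)
{
    double *dOut, result = 0.0;
    cudaMalloc(&dOut, sizeof(double));

    // Combined size: padded first array plus the second array.
    size_t bytes = alignUp(n1 * sizeof(int), sizeof(double)) + n2 * sizeof(double);
    mixedArrays<<<1, 64, bytes>>>(n1, n2, dOut);

    cudaError_t err = cudaMemcpy(&result, dOut, sizeof(double), cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);
    (void)err;
    cudaFree(dOut);
    return result;
}
```

The same padding expression is used on the host and in the kernel, so the size requested at launch always matches the layout the kernel assumes. Placing the array with the strictest alignment first would avoid the padding altogether.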

  2. The compiler rarely resorts to local memory. If you pass the struct(s) in from global memory, then the struct(s) will at least get cached at L1. If you declare an array of structs inside your kernel and the array cannot fit into the available registers, then the array will be stored in local memory, which is cached at L2 only. If you declare the array inside the kernel and it is small enough to fit into the registers that a thread has, then the entire array can stay in registers, provided it is only indexed with values the compiler can resolve at compile time. In the cases where global or local memory is used, the compiler must load the relevant parts of the struct into registers in order to do any non-atomic operation. How long the loaded parts can stay in registers depends on your register pressure as well as on how often they are referenced.

  3. If I were you, I would try altering the algorithm to make a single kernel launch run long enough.

If you’re on Windows Vista/7, you may want to move to Linux or switch to the TCC driver.
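Coming back to the struct question, the explicit load and store suggested earlier in the thread could be sketched roughly as follows. The struct `Params`, the kernel `iterate` and the helper `runIterate` are hypothetical names invented for this example. Scalar locals like these are normally good register candidates, but nothing here forces the compiler's hand, so it is still worth checking the result with cuobjdump -sass or by compiling with nvcc -Xptxas -v, which reports register and local memory usage.

```cuda
#include <cassert>
#include <cuda_runtime.h>

// Hypothetical struct with a few frequently used primitive fields.
struct Params {
    float scale;
    float offset;
    int   steps;
};

__global__ void iterate(const Params *p, float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Explicit loads: each field is read from global memory once into a local variable.
    const float scale  = p->scale;
    const float offset = p->offset;
    const int   steps  = p->steps;

    float x = data[i];
    for (int s = 0; s < steps; ++s)
        x = x * scale + offset;   // the loop only touches the locals
    data[i] = x;                  // explicit store after the computation
}

// Hypothetical host helper: runs the kernel on a single value and returns the result.
float runIterate(Params hp, float x0)
{
    Params *dp;
    float  *dx;
    cudaMalloc(&dp, sizeof(Params));
    cudaMalloc(&dx, sizeof(float));
    cudaMemcpy(dp, &hp, sizeof(Params), cudaMemcpyHostToDevice);
    cudaMemcpy(dx, &x0, sizeof(float), cudaMemcpyHostToDevice);

    iterate<<<1, 32>>>(dp, dx, 1);

    cudaError_t err = cudaMemcpy(&x0, dx, sizeof(float), cudaMemcpyDeviceToHost);
    assert(err == cudaSuccess);
    (void)err;
    cudaFree(dp);
    cudaFree(dx);
    return x0;
}
```

This keeps the struct as the interface while the hot loop works on plain scalars, which is less intrusive than dropping the struct entirely.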

Yes, I’d rather do more work in the kernel, but the nature of the program won’t allow that.

Thank you, you’ve been very helpful.