shared memory

I am wondering whether a kernel reserves a small part of shared memory for itself when it is called, or whether the entire 16 KB per multiprocessor is available to the programmer. I ask because I would like to use all of shared memory (down to the last byte) to store data for my calculation, but when I look at the .cubin file, 64 bytes appear to be in use for something I am not declaring. Is this standard, or should I be looking for something that is using shared memory without my knowing?

I can shrink my shared memory use and just run the kernel more times, but I would prefer to use as much as possible. Running more kernels also seems to take longer even though each one does less work: I launch the kernel 4 times as often with each launch doing 1/4 of the work, yet overall everything takes about twice as long, even though occupancy is better.

Another quick question: are float4s more efficient than float3s on the GPU? I converted my code from float4s to float3s to save some memory, and each kernel call suddenly took twice as long to run.


Kernel function parameters are passed via shared memory, and there may be other runtime items stored there that I’m not aware of.


John Stone

For the shared memory usage of the kernel, look in the .cubin file. That figure includes kernel arguments and static allocations. Then add the amount of shared memory bound to the extern pointer at kernel launch (if you use this feature). That sum is what must stay within 16 KB.
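The same bookkeeping can also be done at runtime on current toolkits. The following is a minimal sketch, not a definitive recipe: the kernel `scale` and the helper `dynamicBudget` are hypothetical names made up for illustration, while `cudaFuncGetAttributes` and `cudaGetDeviceProperties` are standard runtime calls. Note that on architectures newer than the one discussed here, kernel arguments are no longer placed in shared memory, so the reported static figure may not include them.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel with one static and one dynamic shared array.
// Assumes blockDim.x <= 64 and blockDim.x * sizeof(float) dynamic bytes.
__global__ void scale(const float *in, float *out, int n)
{
    __shared__ float staticBuf[64];        // static: part of the compiler-reported figure
    extern __shared__ float dynamicBuf[];  // dynamic: sized by the launch configuration

    int t = threadIdx.x;
    int i = blockIdx.x * blockDim.x + t;
    staticBuf[t] = (i < n) ? in[i] : 0.0f;
    dynamicBuf[t] = 2.0f * staticBuf[t];
    __syncthreads();                       // make neighbouring slots visible
    int next = (t + 1) % blockDim.x;
    if (i < n) out[i] = staticBuf[next] + dynamicBuf[next];
}

// Dynamic bytes still available once the static usage is subtracted
// from the per-block limit.
size_t dynamicBudget(size_t limitBytes, size_t staticBytes)
{
    return staticBytes >= limitBytes ? 0 : limitBytes - staticBytes;
}

int main()
{
    cudaFuncAttributes attr;
    cudaDeviceProp prop;
    if (cudaFuncGetAttributes(&attr, scale) != cudaSuccess ||
        cudaGetDeviceProperties(&prop, 0) != cudaSuccess) {
        fprintf(stderr, "CUDA query failed\n");
        return 1;
    }
    size_t budget = dynamicBudget(prop.sharedMemPerBlock, attr.sharedSizeBytes);
    printf("static: %zu B, limit: %zu B, dynamic budget: %zu B\n",
           attr.sharedSizeBytes, prop.sharedMemPerBlock, budget);
    return 0;
}
```

The dynamic part is the third launch parameter, e.g. `scale<<<blocks, 64, 64 * sizeof(float)>>>(in, out, n);`, and it has to fit within the budget printed above.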

For the float3: On the 8800, unlike previous-generation hardware, it does not make sense to use vector types unless you have to, because the new multiprocessors operate on scalars. The slowdown with float3 is probably due to non-coalesced reads and writes, or to multiple 32-bit reads and writes replacing a single 128-bit one. Check your variable alignments.
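To make the alignment point concrete, here is a small sketch. The kernels `copy3` and `copy4` and the helper `padToFloat4` are hypothetical names used only for illustration; the size and alignment checks reflect how the built-in `float3` and `float4` types are defined in the CUDA headers. Padding each element to a float4 is one common workaround, assuming the extra 4 bytes per element are affordable.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// float3 is three packed floats with 4-byte alignment; float4 is a single
// 16-byte-aligned word, so element i of a float4 array is always aligned.
static_assert(sizeof(float3) == 12, "float3: three packed floats");
static_assert(alignof(float3) == 4, "float3: only 4-byte aligned");
static_assert(sizeof(float4) == 16 && alignof(float4) == 16,
              "float4: one aligned 16-byte word");

// Pad a float3 to a float4 so it can be moved as one aligned 16-byte element.
__host__ __device__ inline float4 padToFloat4(float3 v)
{
    return make_float4(v.x, v.y, v.z, 0.0f);
}

// Typically compiles to three 32-bit loads and stores per thread, with
// consecutive threads 12 bytes apart.
__global__ void copy3(const float3 *in, float3 *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Typically compiles to one 128-bit load and store per thread.
__global__ void copy4(const float4 *in, float4 *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

int main()
{
    float4 p = padToFloat4(make_float3(1.0f, 2.0f, 3.0f));
    printf("sizeof(float3)=%zu sizeof(float4)=%zu padded=(%.1f, %.1f, %.1f, %.1f)\n",
           sizeof(float3), sizeof(float4), p.x, p.y, p.z, p.w);
    return 0;
}
```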


It’s good not to use the full 16 KB of shared memory in your kernel, because then multiple blocks can be scheduled on a multiprocessor at once, which hides the latency of memory reads and similar operations.

I’ve actually taken this into account in my calculations and done some testing with different numbers of blocks but the same total number of threads. Right now it appears (from the CUDA profiler output) that launching one block per multiprocessor with 256 threads each is faster for me than launching four blocks per multiprocessor with 64 threads each. This happens even though the second configuration has 50% occupancy instead of the 33% of the first. Can anyone explain why this might be occurring?
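A profiler-independent way to check a comparison like this is to time both configurations with CUDA events. The sketch below is only an illustration: the `work` kernel and the `blocksFor` and `timeLaunch` helpers are hypothetical stand-ins, not my actual code, and the problem size is arbitrary.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in for the real kernel.
__global__ void work(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

// Number of blocks needed to cover n elements (rounds up).
int blocksFor(int n, int threadsPerBlock)
{
    return (n + threadsPerBlock - 1) / threadsPerBlock;
}

// Time a single launch in milliseconds using CUDA events.
float timeLaunch(float *d, int n, int threadsPerBlock)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    work<<<blocksFor(n, threadsPerBlock), threadsPerBlock>>>(d, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);            // wait until the kernel has finished
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const int n = 1 << 20;                 // same total thread count for both runs
    float *d = nullptr;
    if (cudaMalloc(&d, n * sizeof(float)) != cudaSuccess) return 1;
    cudaMemset(d, 0, n * sizeof(float));
    timeLaunch(d, n, 256);                 // warm-up launch, result discarded
    printf("256 threads/block: %.3f ms\n", timeLaunch(d, n, 256));
    printf(" 64 threads/block: %.3f ms\n", timeLaunch(d, n, 64));
    cudaFree(d);
    return 0;
}
```

Averaging over several launches per configuration would give steadier numbers than the single measurement shown here.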

I have some more tests to do, and I’m not yet running my entire calculation, but I will eventually pick the configuration that consistently runs the fastest.

Thanks for the tip about float3s; alignment issues causing the slowdown makes perfect sense. I wanted to use float3s to save some space, but the performance hit isn’t worth it to me.