shared memory

I am wondering whether a kernel uses a small part of shared memory for itself when it is called, or whether the entire 16K per multiprocessor is available for programmer use. I ask because I would like to use all of shared memory (down to the last byte) to store data for my calculation, but when I look at the .cubin file there seem to be 64 bytes in use that I am not declaring. Is this standard, or should I be looking for something that is using shared memory that I am unaware of?

I am able to shrink my shared memory use and just run the kernel more times, but I would prefer to use as much as possible. It also seems to take longer to run more kernel invocations even though each one does less work: I launch the kernel 4 times as often with each doing 1/4 the work, but overall everything takes about twice as long, even though occupancy is better.

edit
Another quick question: are float4’s more efficient than float3’s to use on the GPU? I converted my code from float4’s to float3’s to save some room in memory, and each kernel call suddenly took twice as long to run.

Thanks.

Kernel function parameters are passed via shared memory, and perhaps other runtime items I’m not aware of.

Cheers,

John Stone

For the shared mem usage of the kernel, look in the .cubin file. This figure includes the kernel arguments as well as static allocations. Then add the dynamic shared mem you bind to the extern pointer at kernel launch (if you use that feature). That sum is what must stay under 16k.
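As a minimal sketch of what that means in code (the kernel name, sizes, and figures here are made up for illustration):

#include <cuda_runtime.h>

__global__ void myKernel(float *out)             // hypothetical kernel
{
    __shared__ float staticBuf[64];              // static alloc: 256 bytes
    extern __shared__ float dynamicBuf[];        // sized at launch time

    dynamicBuf[threadIdx.x] = (float)threadIdx.x;    // placeholder work
    if (threadIdx.x < 64)
        staticBuf[threadIdx.x] = dynamicBuf[threadIdx.x];
    __syncthreads();
    out[threadIdx.x] = staticBuf[threadIdx.x % 64];
}

int main(void)
{
    const int threadsPerBlock = 256;
    float *d_out;
    cudaMalloc((void**)&d_out, threadsPerBlock * sizeof(float));

    // The third launch parameter is the dynamic shared mem size.
    // static (256 B here) + arguments/runtime use (e.g. the ~64 B you
    // saw in the .cubin) + dynamic must stay under 16384 bytes per block.
    size_t dynBytes = 15 * 1024;                 // example figure
    myKernel<<<1, threadsPerBlock, dynBytes>>>(d_out);

    cudaThreadSynchronize();
    cudaFree(d_out);
    return 0;
}

Compiling with nvcc --ptxas-options=-v will also print the static shared memory (and register) usage, which should match what you see in the .cubin.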

For the float3: unlike on previous-generation hardware, it does not make sense to use 3-component vectors on the 8800, because of the new multiprocessors, unless you have to. The slowdown is probably due to non-coalesced reads/writes: a float3 access is split into multiple 32-bit operations, instead of the single 128-bit operation you get with a float4. Check your variable alignments.
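A sketch of what that looks like (the kernel and array names are made up): float3 is 12 bytes, so consecutive threads read at a 12-byte stride, which the hardware cannot coalesce; float4 is 16 bytes and 16-byte aligned, so a half-warp’s reads coalesce into wide transactions.

// float3 is 12 bytes: the compiler issues three 32-bit loads per thread,
// and consecutive threads no longer hit consecutive aligned segments.
__global__ void readVec3(float3 *in, float *out)
{
    float3 v = in[threadIdx.x];          // three 32-bit transactions
    out[threadIdx.x] = v.x + v.y + v.z;
}

// float4 is 16 bytes and 16-byte aligned: one 128-bit load per thread,
// fully coalesced across a half-warp.
__global__ void readVec4(float4 *in, float *out)
{
    float4 v = in[threadIdx.x];          // single 128-bit transaction
    out[threadIdx.x] = v.x + v.y + v.z;  // .w is padding, unused
}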

Peter

It’s good not to use the full 16k of shared memory in your kernel, because it means multiple blocks can be scheduled on a multiprocessor at once, thus hiding latency for memory reads and such.
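As a back-of-the-envelope sketch (the per-block figure below is assumed, not measured): the scheduler can only place as many blocks on a multiprocessor as fit in its 16 KB of shared memory, so shaving a block’s shared usage directly buys concurrency.

#include <stdio.h>

int main(void)
{
    const size_t smemPerSM    = 16 * 1024;  // G80: 16 KB per multiprocessor
    const size_t smemPerBlock = 4 * 1024;   // example: your kernel's usage

    // Shared-memory limit only; registers and the 768-thread limit
    // on G80 can cap the real number lower.
    size_t blocksPerSM = smemPerSM / smemPerBlock;
    printf("Up to %zu blocks per multiprocessor (shared-mem limit only)\n",
           blocksPerSM);
    return 0;
}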

I’ve actually taken this into account in my calculations and done some testing with different numbers of blocks (but the same total number of threads) being launched. Right now it actually appears, from the CUDA profiler output, that launching one block per multiprocessor with 256 threads each is faster for me than launching four blocks per multiprocessor with 64 threads each. This occurs even though the second configuration has 50% occupancy instead of the 33% occupancy of the first. Can anyone explain why this may be occurring?

I have some more tests to do, and I’m not currently doing my entire calculation, but I will eventually pick the configuration which consistently runs the fastest.

Thanks for the tip on float3’s; the alignment issue causing the slowdown makes perfect sense. I wanted to use float3’s to save some space, but the performance hit makes it not worth it to me.