Hi,
I have a kernel which spends about 95% of its time copying data from global to shared memory. As I need global synchronization between blocks after each iteration, my kernel only performs a single iteration and is called repeatedly from a loop. After each kernel invocation I perform a cudaThreadSynchronize();
Most of the time within the kernel is spent on loading data into shared memory and, I suppose, that data is lost after I exit the kernel, which forces me to reload the whole data set. Is there a solution to this problem? Or do I need to perform block synchronization from within the kernel using atomics, so that I do not have to exit the kernel at all?
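A minimal sketch of the launch pattern I use (kernel name, buffers and sizes are just placeholders for my real code):

__global__ void iterate_kernel(const float *in, float *out);   // placeholder

void run(float *d_in, float *d_out, int n_iter, dim3 blocks, dim3 threads)
{
    for (int i = 0; i < n_iter; ++i) {
        iterate_kernel<<<blocks, threads>>>(d_in, d_out);
        cudaThreadSynchronize();                       // global barrier between iterations
        float *tmp = d_in; d_in = d_out; d_out = tmp;  // ping-pong the buffers
    }
}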
Mfatica is right… But if you have a look at what's in the shared memory the second time you call the kernel, you will (quite likely) find that it hasn't been cleared since the final block was executing there.
So I guess in theory you could set #blocks == #SMs and always leave an identifier in shared memory saying which data the block that last ran on that SM was working on (there might be a way to query this).
This is however unsupported and might not be very stable; it's probably going to give you lots of grief :-)
If you are running a Fermi-based card, I wonder if the L1 cache is persistent across kernel launches. If so, maybe setting the 48 KB L1 configuration will help your application a lot.
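If the cache does persist, you could request the larger L1 split per kernel from the host (my_kernel here is just a placeholder for your own kernel):

__global__ void my_kernel(float *data);   // placeholder kernel

// ask for the 48 KB L1 / 16 KB shared memory split for this kernel (Fermi only)
cudaFuncSetCacheConfig(my_kernel, cudaFuncCachePreferL1);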
But I am not sure whether, during the next iteration, the blocks will be assigned to the same SMs. If migration happens due to the global scheduler and the work is executed by different SMs, it is going to cause L1 cache misses.
True, but then the data more than likely resides in the L2 cache, which is also full-speed and fully coherent.
I believe manually managing shared memory is only a last-resort optimization, not unlike CPUs today, where programmers no longer have huge register files to use as “shared memory” and instead rely on caches to automatically handle the details for them at a slight cost in efficiency.
Ok, so I found some references saying that L2 bandwidth is ~230 GB/s. This yields roughly 230/15 ~= 15 GB/s per SM. Not very much compared to L1/shared memory, where you have 16 4-byte banks accessed at almost register speed (so roughly 1 clock cycle).
So 16*4 bytes / 1 CC = { let 1 CC = (1.3 GHz)^-1 = 0.77 * 10^-9 s } => ~83 GB/s per SM. So the speed difference there is ~5.5x.
I believe that unless total manual management of the L1 (this is the shared memory within a block, right?) is possible, as well as pinning of blocks to SMs, exiting the kernel will require reloading all the data.
That's why I am looking into global synchronization without exiting the kernel. Any thoughts on this? This should be a very common problem, shouldn't it? Aren't there any good solutions?
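Something like this is what I have in mind (just a sketch, and it only works if ALL blocks of the grid are resident on the GPU at the same time, e.g. #blocks <= #SMs; otherwise the spinning blocks deadlock against the ones that never got launched):

__device__ volatile unsigned int g_arrived = 0;

// barrier across all blocks; `goal` must grow each iteration because the
// counter is never reset (assumes a 1D thread block for brevity)
__device__ void global_barrier(unsigned int goal)
{
    __syncthreads();                      // whole block arrives together
    __threadfence();                      // make prior global writes visible
    if (threadIdx.x == 0) {
        atomicAdd((unsigned int *)&g_arrived, 1);
        while (g_arrived < goal)
            ;                             // spin until all blocks have arrived
    }
    __syncthreads();
}

// usage inside the kernel:
//   for (unsigned int it = 1; it <= n_iter; ++it) {
//       ... one iteration ...
//       global_barrier(it * gridDim.x);
//   }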
I'm not using Fermi. I currently get 1.5 Gflops out of my 130 Gflops. I perform approx. 10 floating-point ops per load instruction. I did some testing, and I basically have to increase the computation by a factor of 100x to compensate for the load delay and come close to 100 Gflops.
Is there really a 1000:1 ratio of flops vs. load instructions required to get any kind of efficiency? I am running 484 threads per block.
Is it the same data you need to load for all blocks, or is it different data? How many blocks does your grid have? How many bytes per block do you want to keep resident?
While you can't guarantee that the same block will be scheduled to the same SM again, you can make the same SMs access the same data again - just distribute the work according to the SM id you query.
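Querying the SM id can be done with a bit of inline PTX (note the PTX manual says %smid may change during execution, so treat it as a hint rather than a guarantee):

__device__ unsigned int get_smid(void)
{
    unsigned int smid;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}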
Each block uses different data, a portion of the grid. After each iteration I have to distribute the border values of the block to the neighbour blocks via global memory. The inner values of the tile that is owned by the block are, in theory, local to the block and do not need to be synchronized. The number of blocks and the total grid size are configurable. I have chosen the size of my tile (and hence block) to fit into the 16 KB of shared memory available on my device.
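In case it helps, here is a 1D analogue of my scheme (names and sizes are made up, my real code uses 2D tiles): each block keeps its tile in shared memory, and only the border values travel through global memory between launches. The reload at the top of the kernel is exactly what I want to avoid.

#define TILE 256   // blockDim.x must equal TILE; n must be a multiple of TILE

__global__ void jacobi_step(const float *in, float *out, int n)
{
    __shared__ float s[TILE + 2];                 // tile plus one halo point per side
    int gid = blockIdx.x * TILE + threadIdx.x;    // this thread's grid point

    // reload the whole tile from global memory (this is the expensive part)
    s[threadIdx.x + 1] = in[gid];
    if (threadIdx.x == 0)                         // left halo from the neighbour block
        s[0] = (gid > 0) ? in[gid - 1] : in[gid];
    if (threadIdx.x == TILE - 1)                  // right halo from the neighbour block
        s[TILE + 1] = (gid < n - 1) ? in[gid + 1] : in[gid];
    __syncthreads();

    // one relaxation step; the border values written here are what the
    // neighbour blocks pick up on the next kernel launch
    out[gid] = 0.5f * (s[threadIdx.x] + s[threadIdx.x + 2]);
}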
Doesn't that (the data to be accessed) depend on the block? Thanks for the link, I'll have a look at it.
Once the number of blocks exceeds the number that can run in parallel on the GPU (which in your case most likely happens once the total amount of shared memory needed exceeds the total shared memory present on the GPU), I see no way around using global memory.
You obviously want different blocks to process different data, and the easiest way to achieve that is to make the data depend on the block index in some way (another way would be to use atomic operations on global memory). But you are free to use any other means, which could be a combination of the SM id and atomic operations, or possibly just looping over the work you want to execute on the same SM.
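A sketch of the looping variant (all names are hypothetical): launch e.g. one block per SM and let each block pull work from a global counter, so which data a block processes is decoupled from its blockIdx. You could also key the choice on the SM id from get_smid() above instead of a plain queue.

__device__ unsigned int g_next_tile = 0;

__global__ void persistent_kernel(float *data, unsigned int num_tiles)
{
    __shared__ unsigned int tile;                 // current work item, shared by the block
    for (;;) {
        if (threadIdx.x == 0)
            tile = atomicAdd(&g_next_tile, 1);    // grab the next tile index
        __syncthreads();
        if (tile >= num_tiles)
            break;                                // queue is empty, all threads leave
        // ... load tile into shared memory, process it, write results back ...
        __syncthreads();                          // don't let thread 0 overwrite `tile` early
    }
}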
There is one possible way - using registers to store the values. CC 2.x's MPs have really large register files. Though I believe it will be pretty troublesome as long as we don't have a working assembler; ptxas messes things up with its own optimization. Editing the cubin binary directly seems like a more viable method.
Of course, I’m just talking… or maybe I could try to make an assembler and have some fun with it…
Could you please explain a bit more how to distribute the work based on the retrieved SM_ID? Are threadIdx and blockIdx mapped to the SM_ID somehow?