Hi,
I have a kernel which spends about 95% of its time copying data from global to shared memory. As I need global synchronization between blocks after each iteration, my kernel only performs a single iteration and is called repeatedly from a loop. After each kernel invocation I perform a cudaThreadSynchronize();
Most of the time within the kernel is spent on loading data into shared memory and, I suppose, that data is lost after I exit the kernel, which forces me to reload the whole data set. Is there a solution to this problem? Or do I need to perform block synchronization from within the kernel using atomics, so that I do not have to exit the kernel at all?
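A minimal sketch of the launch pattern I use (kernel name, buffers and sizes are just placeholders for my real code):

__global__ void iterate_kernel(const float *in, float *out);   // placeholder

void run(float *d_in, float *d_out, int n_iter, dim3 blocks, dim3 threads)
{
    for (int i = 0; i < n_iter; ++i) {
        iterate_kernel<<<blocks, threads>>>(d_in, d_out);
        cudaThreadSynchronize();                       // global barrier between iterations
        float *tmp = d_in; d_in = d_out; d_out = tmp;  // ping-pong the buffers
    }
}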
Mfatica is right… But if you have a look at what's in the shared memory the second time you call the kernel, you will (quite likely) find that it hasn't been cleared since the final block was executing there.
So I guess in theory you could set #blocks == #SMs and always leave an identifier in shared memory saying which data the block that last ran on that SM was working on (there might be a way to query this).
This is however unsupported and might not be very stable; it's probably going to give you lots of grief :-)
If you are running a Fermi-based card, I wonder if the L1 cache is persistent across kernel launches. If so, maybe setting the 48 KB L1 configuration will help your application a lot.
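If the cache does persist, you could request the larger L1 split per kernel from the host (my_kernel here is just a placeholder for your own kernel):

__global__ void my_kernel(float *data);   // placeholder kernel

// ask for the 48 KB L1 / 16 KB shared memory split for this kernel (Fermi only)
cudaFuncSetCacheConfig(my_kernel, cudaFuncCachePreferL1);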
But I am not sure whether, during the next iteration, the blocks will be assigned to the same SMs. If migration happens due to the global scheduler and the work is executed by different SMs, it is going to cause L1 cache misses.
True, but then the data more than likely resides in the L2 cache, which is also full-speed and fully coherent.
I believe manually managing shared memory is only a last-resort optimization, not unlike CPUs today, where programmers no longer have huge register files to use as “shared memory” and instead rely on caches to automatically handle the details for them at a slight cost in efficiency.
Ok, so I found some references saying that L2 bandwidth is ~230 GB/s. This yields roughly 230/15 ~= 15 GB/s per SM. Not very much compared to L1/shared memory, where you have 16 4-byte banks accessed at almost register speed (so roughly 1 clock cycle).
So 16*4 bytes / 1 CC = { let 1 CC = (1.3 GHz)^-1 = 0.77 * 10^-9 s } => ~83 GB/s per SM. So the speed difference there is ~5.5x.
I believe that unless total manual management of the L1 (this is the shared memory within a block, right?) is possible, as well as pinning of blocks to SMs, exiting the kernel will require reloading all the data.
That's why I am looking into global synchronization without exiting the kernel. Any thoughts on this? This should be a very common problem, shouldn't it? Aren't there any good solutions?
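Something like this is what I have in mind (just a sketch, and it only works if ALL blocks of the grid are resident on the GPU at the same time, e.g. #blocks <= #SMs; otherwise the spinning blocks deadlock against the ones that never got launched):

__device__ volatile unsigned int g_arrived = 0;

// barrier across all blocks; `goal` must grow each iteration because the
// counter is never reset (assumes a 1D thread block for brevity)
__device__ void global_barrier(unsigned int goal)
{
    __syncthreads();                      // whole block arrives together
    __threadfence();                      // make prior global writes visible
    if (threadIdx.x == 0) {
        atomicAdd((unsigned int *)&g_arrived, 1);
        while (g_arrived < goal)
            ;                             // spin until all blocks have arrived
    }
    __syncthreads();
}

// usage inside the kernel:
//   for (unsigned int it = 1; it <= n_iter; ++it) {
//       ... one iteration ...
//       global_barrier(it * gridDim.x);
//   }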
I'm not using Fermi. I currently get 1.5 Gflops out of my 130 Gflops. I perform approx. 10 floating-point ops per load instruction. I did some testing, and I basically have to increase the computation by a factor of 100x to compensate for the load delay and come close to 100 Gflops.
Is there really a 1000:1 ratio of flops vs. load instructions required to get any kind of efficiency? I am running 484 threads per block.
Is it the same data you need to load for all blocks, or is it different data? How many blocks does your grid have? How many bytes per block do you want to keep resident?
While you can't guarantee that the same block will be scheduled to the same SM again, you can make the same SMs access the same data again - just distribute the work according to the SM id you query.
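Querying the SM id can be done with a bit of inline PTX (note the PTX manual says %smid may change during execution, so treat it as a hint rather than a guarantee):

__device__ unsigned int get_smid(void)
{
    unsigned int smid;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(smid));
    return smid;
}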
Each block uses different data, a portion of the grid. After each iteration I have to distribute the border values of the block to the neighbour blocks via global memory. The inner values of the tile that is owned by the block are, in theory, local to the block and do not need to be synchronized. The number of blocks and the total grid size are configurable. I have chosen the size of my tile (and hence block) to fit into the 16 KB of shared memory available on my device.
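In case it helps, here is a 1D analogue of my scheme (names and sizes are made up, my real code uses 2D tiles): each block keeps its tile in shared memory, and only the border values travel through global memory between launches. The reload at the top of the kernel is exactly what I want to avoid.

#define TILE 256   // blockDim.x must equal TILE; n must be a multiple of TILE

__global__ void jacobi_step(const float *in, float *out, int n)
{
    __shared__ float s[TILE + 2];                 // tile plus one halo point per side
    int gid = blockIdx.x * TILE + threadIdx.x;    // this thread's grid point

    // reload the whole tile from global memory (this is the expensive part)
    s[threadIdx.x + 1] = in[gid];
    if (threadIdx.x == 0)                         // left halo from the neighbour block
        s[0] = (gid > 0) ? in[gid - 1] : in[gid];
    if (threadIdx.x == TILE - 1)                  // right halo from the neighbour block
        s[TILE + 1] = (gid < n - 1) ? in[gid + 1] : in[gid];
    __syncthreads();

    // one relaxation step; the border values written here are what the
    // neighbour blocks pick up on the next kernel launch
    out[gid] = 0.5f * (s[threadIdx.x] + s[threadIdx.x + 2]);
}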
Doesn't that (the data to be accessed) depend on the block? Thanks for the link, I'll have a look at it.
Once the number of blocks exceeds the number that can run in parallel on the GPU (which in your case most likely happens once the total amount of shared memory needed exceeds the total shared memory present on the GPU), I see no way around using global memory.
You obviously want different blocks to process different data, and the easiest way to achieve that is to make the data depend on the block index in some way (another way would be to use atomic operations on global memory). But you are free to use any other means, which could be a combination of the SM id and atomic operations, or possibly just looping over the work you want to execute on the same SM.
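A sketch of the looping variant (all names are hypothetical): launch e.g. one block per SM and let each block pull work from a global counter, so which data a block processes is decoupled from its blockIdx. You could also key the choice on the SM id from get_smid() above instead of a plain queue.

__device__ unsigned int g_next_tile = 0;

__global__ void persistent_kernel(float *data, unsigned int num_tiles)
{
    __shared__ unsigned int tile;                 // current work item, shared by the block
    for (;;) {
        if (threadIdx.x == 0)
            tile = atomicAdd(&g_next_tile, 1);    // grab the next tile index
        __syncthreads();
        if (tile >= num_tiles)
            break;                                // queue is empty, all threads leave
        // ... load tile into shared memory, process it, write results back ...
        __syncthreads();                          // don't let thread 0 overwrite `tile` early
    }
}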
There is one possible way - using registers to store the values. CC 2.x's MPs have really large register files. Though I believe it will be pretty troublesome as long as we don't have a working assembler; ptxas messes things up with its own optimization. Editing the cubin binary directly seems like a more viable method.
Of course, I’m just talking… or maybe I could try to make an assembler and have some fun with it…
Could you please explain a bit more how to distribute the work based on the retrieved SM_ID? Are threadIdx and blockIdx mapped to the SM_ID somehow?