Please help with __shared__ memory different usage than in samples

Dear All,

I’m working currently on a CFD simulation program, and as the program uses large number of accesses to the global memory, I was thinking to change to local memory. My only problem is, that as the program calculates differentials - unlike in the CUDA samples - instead of BLOCK_SIZE×BLOCK_SIZE shared memory, I would need (BLOCK_SIZE+2)×(BLOCKSIZE+2).
In this case how can I copy the data to the shared memory efficiently? (In the sample every thread copies one data, but I cannot do that obviously…)

                                                              Thanks for the help in advance,
                                                                                        Laszlo Daroczy

This is for ghost cell/boundary condition data, I take it? The easiest way to do it is still to have each thread read a single value, but have the boundary threads only read data, not do any calculations. You then overlap each block so that every cell gets covered. It is possible to get close to device to device memory copy speed using that strategy.

This is not only for the boundaries. e.g. if I have a block in the mesh from (k…k+16)×(m…m+16), then in the first cell I need to have for the differentials (k-2,m-2), and for the last cell I will need (k+16+2,m+16+2).

(Second order central differences).

I understand that, and the scheme I have described works fine for any order accuracy central scheme - in 1D, 2D or 3D (when I mentioned boundary threads I mean threads at the boundary of each block irrespective of whether they lie on a physical domain boundary or not, sorry if that is confusing). If you want anything other than natural boundary conditions, you need to compute some ghost cells whose values will satisfy the boundary condition at the true boundary cell.

To expand a little: in 2D, if you are using 16x16 blocks, the difference calculations are done on a 14x14 “inner block”, with threads in the first and last row and first and last column only being used to read values from global memory into the local shared memory copy, and not actually do the differencing.

OK, than now I understand your previous reply :teehee:

Than my question is how do you block the calculations at the boundaries? (with if-else statement because it’s also slower)?

The simplest answer is “you don’t”. I usually add extra cells around the edges of the computational domain and populate those in a separate kernel which can be run as often as you need to so that the boundary conditions you want to apply are approximated to sufficient accuracy (for something like a simple reflective or periodic boundary condition it can be as simple as a memory copy).

Another Edit: Conditional execution in CUDA doesn’t have to be slow. It is only when threads within a given running warp need to follow different execution paths than branch divergence can become a performance issue. If you can structure your code in such a way that there isn’t much branch divergence at a warp (group of 32 threads) level, then code with conditional execution or branch statements is not necessarily a barrier to good performance.

I think, we misunderstood each other again.

First, in case of 2nd order differences in case of 1616 block the inner part can be only 1212.

Second, I was asking how do you prevent in the block the calculating on the boundary of the block? (e.g. 1st cell,1 row)

Do the PDEs your are discretising include higher order spatial derivatives than the second order? I was thinking in only in terms first and second order derivative, which only require a stencil width of three in any direction. But the same idea applies to higher order derivatives or higher order accuracy discretisaton (I have used fourth order central schemes in the same way).

Ok, block boundaries. Just wrap the calculations in an if statement. If there is no “else” then it doesn’t generate branch divergence. The non-participating threads just get masked out and don’t do anything for that section of the code. If you have more unconditional code afterwards, use the __syncthreads() primitive as a synchronization barrier if required (this is important if you are going to do 3D calculations).

Sorry, I meant second terms… :rolleyes:

Thanks, I will try!

Does the portion of grid you do calculations on have to be square ?

If it doesn’t then an approach to reduce uncoalesced reads is to use a grid of say 5 rows by 192 columns, do the calculations on the middle row and 160 columns, then discard the top row of that grid and read in the next row down, and so on.
This turns 4 uncoalesced reads into 2 coalesced ones.
Not suitable for 3D or higher

Actually, the grid is not a square, but the resolution is 2^n × (2+32k). But actually I was not totally sure about the coalesced memory from the manual, if u could help with that i would appreciate! :rolleyes:

I have one other question. On the boundaries, I use different differentiating schemes. (3 different on the inner, 2 on the outer).

My question is that in this case what’s faster? To solve them in different kernels and use i(){} in the other kernel, or to use only one kernel but with if(){}else{} ??

It is pretty straightforward. Each half warp of running threads (ie successive groups of 16 within the block) should read contiguous words from global memory starting on a even 16 word boundary. So if you are reading floats, each sequence of 16 threads should read 16 floats from the same contiguous block of memory, aligned on a 16 float boundary. Which means, amongst other things, you should pad the row or column dimensions (depending on whether you use row or column major ordering) of your storage to an even mutltiple of 16. In practice it is pretty easy to achieve. The CUDA profiler provided some good information on memory coalescing and read performance that you can use to check whether you code is getting it correct or not.

Probably a answer best obtained by trying both and benchmarking them. And sharing the results of the benchmark here ;)

For explicit schemes I use ghost cells, so my codes look more like the first option. For complex, non-linear boundary conditions, it means that the boundary condition kernel takes almost as long or sometimes even longer than the spatial differencing kernel, but I prefer it that way because it provides a lot more flexibility. There are many situations where is it perfectly acceptable to update the boundary conditions periodically, rather than at every time step or stage. When the code is working on an AMR grid, it also allows other alternatives (like using the GPU texture/filtering hardware to interpolate boundary ghost cell values from coarser or finer grid solutions, for example).

Anyway, is it somewhere written with exact number how faster shared, coalesced etc. memory is than global?

“Coalesced memory” isn’t a type of memory in the GPU, it is a global memory access pattern. Basically the GPU has two (or three in the case of GT200 cards) global memory read patterns. Sequential (ie. read only a single word in one transaction) or coalesced (ie. read 16 words in a single transaction). The performance difference between those two is enormous – more than 16 times – for obvious reasons. The best practices guide and the white papers accompanying the matrix multiplication discuss the effect of memory coalescing in detail, but put simply, it can be the difference between an otherwise well designed code hitting 5 Gflops or 100 Gflops.

And between shared and global?

Global memory has a latency of around 400-600 cc (much debate about the actual speed though).

When properly accessing shared memory this should according to the PG be done at register (“instant”) speed.

Local memory is just global memory and is generally for register spilling over. In my opinion you should try avoiding using this at all cost.

In general:

1, Do coalesced reads from off-chip (global) to either registers or shared memory

2, do computations

3, write back to global

If possible, try to make sure that you have room for several blocks on each SM at a time, this helps you hide off-chip latencies.

Hope this is helpful.

You will find all of that in the documents I just mentioned, but the big different between shared memory and global memory is latency. Shared memory has latency estimated to be less than 25 shader clock cycles, global memory greater than 500 shader clock cycles. On the GT200 shared memory bandwidth is a bit higher than shared memory, but the latency is the big difference. Which is why for something like finite differencing, it makes sense to use coalesced memory global memory reads to populate a shared memory cache, and then performance the difference calculations using the contents of the shared memory cache. It maximises global memory read bandwidth and minimizes the overall impact of memory latency. If you do that, you should be able get close the kernel close to peak global memory bandwidth (which is >100 Gb/s on current GT200 based cards).

Do you have a reference to this? I’ve been suspecting that it’s not allways as fast as the PG claims. Perhaps this has been discussed on a thread or divulged in some whitepaper?



I think one of the Volkov/Demmel papers has a pretty decent analysis of the memory structure and latency of either the G92 or GT200, and it showed their estimate of shared memory latency was about the same as the instruction pipeline latency (20-30 cycles IIRC).

EDIT: It was this one, and their numbers were 36 cycles for shared memory and 510 cycles for global memory on a 8800GTX. Whether that is still applicable is anyones’ guess.

Thanks! I had read the paper but skimmed past the 36 cycles part. Volkov is very thorough in his analysis.