Folks, I’m hoping you can help me with this as I’m a bit of a neophyte when it comes to effectively using shared memory in ways that aren’t commonly addressed in tutorials, &c. (NOTE: what you’ll be seeing below will be in (CUDA) Fortran as that’s what I’m programming in. Luckily, even if you are a pure CUDA C user, I think you’ll be able to figure out what I’m doing (for -> do loops, arrays start at 1, &c.) and nothing I’m doing is really dependent on the language, more the CUDA concept/thinking.)
First, the code I’m looking at is laid out as follows (I can provide more explicit examples of the code if you’d like, but I thought I’d start with the basic structure). Before moving to CUDA, the code was one large loop over i containing many smaller loops (over k) beneath, which can’t all be fused due to unavoidable loop-carried dependencies in the k loops:
do i=1,2000
   do k=1,72
      aa(k) = pl(i,k)
      ...
   end do
   ...
   do k=1,72
      bb = aa(k)*2.0
      ...
   end do
   ! (add in 20 or 30 more of these k loops)
end do
I’ve found through trial and error that I get my best performance by handling one value of i per thread via:
idx = (blockidx%x - 1) * blockdim%x + threadidx%x
if (idx <= 2000) then
   do k=1,72
      aa(k) = pl(idx,k)
      ...
   end do
   ...
end if
launched with a <<<N/64,64>>> style kernel call. (Again, Fortran = column-major and arrays start at 1, thus the slightly different indexing.)
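For concreteness, here’s a stripped-down sketch of what I mean, with a kernel that just copies pl through a per-thread temporary and doubles it (mykernel, out, and pl_d/out_d are placeholder names, not my real code):

attributes(global) subroutine mykernel(pl, out)
  real :: pl(2000,72), out(2000,72)   ! device arrays in global memory
  real :: aa(72)                      ! per-thread temporary
  integer :: idx, k
  idx = (blockidx%x - 1) * blockdim%x + threadidx%x
  if (idx <= 2000) then
     do k = 1, 72
        aa(k) = pl(idx,k)
     end do
     do k = 1, 72
        out(idx,k) = aa(k) * 2.0
     end do
  end if
end subroutine mykernel

! host side, rounding the grid up so 32 blocks * 64 threads = 2048 >= 2000:
! call mykernel<<<(2000+63)/64, 64>>>(pl_d, out_d)

One nice side effect of this layout, as I understand it: since Fortran is column-major, consecutive threads read consecutive elements pl(idx,k), so the global loads should already be coalesced.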
Now, in this code there are a few places I think shared memory would be useful because, at the moment, it isn’t used at all. (The code has been mightily scalarized with unrolling, relooping, and fusing so it’s a register beast.) First there are input arrays that are/can be accessed often:
real, dimension(2000,72) :: pl
where they are accessed as pl(idx,k). Likewise, there are many oft-reused temporary arrays:
real, dimension(72) :: rr,td,tt,rs,ts
that are used in multiple “do k=1,np” loops before passing their values to output arrays. But these have to be put in per-thread local (i.e., global) memory since their values differ thread-to-thread. (In older versions of the code, they were carried around as 2D arrays like rr(2000,72) so that each i had its own copy.)
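Inside the kernel these temporaries are just automatic per-thread arrays, e.g. (again a placeholder sketch; as I understand it, the compiler decides whether these live in registers or spill to per-thread local memory):

attributes(global) subroutine column_physics(...)
  ! one private copy of each array per thread
  real :: rr(72), td(72), tt(72), rs(72), ts(72)
  ...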
My first naive thought was to create a shared array that each thread could index via threadidx%x. Since I’ve found my best performance starts with 64 threads per block, the naive way to do this is:
real, shared, dimension(72,64) :: rr_sh,td_sh,tt_sh,rs_sh,ts_sh
where I can access the shared array as “rr_sh(k,threadidx%x)”. Obviously, though, this won’t work: just one of these arrays is 72 x 64 x 4 bytes = 18 kB, already bigger than the 16 kB of shared memory per block, and I’d want five of them.
However, I keep thinking there must be a way to do this via judicious use of syncthreads and if-statements (loading per-block in chunks, or some such) that I am just missing. Again, apologies if this is such a simple question you can’t believe I’m asking it, but porting this code to CUDA is a different beast from the usual style where I can use blocked/tiled 16x16 or 256-element shared arrays.
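To make concrete the kind of thing I keep imagining (a guess, not working code: chunked_kernel, pl_sh, and the chunk size of 12 are all made up), chunking k so that a block-wide tile of pl fits in shared memory would look something like:

attributes(global) subroutine chunked_kernel(pl, out)
  real :: pl(2000,72), out(2000,72)
  ! tile for one k-chunk: 64 threads x 12 k-values x 4 bytes = 3 kB
  real, shared :: pl_sh(64,12)
  integer :: idx, kk, kbase
  idx = (blockidx%x - 1) * blockdim%x + threadidx%x
  do kbase = 0, 72-12, 12
     if (idx <= 2000) then
        do kk = 1, 12
           pl_sh(threadidx%x, kk) = pl(idx, kbase+kk)
        end do
     end if
     call syncthreads()
     ! ... do the k-loop work for this chunk out of pl_sh
     !     instead of re-reading global memory ...
     if (idx <= 2000) then
        do kk = 1, 12
           out(idx, kbase+kk) = pl_sh(threadidx%x, kk) * 2.0
        end do
     end if
     call syncthreads()   ! don't overwrite the tile before everyone is done
  end do
end subroutine chunked_kernel

But since each thread here only ever touches its own row pl_sh(threadidx%x,:), I’m not sure this buys anything over reading pl(idx,k) from global memory directly, which is part of what I’m asking.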
I will note, though, that for largish problems (do i=1,20000) I am already getting a 10-14x speedup with CUDA. So it’s not like this is intractable as it is. I’m just hoping to eke out more performance by using all the resources available.