Shared Memory and Large, Thread-Unique, Input or Temporary Arrays

Folks, I’m hoping you can help me with this, as I’m a bit of a neophyte when it comes to using shared memory effectively in ways that aren’t commonly addressed in tutorials, &c. (NOTE: the code below is in (CUDA) Fortran, as that’s what I’m programming in. Luckily, even if you are a pure CUDA C user, I think you’ll be able to figure out what I’m doing (for -> do loops, arrays start at 1, &c.), and nothing here really depends on the language; it’s more about the CUDA concepts/thinking.)

First, the code I’m looking at is laid out as follows (I can provide more explicit examples of the code if you’d like, but I thought I’d start with the basic structure). Before moving to CUDA, the code was restructured into one large outer loop containing many smaller loops (over k) beneath it, which can’t all be fused due to unavoidable loop-carried dependencies in the k loops:

do i=1,2000

  do k=1,72
    aa(k) = pl(i,k)
    ...
  end do

  ...

  do k=1,72
    bb = aa(k)*2.0
    ...
  end do

  ! ... add in 20 or 30 more of these k loops ...

end do

I’ve found through trial and error that I get my best performance by assigning one value of i to each thread via:

idx = (blockidx%x - 1) * blockdim%x + threadidx%x

if (idx <= 2000) then

  do k=1,72
    aa(k) = pl(idx,k)
    ...
  end do

  ...

end if

launched with a <<<N/64,64>>> style kernel call. (Again, Fortran = column-major and arrays start at 1, thus the slightly different indexing.)
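Put together, the kernel and launch end up looking roughly like this (a stripped-down sketch with made-up names like flux_kernels, mykernel, flx, pl_d, and flx_d, not my actual code):

  module flux_kernels
    use cudafor
  contains
    attributes(global) subroutine mykernel(pl, flx, ni)
      implicit none
      real :: pl(:,:), flx(:,:)          ! kernel dummy arrays live in device memory
      integer, value :: ni
      real :: aa(72)                     ! per-thread temporary (per-thread local memory)
      integer :: idx, k
      idx = (blockidx%x - 1) * blockdim%x + threadidx%x
      if (idx <= ni) then
        do k = 1, 72
          aa(k) = pl(idx,k)
        end do
        ! ... the 20-30 other k loops, eventually filling flx(idx,:) ...
        do k = 1, 72
          flx(idx,k) = aa(k)*2.0
        end do
      end if
    end subroutine mykernel
  end module flux_kernels

and on the host side (pl_d and flx_d being the device copies of the arrays):

  use cudafor
  use flux_kernels
  integer :: ni = 2000
  real, device :: pl_d(2000,72), flx_d(2000,72)
  type(dim3) :: grid, block

  block = dim3(64,1,1)
  grid  = dim3((ni+63)/64,1,1)           ! round up so every i from 1 to ni gets a thread
  call mykernel<<<grid,block>>>(pl_d, flx_d, ni)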

Now, there are a few places in this code where I think shared memory would be useful, because at the moment it isn’t used at all. (The code has been mightily scalarized with unrolling, relooping, and fusing, so it’s a register beast.) First, there are input arrays that are (or can be) accessed often:

real, dimension(2000,72) :: pl

where they are accessed as pl(idx,k). Likewise, there are many oft-reused temporary arrays:

real, dimension(72) :: rr,td,tt,rs,ts

that are used in multiple “do k=1,np” loops before passing their values to output arrays. But these have to be put in per-thread local (i.e., physically global) memory since their values differ from thread to thread. (In older versions of the code, they were carried around as 2D rr(2000,72) arrays so that each i had its own copy.)
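Inside the kernel these are just ordinary local declarations, so each thread carries its own private copies in per-thread local memory (physically off-chip, as above). Schematically, in the same made-up kernel as the sketch earlier:

  real :: rr(72), td(72), tt(72), rs(72), ts(72)   ! one private copy per thread, in local memory

  do k = 1, 72
    rr(k) = 1.0 - exp(-aa(k))      ! made-up math, just the shape of one k loop
  end do

  td(1) = rr(1)
  do k = 2, 72
    td(k) = td(k-1)*rr(k)          ! made-up example of the loop-carried k dependences the real code has
  end do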

My first naive thought was to create shared arrays that each thread references via threadidx%x. But since I’ve found my best performance starts at 64 threads per block, the naive way to do this is:

real, shared, dimension(72,64) :: rr_sh,td_sh,tt_sh,rs_sh,ts_sh

where I can access the shared array as “rr_sh(k,threadidx%x)”. Obviously, though, this won’t work: just one of these arrays (72 x 64 x 4 bytes = 18 kB) is already larger than the 16 kB of shared memory.

However, I keep thinking there must be a way to do this via judicious use of syncthreads and if-statements (loading per-block or some such) that I am just missing. Again, apologies if this is such a simple question you can’t believe I’m asking it, but porting this code to CUDA is a different beast from the usual cases where I can use blocked/tiled 16x16 or 256-element shared arrays.
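For example, is it something like this that I’m missing? (A rough sketch I haven’t tried, with a made-up chunk size kchunk; the idea being to process k in chunks small enough that a 64-wide tile fits in shared memory.)

  integer, parameter :: kchunk = 8                  ! made-up: 64*kchunk*4 B = 2 kB per tile
  real, shared, dimension(64,kchunk) :: aa_sh
  integer :: idx, k0, kk

  idx = (blockidx%x - 1) * blockdim%x + threadidx%x

  do k0 = 0, 72-kchunk, kchunk
    if (idx <= 2000) then
      do kk = 1, kchunk
        aa_sh(threadidx%x,kk) = pl(idx,k0+kk)       ! coalesced: idx is the fast index of pl
      end do
    end if
    call syncthreads()
    ! ... work on aa_sh(threadidx%x,1:kchunk) in place of aa(k0+1:k0+kchunk) ...
    call syncthreads()                              ! before the next chunk overwrites the tile
  end do

The catch, of course, is that this only helps for loops that touch one chunk of k at a time, which my dependent k loops may not allow.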

I will note, though, that for largish problems (do i=1,20000), I am able to get a 10-14x speedup using CUDA. So it’s not like this is intractable as it is. I’m just hoping to eke out more performance by using all the resources available.

Hi
Even if using shared memory means fewer threads per block, it might work out faster. There are lots of places where I don’t have all the threads in a block (or even a warp) doing something.

Who knows, for your app having only 8 working threads per block might be best if it means you can use shared memory. (But if you can, try to make sure you do coalesced reads when copying from global to shared, i.e. use at least 16 threads.)
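Back-of-the-envelope, assuming single precision and your five 72-element temporaries (just my arithmetic, not tested):

  ! 16 threads/block: 72*16*4 B = 4.5 kB per array, 22.5 kB for all five  -> still over 16 kB
  !  8 threads/block: 72* 8*4 B = 2.25 kB per array, 11.25 kB for all five -> fits
  real, shared, dimension(72,8) :: rr_sh, td_sh, tt_sh, rs_sh, ts_sh      ! indexed as rr_sh(k,threadidx%x)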

Cheers

Thanks for the advice, but I think the hurt from running 8 threads per block (versus 64 or 256) far outstrips any benefit from shared memory use. In fact, I made shared memory versions of 7 of those oft-accessed temporary variables and it actually got slower than the local memory 8 tpb version. I’m guessing that was due to bad reads?

I’m beginning to think that memory access is not the limiting factor to my speedup. Still, any other advice is welcomed!


Maybe Fermi-based boards, with their L1/L2 caches, will provide the speed boost you’re looking for (makes a Jedi mind-control gesture and takes commission from NVIDIA).

Oh, believe me, I am waiting for Fermi! The cached local memory might (hopefully, will) do wonders for this code. Of course, until then, there is still Tesla hardware to optimize for. I’m doing what I can with maxregcount, block sizes, and loop scheduling, but I thought maybe I could use the available shared memory with this code as well.
