Shared Memory and Large, Thread-Unique, Input or Temporary Arrays

Folks, I’m hoping you can help me with this, as I’m a bit of a neophyte when it comes to using shared memory effectively in ways that aren’t commonly addressed in tutorials, &c. (NOTE: the code below is in (CUDA) Fortran, as that’s what I’m programming in. Luckily, even if you are a pure CUDA C user, I think you’ll be able to figure out what I’m doing (for -> do loops, arrays start at 1, &c.), and nothing here really depends on the language; it’s more about the CUDA concepts/thinking.)

First, the code I’m looking at is laid out as follows (I can provide more explicit examples of the code if you’d like, but I thought I’d start with the basic structure). Before moving to CUDA, the code was restructured into one large outer loop containing many smaller loops (over k) beneath it, which can’t all be fused due to unavoidable loop-carried dependencies in the k loops:

do i=1,2000

  do k=1,72
    aa(k) = pl(i,k)
    ...
  end do

  ...

  do k=1,72
    bb = aa(k)*2.0
    ...
  end do

  ! ... add in 20 or 30 more of these k loops ...

end do

I’ve found through trial and error that I get my best performance by assigning one value of i to each thread via:

idx = (blockidx%x - 1) * blockdim%x + threadidx%x

if (idx <= 2000) then

  do k=1,72
    aa(k) = pl(idx,k)
    ...
  end do

  ...

end if

launched with a <<<N/64,64>>> style kernel call. (Again, Fortran = column-major and arrays start at 1, thus the slightly different indexing.)
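Put together, the kernel and launch end up looking roughly like this (a stripped-down sketch with made-up names like flux_kernels, mykernel, flx, pl_d, and flx_d, not my actual code):

  module flux_kernels
    use cudafor
  contains
    attributes(global) subroutine mykernel(pl, flx, ni)
      implicit none
      real :: pl(:,:), flx(:,:)          ! kernel dummy arrays live in device memory
      integer, value :: ni
      real :: aa(72)                     ! per-thread temporary (per-thread local memory)
      integer :: idx, k
      idx = (blockidx%x - 1) * blockdim%x + threadidx%x
      if (idx <= ni) then
        do k = 1, 72
          aa(k) = pl(idx,k)
        end do
        ! ... the 20-30 other k loops, eventually filling flx(idx,:) ...
        do k = 1, 72
          flx(idx,k) = aa(k)*2.0
        end do
      end if
    end subroutine mykernel
  end module flux_kernels

and on the host side (pl_d and flx_d being the device copies of the arrays):

  use cudafor
  use flux_kernels
  integer :: ni = 2000
  real, device :: pl_d(2000,72), flx_d(2000,72)
  type(dim3) :: grid, block

  block = dim3(64,1,1)
  grid  = dim3((ni+63)/64,1,1)           ! round up so every i from 1 to ni gets a thread
  call mykernel<<<grid,block>>>(pl_d, flx_d, ni)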

Now, there are a few places in this code where I think shared memory would be useful, because at the moment it isn’t used at all. (The code has been mightily scalarized with unrolling, relooping, and fusing, so it’s a register beast.) First, there are input arrays that are (or can be) accessed often:

real, dimension(2000,72) :: pl

where they are accessed as pl(idx,k). Likewise, there are many oft-reused temporary arrays:

real, dimension(72) :: rr,td,tt,rs,ts

that are used in multiple “do k=1,np” loops before passing their values to output arrays. But these have to be put in per-thread local (i.e., physically global) memory since their values differ from thread to thread. (In older versions of the code, they were carried around as 2D rr(2000,72) arrays so that each i had its own copy.)
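Inside the kernel these are just ordinary local declarations, so each thread carries its own private copies in per-thread local memory (physically off-chip, as above). Schematically, in the same made-up kernel as the sketch earlier:

  real :: rr(72), td(72), tt(72), rs(72), ts(72)   ! one private copy per thread, in local memory

  do k = 1, 72
    rr(k) = 1.0 - exp(-aa(k))      ! made-up math, just the shape of one k loop
  end do

  td(1) = rr(1)
  do k = 2, 72
    td(k) = td(k-1)*rr(k)          ! made-up example of the loop-carried k dependences the real code has
  end do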

My first naive thought was to create shared arrays that each thread references via threadidx%x. But since I’ve found my best performance starts at 64 threads per block, the naive way to do this is:

real, shared, dimension(72,64) :: rr_sh,td_sh,tt_sh,rs_sh,ts_sh

where I can access the shared array as “rr_sh(k,threadidx%x)”. Obviously, though, this won’t work: just one of these arrays (72 x 64 x 4 bytes = 18 kB) is already larger than the 16 kB of shared memory.

However, I keep thinking there must be a way to do this via judicious use of syncthreads and if-statements (loading per-block or some such) that I am just missing. Again, apologies if this is such a simple question you can’t believe I’m asking it, but porting this code to CUDA is a different beast from the usual cases where I can use blocked/tiled 16x16 or 256-element shared arrays.
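For example, is it something like this that I’m missing? (A rough sketch I haven’t tried, with a made-up chunk size kchunk; the idea being to process k in chunks small enough that a 64-wide tile fits in shared memory.)

  integer, parameter :: kchunk = 8                  ! made-up: 64*kchunk*4 B = 2 kB per tile
  real, shared, dimension(64,kchunk) :: aa_sh
  integer :: idx, k0, kk

  idx = (blockidx%x - 1) * blockdim%x + threadidx%x

  do k0 = 0, 72-kchunk, kchunk
    if (idx <= 2000) then
      do kk = 1, kchunk
        aa_sh(threadidx%x,kk) = pl(idx,k0+kk)       ! coalesced: idx is the fast index of pl
      end do
    end if
    call syncthreads()
    ! ... work on aa_sh(threadidx%x,1:kchunk) in place of aa(k0+1:k0+kchunk) ...
    call syncthreads()                              ! before the next chunk overwrites the tile
  end do

The catch, of course, is that this only helps for loops that touch one chunk of k at a time, which my dependent k loops may not allow.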

I will note, though, that for largish problems (do i=1,20000), I am able to get a 10-14x speedup using CUDA. So it’s not like this is intractable as it is. I’m just hoping to eke out more performance by using all the resources available.

Hi
Even if using shared memory means fewer threads per block, it might work out faster. There are lots of places where I don’t have all the threads in a block (or even a warp) doing something.

Who knows, for your app having only 8 working threads per block might be best if it means you can use shared memory. (But if you can, try to make sure you do coalesced reads when copying from global to shared, i.e. use at least 16 threads.)
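Back-of-the-envelope, assuming single precision and your five 72-element temporaries (just my arithmetic, not tested):

  ! 16 threads/block: 72*16*4 B = 4.5 kB per array, 22.5 kB for all five  -> still over 16 kB
  !  8 threads/block: 72* 8*4 B = 2.25 kB per array, 11.25 kB for all five -> fits
  real, shared, dimension(72,8) :: rr_sh, td_sh, tt_sh, rs_sh, ts_sh      ! indexed as rr_sh(k,threadidx%x)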

Cheers

Thanks for the advice, but I think the hurt from running 8 threads per block (versus 64 or 256) far outstrips any benefit from shared memory use. In fact, I made shared memory versions of 7 of those oft-accessed temporary variables and it actually got slower than the local memory 8 tpb version. I’m guessing that was due to bad reads?

I’m beginning to think that memory access is not the limiting factor to my speedup. Still, any other advice is welcomed!


Maybe Fermi-based boards, with their L1/L2 caches, will provide the speed boost you’re looking for (makes a Jedi mind-control gesture and takes commission from NVIDIA).

Oh, believe me, I am waiting for Fermi! The cached local memory might (hopefully, will) do wonders for this code. Of course, until then, there is still Tesla hardware to optimize for. I’m doing what I can with maxregcount, block sizes, and loop scheduling, but I thought maybe I could use the available shared memory with this code as well.
