Newbie - Need to use shared mem?

I have a large array and I’m going to be reading pairs of numbers and performing some operations. The elements of each pair could be anywhere in the array, they are not adjacent.

If I created a bunch of blocks and threads to perform each independent operation, will the GPU (GTX280) sequence the execution of the threads so that there may be a long delay initially (400cycles) for first reads from global mem, but after that, threads will be executed as the memory arrives with little incremental delay because the read was started previously? Or, do I need to manually move chunks of memory into shared memory for each block, perform operations, and then repeat until done?

I know how I can code it with shared memory if that’s what I need to do, but I would rather not get in the way of the hardware if it was already going to more efficiently stage the reading of memory and execution of code for each pair of elements.

Thanks in advance for your help.

Do you know the pairs you are making before runtime or is there an algo which can decide that… ? cause if there is then… you re-arrange for sequential access them and use a float2 pointer maybe to pair them up…



The pairs are known in advance. One element of the pair can be accessed sequentially but the 2nd element of the same array could be any element. It’s a neural network, here’s a simplified example (there is more to the control structures, but this gets the idea across):



NeuronConnections[0]=27 (neuron 27 is input to this (don’t worry about which one) neuron)

NeuronConnections[1]=25,235 (neuron 25,235 is input to this neuron)

NeuronConnections[2]=-1 (end of this neurons input)

This ain’t the Cell or the Larrabee (thank God). Just access global memory, and the hardware will take care of the rest. In Cell and Larrabee you in fact do need to set up DMA transfers into SRAM for latency to get hidden. In CUDA, massive hyperthreading hides it automatically.

Of course, issuing individual word-sized requests to DRAM will still be very inefficient, but if your data access is completely random then that’s how it is.

I love the Cell Alex, it gives you much more control, comprehensive tools, more robust. Hope I don’t get kicked out for saying this :)

Hmm, haven’t worked with it, and to be honest would not look forward to it. Much more control means in my book, much more work to do (amongst others when upgrading hardware). That is in my opinion the elegance of CUDA, as long as you have lots of blocks in your kernel call, the only thing to do is recompile…

What do you mean by more robust btw?

My original plan was to use the cell. It took a few years to get the code where it is (which is that it now needs some real speed to get through enough generations of creatures to see decent results). I figured I could buy 10 PS3’s, but when I looked at it in detail, it just wasn’t going to gain me enough due to the memory access issues (SIMD operates on adjacent memory only). I was about to buy a 4 proc multi-core xeon server when I stumbled on CUDA and noticed the relaxed adjacent memory rules. If this GTX280 works out I’m going to buy a few more or look into the higher end stuff.

I really have multiple arrays, and I can re-arrange the data stucture to improve processing. As it is right now, I have the following (all integers):

Array number 1 - Sequential access

Array number 2 (which is 10x array1) - Sequential access, but one thread might read 4 elements and the next thread might read 32, but still all sequential

Array number 3 (same size as array1) - random access

Array number 4 (same size as array1) - sequential acces write only

Now that I’ve got my test code working, I’m trying to gain some understanding so I can do some optimization. So here are a couple questions if you have the time:

  1. Array number 1 - If sequential per thread, CUDA will coalesce the reads into 128 bit reads, correct?

  2. Array number 2 - It’s sequential but not exactly in line with threads, will this all get coalesced or do I need to dig in and really understand the coalescing rules to make it perform well?

  3. Array number 3 - it’s random but each element will (possibly) get read multiple times by different threads. If I can get this entirely into shared mem or stage it in, then it seems like it would perform well. I’m thinking I might be able to have each thread read in X elements into an array in shared mem at the start of the kernel and then during the 2nd part of the kernel, each thread will be reading from that array in shared mem. But how do I create this array in shared mem that is shared across threads? if I define an array in the kernel, say “shared int TestArray[4096]”, it seems like that would be local to the thread, do I need to allocate the array outside of the kernel and pass in a pointer? If so, how do I tell it to allocate shared mem instead of global mem?

Thanks in advance for any help.

raftpeople, keep in mind the point regarding DRAM is not so much “sequential vs random”, in the traditional sense, but what happens during a single instruction (across the 32 threads of a warp).

Yes, on GT200. The card is flexible, it’ll see what accesses can be bunched up and bunch them. On older cards, no, you need threads to exactly line up.

Do you mean that in a given instruction, multiple threads will access the same location? Again, on GT200 this should be optimized as best as it can be. But if you mean during different instructions, then yeah, might think about shared mem.

I know, another user was also confused by this syntax, and I think I might have been too when I started out. No, any shared variable is truly shared across all threads in a block, strange as it may look in C.

liv, I’m sure the Cell does have better, more polished tools. (NVIDIA will get there too… some day. Though NVIDIA’s buggy tools are wonderfully user-friendly.) But architecturally it’s all wrong. The GPU model of massive hyperthreading is supreme. It acts to destroy the greatest problem of modern computing: latency. (Whether it be DRAM latency, pipeline latency, branch mispredict latency.) The other great innovation, SIMT (vs SIMD), is equally divine. The Cell was made by people who were stuck thinking “in the box” of traditional design. Then again, the Cell is obviously much better at serial code. (It tries to be parallel, but it actually shines at serial. It’s an odd hybrid, that.) But the sort of algorithms that can run on a GPU, run with breathtaking elegance.

Btw, E.D. Riedijk, you know better than anyone that there’s plenty to think about when optimizing for GPUs. But, yes, there’s much less to code. (E.g., you have to contemplate coalescing rules, but you don’t have to write and debug extra DMA code.) In the end, everything is under your control and deterministic.

So defining the array that way works (which is good), but I shouldn’t define temp/work vars as shared because they aren’t unique to thread. For example:

shared int tempVar;

tempVar=(big complex calculation * threadIdx.x etc. etc.)

Instead I should just define:

int tempVar;

and it will be placed in a register? I was worried about storing tempvars globally due to slow access.

Yes, exactly. Also, registers are slightly faster than shared memory, and more plentiful.

Great, thanks for your help.

I use the Cell in image processing, which is where I think it outperforms the GPUs. It was more work to get started, but the Cell programming model lends itself very well to “templateization”. One can guarantee concurrency between processing and memory transfers with multibuffered DMAs. With CUDA, I’ve had cases in which a minor tweak produced a significant increase or decrease in performance. Also, small changes in the data (like the data type) may require extensive changes in the memory access patterns to keep the GPU happy. The GPUs have a very low cost, which is why we ended up adopting them for our products.

lol, you mean “boilerplate code”?

That’s true. With everything automated, running with minimal code, but requiring the right code to run well, you get plenty of situations where a small change affects performance 10x. But it’s all deterministic, and there’s only a handful of factors to track.

GPUs are also much faster. They’re faster in theoretical FLOPs by a few times, but they’re much more efficient. Each instruction has 1-cycle latency, each read to SRAM has 1-cycle latency, even reads to DRAM have 1-cycle latency.

But the cost of 1-cycle DRAM latency is that you have to consider coalescing. Coalescing can break with small changes in code (especially on G80, G200 is much more flexible), and may require changes to the access pattern. But it gives you frikkin 1-cycle latency.

Anyway, your organization didn’t pick GPUs because they’re cheap. Don’t denigrate them like that. You picked GPUs because they’re the superior architecture. Not just for your application, but for everyone’s. GPUs will continue to get faster and better, to benefit from billions of dollars in investment, to be used by millions of people. Eventually, they will exist on the same die as the CPU. (AMD bet the company on it.) NVIDIA’s CUDA may be a stop-gap solution, but the GPU paradigms of parallel processing – SIMT + massive hyperthreading – are here to stay. The Cell is a dead-end. A serial architecture masquarading as a parallel innovation, taking the worst from both worlds.

OK, boilerplate code :)

It’s deterministic as long as the compiler/optimizer doesn’t get in your way. I often had to redesign a CUDA kernel to use fewer registers, and the result is never intuitive. The lack of tools is also a problem, one has to use someone’s personal project (decuda, kudos to his author) to see what’s really going on.

I found the Cell to be more of a “what-you-see-is-what-you-get” type of architecture.

In our apps, a PS3 Cell outperformed a G80 (Tesla C870) by factors ranging between 3x and 7x. I think one of the major differences was in the fact that our apps employ mostly large footprint operators which cannot reside in the shared memory at one time. I’m sure someone more experienced could squeeze a bit more throughput, but, as you put it, there’s only a handful of factors to track. It is also true that the G200 should close/eliminate the gap, but it’s the largest chip ever built. I’m not sure Nvidia can afford to take the brute-force approach (bigger is better) for much longer and needs to start reinventing. G200’s direct competitor, the RV770, is smaller and has more throughput.

Thanks for giving us credit Alex. To their merrit, Nvidia has a good upgrade roadmap, so an Nvidia GPU guarantees a passive avenue to performance improvements for one’s products. The Cell’s future generations seem to be based on niche spinoffs like Toshiba’s SpursEngine. However, it’s a chip designed for a gaming platform, and these have a longer shelf life than most consumer products. ON the other hand, CUDA seems to be a good launchpad to SIMT programming and the OpenCL looks a lot like CUDA. So you see, I’m not all negative about the GPU’s. But, wheather the Cell is a dead end or not, its paradigm (fewer compute units, but with much more comprehensive local resources) seems to have a new lease on life in the Larrabee. Now I’m really looking forward to that.

Trying to use fewer register, you’re right, is an opaque procedure. But it’s also with few benefits. It’ll never give you the sort of order-of-magnitude boost that fixing coalescing, bank conflicts, etc will, so I suggest focusing on those and not worrying about occupancy (unless it’s really low).

In any case, I’ll be the first to admit the tools are imperfect. We don’t even have a debugger! (Although emulation mode is a very clever workaround.)

Yes, OpenCL is almost the same thing. Hopefully we will soon have a vendor-neutral framework for programming GPUs, and one that has more polish than CUDA. Basic things like a debugger and line-by-line profiler will bring back the sort of control and transparancy you saw with IBM’s tools. I think the best thing about the GPU space is the competition and dynamism.

Using shared mem effectively is perhaps the greatest skill a CUDA programmer can master. If you don’t use it, you really can’t hope for good performance. Utilizing the puny 16KB well sometimes means conceiving really clever caching schemes. Also, using the larger 64KB register file as a cache is often an even better strategy compared to using smem. If you’d like, I can help your company figure these things out. I’m willing to bet [my wages] I can get you running faster than the Cell.

You’re observant. The larrabee is much more like the Cell than a GPU. And that’s why, as you might expect, i’m not too thrilled about it either ;) (But I expect Larrabee 2 will be much more like a true GPU)

Hey, at least RedHat 5 has a debugger on 32 bit…

Lets hope the other distributions and windows follow soon.

I have a quick question about memory that I thought I’d hitch onto this thread. What if I needed each thread to do operations on every piece of data in a large array (ie 512x512)? What would be the best approach memory-wise? Is putting the array in texture memory a good idea? Or maybe something with shared memory? If this isn’t a very simple question, I’ll make a new topic.

Each thread reads the entire array? You’ll certainly want to use shared memory. Texture memory probably won’t be useful (it helps in very limited circumstances).

You probably want to start your own topic, however.

If your threads go through that memory in the same way (they all access the same element at the same time in your code) shared memory is probably the best (just load 8 kB at a time and process it (or 12 kB, you cannot reach 16 kB because kernel parameters take up some space from the 16 kB)).

Texture memory looks also not so bad, but if I remember right texture memory has roughly the same latency as global memory (it just does not use up memory bandwidth) as far as I have read. At least, I think that was the final conclusion in a thread in the past. Shared memory has almost no latency.

If you can post some example code (probably in a new thread) people are likely able to say which approach is best.

It’s not so much that the texture cache has the same latency as DRAM, it has the same bandwidth. That’s the killer. You can access the tex cache simultaneously with reading DRAM and get “double the bandwidth” (say, 200 GB/s), but it’s still much less than the 1 TB/s of bandwidth you can easily extract from shared memory.

The real point of the texture cache is to side-step coalescing requirements. When you can’t generate coalesced reads, but there is still a lot of locality to your DRAM accesses, the cache will let you transfer from DRAM in blocks then transfer to registers in pieces, the same as what’s always been done on CPUs.