GT300 and CUDA 3.0

I read that GT300 will not only use cooler 40nm tech but will also have a new architecture: The GT300’s architecture will be based on a new form of number-crunching machinery. While today’s NVIDIA GPUs feature a SIMD (single instruction multiple data) computation mechanism, the GT300 will introduce the GPU to a MIMD (multiple instructions multiple data) mechanism. This is expected to boost the computational efficiency of the GPU many-fold. The ALU cluster organization will be dynamic, pooled, and driven by a crossbar switch. Once again, NVIDIA gets to drop clock speeds and power consumption, while achieving greater levels of performance than current-generation GPUs.

Does it change CUDA programming a lot?

I’m pretty sure I’ve heard nv employees say they don’t feed speculation on future products; usually this is disallowed by a company’s NDA (disclosure of non-public material information about future products).

Ask us again when we’ve announced a product. We’re not going to answer anything about anything unannounced.

I have read speculation about dynamic warp formation (that is how I remember it), where divergent warps get split up and a new warp is made out of the same-branch threads of old warps. This might invalidate some assumptions that we make now, I guess. But given that it will probably be late this year before we see any of this, I think everyone can sleep quietly at night for now ;)
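To make the regrouping idea above a bit more concrete, here is a rough host-side C++ sketch (plain CPU code, not a CUDA kernel, and not a description of how any NVIDIA hardware is known to work). It assumes a single two-way branch and an idealized regrouping step that can pool same-branch threads from any warp with no hardware constraints. The names `BranchMask`, `passes_without_regrouping` and `passes_with_regrouping` are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One entry per thread: true if the thread takes the branch, false otherwise.
using BranchMask = std::vector<bool>;

// Issue passes when every warp handles divergence on its own: a warp whose
// threads disagree has to run both sides of the branch, one after the other.
std::size_t passes_without_regrouping(const BranchMask& taken,
                                      std::size_t warp_width) {
    assert(warp_width > 0);
    std::size_t passes = 0;
    for (std::size_t base = 0; base < taken.size(); base += warp_width) {
        bool any_taken = false;
        bool any_not_taken = false;
        for (std::size_t i = base; i < taken.size() && i < base + warp_width; ++i) {
            if (taken[i]) {
                any_taken = true;
            } else {
                any_not_taken = true;
            }
        }
        passes += (any_taken ? 1 : 0) + (any_not_taken ? 1 : 0);
    }
    return passes;
}

// Issue passes under idealized regrouping (an assumption of this sketch):
// threads on the same side of the branch are pooled across all warps and
// repacked into as few warps as possible.
std::size_t passes_with_regrouping(const BranchMask& taken,
                                   std::size_t warp_width) {
    assert(warp_width > 0);
    std::size_t n_taken = 0;
    for (bool t : taken) {
        if (t) ++n_taken;
    }
    const std::size_t n_not_taken = taken.size() - n_taken;
    auto warps_needed = [warp_width](std::size_t n) {
        return (n + warp_width - 1) / warp_width;  // ceiling division
    };
    return warps_needed(n_taken) + warps_needed(n_not_taken);
}
```

For 64 threads in warps of 32 where even-numbered threads take the branch and odd-numbered ones do not, the first function counts 4 passes and the second counts 2. When every thread goes the same way, both count 2, which is why code that already avoids divergence would have nothing to gain from this.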

At a recent conference, it was prominently announced that NVIDIA Doesn’t Do Roadmaps. I suspect that this is due to the fact that GPUs are ‘trickling up’ to HPC from consumers, rather than high end hardware trickling down to consumers (as has happened with CPUs). People buying large supercomputers do so in the full knowledge that by the time their machine is fully operational, faster chips will be available. OTOH, teenaged boys (to use NVIDIA’s own description of their primary market :D ) will probably wait three months if they know a better card is coming out.

The nice part of this approach is that you don’t plan for PowerPointware (Itanium/Larrabee etc. - apologies for singling out Intel here). However, it does make drawing up plans for the future rather difficult.

It would be nice if NVIDIA would release more info on their upcoming GT300 processor. I would like to start designing my algorithms so they fit well on the new chip. I would also like to focus my “cuda/ptx” learning so that I don’t learn something that is going to be outdated in a few months.

Probably they will support slim MIMD at the GPU level and inter-multiprocessor messaging (just my expectation :-) ). However, the multiprocessor itself may still be SIMD to ensure compatibility with the current architecture. A complicated MIMD multiprocessor would turn the GPU into a de facto CPU and would definitely make it much more expensive.

I think you will be fine learning CUDA now on the current architectures. If anything, whatever kernels you build today will need little (if any) tweaking to also get good performance on the GT300 architecture. It is more likely that the new features there will allow CUDA to handle kernels that could not be efficiently written today due to various communication issues and such.

The hardest part of porting an app to CUDA has been, is, and will be actually parallelizing sections of your calculations. Once that is done, tuning for maximum performance on the device is relatively simple.

Yeah, if you drop the CUDA terminology for things, you realize that GT200 already has MIMD features. Using Larrabee-style terms, the GT200 has 30 RISC cores, each a 32-wide SIMD processor (implemented with 8 pipelined FPUs). Moreover, each core is 32-way hyperthreaded (since each active warp can be running a different instruction). The only thing missing is a ring bus for communication between the cores, and maybe some more cache.
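The arithmetic behind that description can be written down in a few lines of host-side C++. This is only a sketch: `ChipModel`, `kGt200Like` and the three helper functions are hypothetical names, and the per-core warp limit is passed in as a parameter rather than hard-coded.

```cpp
#include <cassert>

// A toy description of a GPU in "Larrabee-style" terms.
struct ChipModel {
    int cores;          // multiprocessors
    int simd_width;     // threads per warp
    int fpus_per_core;  // physical single-precision FPUs per multiprocessor
};

// The figures quoted in the post above: 30 cores, 32-wide SIMD, 8 FPUs each.
const ChipModel kGt200Like = {30, 32, 8};

// Total FPUs on the chip.
int total_fpus(const ChipModel& c) {
    return c.cores * c.fpus_per_core;
}

// Clock cycles needed to push one warp-wide instruction through the FPUs
// of a single core, assuming the warp is fed through in equal slices.
int cycles_per_warp_instruction(const ChipModel& c) {
    assert(c.fpus_per_core > 0 && c.simd_width % c.fpus_per_core == 0);
    return c.simd_width / c.fpus_per_core;
}

// Threads the whole chip can keep in flight for a given per-core warp limit.
int resident_threads(const ChipModel& c, int warps_per_core) {
    return c.cores * warps_per_core * c.simd_width;
}
```

With these numbers, `total_fpus(kGt200Like)` is 240 and `cycles_per_warp_instruction(kGt200Like)` is 4, so each "32-wide" instruction is really four passes over 8 lanes. `resident_threads` just multiplies the three factors together for whatever warp limit your chip has.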

The software side of CUDA hides the MIMD from you, by requiring your kernel to run on the entire chip, rather than some subset of it. If NVIDIA wants to make CUDA more MIMD-friendly, they already have the mechanism to do so: CUDA streams. Currently streams are really only good for overlapping computation and memory copies. However, if you could bind a CUDA stream to a particular number of multiprocessors, you could subdivide the GPU and more easily run completely different kernels on the different multiprocessors. Then all you need is a way to insert a “join” event into two streams for synchronization, and (hardware permitting) some sort of way to quickly exchange data between streams, bypassing global memory, and you have a winner.

I have no idea if this was the long term vision behind adding streams to CUDA, but the abstraction strikes me as an easy way to grow CUDA in a MIMD direction without huge changes to the CUDA programming model. (Insert usual disclaimer regarding speculation about someone else’s software here)

Here is a preview of what the actual silicon will look like. Note the fast communications rails all around the edges:

http://en.wikipedia.org/wiki/File:Disneyla…iew_in_1956.jpg

Whee! __syncthreads_with_roller_coaster()!

If you want to learn something that will be around for the GT300, it’s probably better to start learning OpenCL. If you want to program your GT200 or GT200b efficiently now, learn CUDA (it will help you with OpenCL anyway).

I am surprised at the number of people who pretend that programming efficiently with a pure SIMD style is easy. Of course, if you have poorly written CPU code, you will see amazing performance improvements. But if you have an already well-tuned program, it will require significant work on CUDA as well. For instance, read the paper by VVolkov at Supercomputing 08 and you will see that the kind of optimizations he does are not something that takes only a couple of hours to figure out.

And imho, if the GT300 were entirely MIMD, yes, it would break a lot of CUDA optimizations. I really hope that “SIMD at the lower level, with MIMD possibilities” is what is shipped in the final product.

Er, what? CUDA’s not going anywhere.

SIMD is clearly not the entire story (though it IS fun to parameterise an algorithm with the SIMD width of CUDA warps and then run it on the NEC SX9)

Until I’ve actually gotten my hands on a GT300 or whatever this unannounced product will be called, I don’t care. To quote David Kirk “we don’t do roadmaps”. Speculation is futile.

Pretty much official NVIDIA slides on CL vs CUDA are here: http://www.cse.unsw.edu.au/~pls/cuda-workshop09/
CL = driver API, C4CUDA = high-level
Another point is that the CUDA docs now officially use the language of “C API” vs “C++ high level API”. Might be accidental, but I’ve seen weaker evidence :)