NVISION highlights?

NVision started today, but most of the CUDA tech talks are Tuesday and Wednesday.
Anyone have any recommendations for the not-to-miss sessions?
Schedule is here: [url="http://speakers.nvision2008.com/agenda/"]http://speakers.nvision2008.com/agenda/[/url]

I’m particularly looking forward to the computational biology sessions… it’s not something I’ve worked with, so it’s a chance to finally learn something about it.

I will be going to the talks in SJCC J, as they seem most interesting. The keynote was fun. Unfortunately, raytracing on the GPU was an NVSCENE-only talk this morning; I was very interested in that…

The raytracing session is on Wednesday from 12:30 to 1:30.

It is indeed in the game track (why!?).

But most developers have signed up for the whole conference… with the free Quadro they give you, the signup fee isn’t so painful anymore.

No, I meant the raytracing-on-GPU talk from NVSCENE (really ;)), which was at 11:00 am. I was originally registered at the early-bird rate, but also happily paid the 200 extra :D

I will certainly be going to the talk Wednesday. It is an impressive demo, but what do you expect with 4x C1060, basically an S1070 worth of processing power…

I am getting a bit fed up with seeing slides that mention a debugger; I already saw slides mentioning a debugger last year. Maybe it is time we can actually use it? Interesting to hear, by the way, that CUDA 2.1 will have the multicore CPU compiler target. Also interesting that C++ is coming.

For those not at the technical keynotes on day 1, there were a few “teasers” listed for upcoming CUDA releases.

  1. nvcc --multicore for running CUDA on multi-core CPUs will be in CUDA 2.1. The wait for this release shouldn’t be too long…
  2. C++ code in kernels
  3. Fortran CUDA code (I don’t know if this is related to Flagon, or if NVIDIA is modding a FORTRAN compiler so that CUDA kernels can be written in FORTRAN… it was just a teaser on the timeline that wasn’t discussed).
  4. I forget the exact wording, but something like “Advanced multitasking” which indicates potentially multiple simultaneous kernel launches. Or perhaps it could mean concurrent CUDA and graphics use of the GPU.

And don’t forget the talk before David Kirk where someone was extrapolating GPU performance & coming up with 10+ TFLOPS & 1+ TB/s for 2013. I find the last one highly unlikely though…

Nvision is over… MisterAnderson, I met you for 1 minute but never found you again to pick your brain! Ah well.

Interesting conference… like any technical conference, some sessions I looked forward to were boring, other sessions I figured would be filler turned out to be very interesting. But most interesting overall was likely the networking, finding other tech geeks and arguing passionately about shared memory or register pressure workarounds.

One (nameless) Nvidia engineer and I had a discussion, he claimed that the G200 architecture not only doubled the number of registers, but also doubled their size to 64 bits, so you could (if you wanted to) use registers to hold more data than before, though you’d have to unpack them to actually access the two 32 bit ints or floats. I was surprised but also intrigued because every byte of storage can be very useful! But then I mentioned this to another Nvidia employee, he just laughed. I suspect that engineer #1 was confused.

A small tech detail I learned: the G80 and G200 hardware is capable of 16-thread warps. This is NOT exposed in CUDA, but IS used in the graphics drivers. It was abstracted away from CUDA because of the minor bang-for-buck impact and with an eye toward future hardware. Just my speculation, but using a smaller warp of 16 would probably cut the maximum number of threads in flight in half, increasing the chances of register stalls, latency issues, etc. The graphics drivers’ hardcore coders work at a finer level and can tweak their code to work around that, while for more general CUDA it’s better to use the larger warp size to make such hard-to-measure problems less common.

EDIT: at the computational biology roundtable, the fabled “real soon now” debugger was also discussed with longing desire.

Very fun conference, sometimes the smaller venues are the best places to meet and geek, SIGGRAPH is just too busy. Thanks to Nvidia for organizing it!

Woo! Sign me up!

Oh! That’s so nice of you! If possible, post the highlights of the rest of the sessions as well!!

C++ code in kernels – Hmm… Unless they add support for a “call” instruction, this would be a dream.

Even if the hardware does NOT support “call” – the compiler could still use a return-address register, probably coupled with the register-spill method from SPARC CPUs, to get this working!! They would need one return-address register (RAR) per warp, i.e. 24 RARs per MP for a call depth of 1. Taking 192 registers from the MP’s 8192-register pool, the compiler could directly support a call depth of 8 – which looks like a good enough amount! The total number of registers available for a kernel would then decrease to 8000 per MP!!! Further, if the number of bits required to store a return address is less than 32, this depth could be boosted even more!

By using #pragma to enable/disable this support - the compiler can still remain backward compatible!
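The register budget above works out like this in plain C++ (the constants are the ones quoted in the post; the call-depth scheme itself is pure speculation):

```cpp
// Back-of-the-envelope numbers from the post above (G80-class hardware):
// 24 warps per multiprocessor (MP), 8192 registers per MP.
const int kWarpsPerMP     = 24;
const int kRegistersPerMP = 8192;

// Registers consumed by return-address registers (RARs):
// one RAR per warp per level of call depth.
int rarCost(int callDepth) { return kWarpsPerMP * callDepth; }

// Registers left over for the kernel itself.
int registersLeft(int callDepth) { return kRegistersPerMP - rarCost(callDepth); }
```

So rarCost(8) gives the 192 registers and registersLeft(8) the 8000 remaining registers mentioned above.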

But still, I just can’t visualize C++ in CUDA kernels… What will it look like?

cuda_kernel_object.launch() ???
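Purely idle speculation, continuing that guess (every name here is invented; none of this is a real NVIDIA API):

```cpp
// Entirely hypothetical sketch -- none of these types or methods exist.
// Maybe a kernel becomes an object whose operator() is the device entry point?
struct SaxpyKernel {
    float alpha;
    __device__ void operator()(int i, const float* x, float* y) const {
        y[i] = alpha * x[i] + y[i];
    }
};

// ...with some host-side launch wrapper:
// cuda_kernel_object<SaxpyKernel> k{2.0f};
// k.launch(grid, block, x_dev, y_dev);
```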

I don’t have any idea what it will look like either. I guess we’ll just have to wait and see.

Highlights of other sessions? There is way too much to even remember: most of it detailed information about the speaker’s specific research.

Let’s see: there was John Stone on computational biology. He talked about various methods for computing energy on a grid and their tradeoffs and performance.

I forget most of the names, but next was a talk on using CUDA for astronomy simulations from Fortran. This was from the group that developed Flagon. They took some extremely complicated tree-based algorithms and mapped them to CUDA, getting ~30x speedups (although that part wasn’t done in Fortran; it was a custom kernel).

Another talk was on FLAME, which is aiming to be a LAPACK replacement. They are doing some very cool work where any “matrix algorithm” is expressed in a language- and platform-independent manner. Then FLAME steps in and maps it to whatever hardware you want to run on. The talk was mainly about the CUDA implementation of FLAME, which achieves a faster SGEMM than CUBLAS by blocking the calculation and using CUBLAS’s SGEMM on the small blocks. The performance of several higher-level matrix routines, such as Cholesky, was also discussed.
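The blocking idea can be sketched in plain C++. This is only an illustration of the partitioning, not FLAME’s actual code; in the real thing each inner block multiply would be a call to a tuned SGEMM such as CUBLAS’s:

```cpp
#include <vector>

// Sketch of blocked GEMM: compute C += A * B for n x n row-major matrices
// by partitioning into bs x bs blocks. Each inner block multiply is where
// a tuned SGEMM (e.g. CUBLAS's) would be invoked on the sub-matrices.
void blockedGemm(const std::vector<float>& A, const std::vector<float>& B,
                 std::vector<float>& C, int n, int bs) {
    for (int bi = 0; bi < n; bi += bs)
        for (int bj = 0; bj < n; bj += bs)
            for (int bk = 0; bk < n; bk += bs)
                // Block multiply: C[bi.., bj..] += A[bi.., bk..] * B[bk.., bj..]
                for (int i = bi; i < bi + bs && i < n; ++i)
                    for (int j = bj; j < bj + bs && j < n; ++j) {
                        float acc = 0.0f;
                        for (int k = bk; k < bk + bs && k < n; ++k)
                            acc += A[i * n + k] * B[k * n + j];
                        C[i * n + j] += acc;
                    }
}
```

The win comes from choosing bs so each block multiply hits the fast path of the underlying SGEMM.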

Umm, there was another very good talk on the details of floating-point precision and techniques for using mixed precision in GPU kernels to keep all the accuracy with little or no performance penalty.
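One classic trick in this family (my own example, not necessarily anything from the talk itself) is compensated (Kahan) summation, which recovers most of the accuracy lost by naive single-precision accumulation without resorting to double:

```cpp
// Kahan (compensated) summation: accumulate floats while carrying a
// correction term for the rounding error, recovering accuracy that a
// naive single-precision sum would lose. A classic extended-precision
// trick; shown here as an example of the genre.
float kahanSum(const float* x, int n) {
    float sum = 0.0f, c = 0.0f;  // c holds the running rounding error
    for (int i = 0; i < n; ++i) {
        float y = x[i] - c;      // corrected next term
        float t = sum + y;       // low-order bits of y may be lost here...
        c = (t - sum) - y;       // ...but are recovered into c
        sum = t;
    }
    return sum;
}
```

For example, adding 1.0f a thousand times onto 1e8f naively loses every single addition (1 is below the rounding granularity at 1e8), while the compensated version lands on the exact answer.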

Another dense talk, which I didn’t understand much of at all, discussed optimizing a 3D FFT calculation. I think it was 3x faster than CUFFT, but my memory could be off. Later in that talk, they discussed the details of a huge GPU cluster they are building (100’s and 100’s of Tesla GPUs) using Windows Compute Server 2008 (ick!). Of all the GPU clusters I’ve seen mentioned, this is certainly the biggest.

What am I missing… Someone from NVIDIA gave a talk on computer vision using CUDA. The biggest thing I remember about that talk was that automotive companies are looking into putting many cameras on all sides of the vehicle and using CUDA to process the data. One of the things they can do with it is warn the driver of an impending collision.

I think I’m still missing a talk or two.

Let’s see: the Tokyo Institute of Technology talk (which had the fast 3D FFT subtalk) was also interesting in that their supercomputer (Tsubame) gets most of its flop speed from Clearspeed accelerators. The speaker pretty much said that GPUs crush those 2-year-old accelerators in both performance and price, by more than an order of magnitude each. It gave me the impression that world-class top-500 supercomputers may be designed as custom nodes with loads of bandwidth and RAM, but then use GPU “boosters” inside each node.

Vijay Pande (Folding @ Home) was great… clearly a really smart guy, good speaker. He gave a short talk before a roundtable. All of the 2,000,000+ (?) CPU Folding@home nodes are SUPER impressive. But now they’re really quite marginal. He didn’t outright say it, but his slides showed just how dominant the GPU and Cell nodes were, not in number of users, but in total work done.

From memory, about 50% of the F@H work is done on GPU now, 35% on Cell (PS3), and 15% on CPUs. And he was especially excited about GPU not because it was better than Cell, but because everyone had a graphics card already and he was expecting to tap into those millions of cards.

Again from memory, he thought that if everyone who was using a CPU for F@H now just used a mid-level GPU, they’d pass 1000 petaflops. (hmm, my memory is vague, it was something about expecting 1000 petaflops.)

Other trivia: apparently one bottleneck on x86 CPUs was the speed of transcendentals… like pow() and erfc()! Apparently this comes up a lot in electron orbital potential calculations, and the GPUs were much better at it than CPUs. It wasn’t a trivial effect; it was something like a net 2x speedup just from transcendental speed. This also explains how F@H seems to get especially large boosts from the GPU (~100x faster than the CPU): it’s not just the high-flop GPU performance, but also that an erfc() flop is cheaper on the GPU than an x86 erfc() flop. Interesting how you can get such specific bottlenecks.

Well, thanks! Always good to hear someone finds Robert’s and my work useful in some sense :)

I’ve failed ridiculously at PDF’ing the slides, so the result looks partially crap and is no longer printable, but: an extended set of slides from our talk, with 60 minutes’ worth of content (many blanked-out slides included), is available here:



Yep, I’ll definitely be looking into implementing some mixed precision in my application. The test systems I’m running now don’t need it, but it is conceivable that some systems could result in the accumulation of large numbers added to small ones.

Robert also came to my poster and we had a good discussion on precision in chaotic systems like molecular dynamics.

The point that didn’t come across in our talk is that an experimental evaluation of mixed precision is ridiculously trivial: in templated C++, this comes for free; in plain C (or Fortran, which I use for historical reasons), a bit of admittedly hacky preprocessor magic does the trick.
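For instance (my own sketch, not the actual solver code), with templates the working and accumulation precisions are just type parameters, so switching a routine between pure single, pure double, or mixed precision is a one-line change at the call site:

```cpp
#include <cstddef>

// Precision as a template parameter: T is the storage type, Acc the
// accumulation type. dot<float> is pure single precision;
// dot<float, double> keeps single-precision data but accumulates in double.
template <typename T, typename Acc = T>
T dot(const T* x, const T* y, std::size_t n) {
    Acc acc = Acc(0);
    for (std::size_t i = 0; i < n; ++i)
        acc += Acc(x[i]) * Acc(y[i]);   // accumulate in Acc precision
    return T(acc);
}
```

The same source instantiates every variant, which is exactly why experimenting with mixed precision is free in templated C++.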

I don’t know what Robert said in detail, but vector accumulations are not the only factor here. My multigrid solver uses NO norm/dot calculation at all under laboratory conditions.

Did Vijay talk about quantum chemistry calculations or classical things? Didn’t know F@H could do quantum stuff…

Actually Vijay talked a little about F@H, a little about molecular simulation, a little about CUDA. There was an Nvidia engineer there (sorry I didn’t catch the name) who did much of the F@H CUDA work, he was more specific about bottlenecks.

Vijay probably was talking about OpenMM even more than F@H. The transcendental comments came from some discussion in one of his slides, where he showed that throughput was about 400 Mflops on a 280GTX, “But if you want to compare that to x86 flops, it’d be 2.4 Tflops”… a nomenclature convention, because the transcendentals are 20x slower on x86 than a multiply. (Sorry if I get those numbers wrong; this is from memory again, since I was just there to learn a little, it’s not my usual topic.)

Unfortunately I missed you; I haven’t had the pleasure of meeting you.

About the debugger: apparently NVIDIA has had a debugger for a while now, but it only goes down too deep, so they have/had to adjust it to stop at the PTX level. Also, another company or two have been hired by NVIDIA to make a visual UI, since the NVIDIA one is just gdb-based. That is why it is taking longer than they keep ‘promising’.

I personally did not go to the roundtables, but went to the talk about the raytracer. In short: it uses something like a hundred or more rays per pixel, with a BVH as the overall structure and the contents of the BVH being either BVHs or KD-trees, although the demo is BVH-only.

They use short-stack and there is an SDK coming so people can use the raytracer.

The guys from Jacket also gave a talk and demo, and they look to be heading in a very good direction: an easy transition from Matlab to Matlab-on-GPU, extensive (realtime) visualization options, and access to ALL of OpenGL from Matlab. And they are working on letting you write C code like Matlab code (so no indexing funniness needed). They definitely have something potentially very useful, as a lot of Matlab programmers are not very good C programmers (and the Matlab-C interface is not so easy/flexible).

Well, jetlagged as I am right now, I will have to check my notes next week to see what I forgot. I definitely liked NVISION; I liked the fact that it was not really set up as a bang-ourselves-on-the-chest show from NVIDIA, but was indeed much broader, with lots of other companies doing nice things.

Also I finally understand how a GPU works after the demo by the Mythbusters ™ ;)

When going through my pictures I noticed that nvcc --multicore is supposed to output multi-threaded C code as an intermediate. That would be very nice if people want to tweak that code afterwards.

I don’t think it’s really intended for human consumption, but sure, in theory, you could certainly tweak it.

What part of C++ is missing from CUDA kernels? Meaty stuff like member functions, operator overloading, and templates is already there. I know exception handling is a big missing piece for getting Boost working. Some things with templates may also be missing. A big hurdle is that pointer memory-space inference gets tripped up (“don’t know what pointer points to, assuming global mem”). Overall, though, the current level of C++ is pretty workable.

EDIT: Oh right, I guess new and delete might be useful :">
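For what it’s worth, the flavor of C++ that already works, per the above, is ordinary value types with operator overloading plus templates (shown here as plain host C++; in device code the methods would just carry __device__ qualifiers):

```cpp
// Small value type with operator overloading, usable in kernels per the
// post above (add __device__ to the methods for device code).
struct vec2 {
    float x, y;
    vec2 operator+(const vec2& o) const { return {x + o.x, y + o.y}; }
    vec2 operator*(float s) const { return {x * s, y * s}; }
};

// A templated helper: linear interpolation between two values.
template <typename V>
V lerp(const V& a, const V& b, float t) {
    return a * (1.0f - t) + b * t;
}
```

No exceptions, no new/delete, and no calls through unknown pointers, which matches the missing pieces listed above.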

That sounds reasonable (assuming two GPUs per card becomes standard, with more circuits dedicated to flops like in ATI chips). But GPUs will be among the first to feel the end of Moore’s Law around 2013, so that’ll sort of be as good as it gets.