NVISION highlights?

NVision started today, but most of the CUDA tech talks are Tuesday and Wednesday.
Anyone have any recommendations for the not-to-miss sessions?
Schedule is here: http://speakers.nvision2008.com/agenda/

I’m particularly looking forward to the computational biology sessions… it’s not something I’ve worked with, so it’s a chance to finally learn something about the field.

I will be going to the talks in SJCC J, as they seem the most interesting. The keynote was fun. Unfortunately, raytracing on the GPU was an NVSCENE-only talk this morning; I was very interested in that…

The raytracing session is on Wednesday from 12:30 to 1:30.

It is indeed in the game track (why!?).

But most developers have signed up for the whole conference… with the free Quadro they give you, the signup fee isn’t so painful anymore.

No, I meant the raytracing on GPU from NVSCENE (really ;)); it was at 11:00 am. I was originally registered at the early-bird rate, but happily paid the 200 extra :D

I will certainly be going to the talk Wednesday. It is an impressive demo, but then, what do you expect with 4x C1060, so basically an S1070 worth of processing power…

I am getting a bit fed up with seeing slides that mention a debugger; I already saw slides mentioning a debugger last year. Maybe it is time we could actually use it? Interesting to hear, by the way, that CUDA 2.1 will have the multicore CPU compiler target. Also interesting that C++ is coming.

For those not at the technical keynotes on day 1, there were a few “teasers” listed for upcoming CUDA releases.

  1. nvcc --multicore for running CUDA on multi-core CPUs will be in CUDA 2.1. The wait for this release shouldn’t be too long…
  2. C++ code in kernels
  3. Fortran CUDA code (I don’t know if this is related to Flagon, or if NVIDIA is modding a FORTRAN compiler so that CUDA kernels can be written in FORTRAN… it was just a teaser on the timeline that wasn’t discussed).
  4. I forget the exact wording, but something like “Advanced multitasking” which indicates potentially multiple simultaneous kernel launches. Or perhaps it could mean concurrent CUDA and graphics use of the GPU.

And don’t forget the talk before David Kirk where someone was extrapolating GPU performance & coming up with 10+ TFLOPS & 1+ TB/s for 2013. I find the last one highly unlikely though…

Nvision is over… MisterAnderson, I met you for 1 minute but never found you again to pick your brain! Ah well.

Interesting conference… like any technical conference, some sessions I looked forward to were boring, other sessions I figured would be filler turned out to be very interesting. But most interesting overall was likely the networking, finding other tech geeks and arguing passionately about shared memory or register pressure workarounds.

One (nameless) Nvidia engineer and I had a discussion; he claimed that the G200 architecture not only doubled the number of registers but also doubled their size to 64 bits, so you could (if you wanted to) use registers to hold more data than before, though you’d have to unpack them to actually access the two 32-bit ints or floats. I was surprised but also intrigued, because every byte of storage can be very useful! But when I mentioned this to another Nvidia employee, he just laughed. I suspect that engineer #1 was confused.

A small tech detail I learned: the G80 and G200 hardware is capable of 16-thread warps. This is NOT exposed in CUDA, but IS used in the graphics drivers. It was abstracted away from CUDA because of the minor bang-for-buck impact and with an eye to future hardware. Just my speculation: using a smaller warp of 16 would probably cut the maximum number of threads in flight in half, increasing the chances of register stalls, latency issues, etc. The hardcore graphics-driver coders work at a finer level and can tweak their code to work around that, but for more general CUDA it’s better to use the larger warp size to make such hard-to-measure problems less common.

EDR: at the computational biology roundtable, the fabled “real soon now” debugger was also discussed with longing desire.

Very fun conference, sometimes the smaller venues are the best places to meet and geek, SIGGRAPH is just too busy. Thanks to Nvidia for organizing it!

Woo! Sign me up!

Oh! That’s so nice of you! If possible, post the highlights of the rest of the sessions as well!!

C++ code in kernels – Hmm… Unless they give support for a “call” instruction, this would be a dream.

Even if the hardware does NOT support “call”, the compiler could still use a return-address register, probably coupled with the register-spill method of SPARC CPUs, to get this working!! It would need one return-address register (RAR) per warp: 24 RARs for 1 MP, i.e. 24 registers to support a call depth of 1 per MP. Taking 192 registers from the MP’s pool of 8192, the compiler could directly support a call depth of 8 – which looks like a good enough amount! The total number of registers available for a kernel would then drop to 8000 per MP. Further, if the number of bits required to store a return address is less than 32, this value could be boosted even higher!
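Spelled out as arithmetic (same numbers as above, nothing new):

```latex
\[
\underbrace{24}_{\text{warps/MP}} \times \underbrace{8}_{\text{call depth}} \times 1\ \text{RAR}
  = 192\ \text{registers}
\qquad\Rightarrow\qquad
8192 - 192 = 8000\ \text{registers/MP left for the kernel.}
\]
```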

By using a #pragma to enable/disable this support, the compiler could still remain backward compatible!

But still, I just can’t visualize C++ in CUDA kernels… What will it look like?

cuda_kernel_object.launch() ???

I don’t have any idea what it will look like either. I guess we’ll just have to wait and see.

Highlights of the other sessions? There is way too much to even remember: most of it was detailed information about each speaker’s specific research.

Let’s see: there was John Stone on computational biology. He talked about various methods for computing energy on a grid and their tradeoffs and performance.

I forget most of the names, but next was a talk on using CUDA from Fortran for astronomy simulations, from the group that developed Flagon. They took some extremely complicated tree-based algorithms and mapped them to CUDA, getting ~30x speedups (although that part wasn’t done in Fortran; it was a custom kernel).

Another talk was on FLAME, which is aiming to be a LAPACK replacement. They are doing some very cool work where any “matrix algorithm” is expressed in a language- and platform-independent manner; FLAME then steps in and maps it to whatever hardware you want to run on. The talk was mainly about the CUDA implementation of FLAME, which has a faster SGEMM than CUBLAS, achieved by blocking the calculation and using CUBLAS’s SGEMM on the small blocks. The performance of several higher-level matrix routines, such as Cholesky, was also discussed.
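Not their code, just my rough sketch of what “block the calculation and call CUBLAS’s SGEMM on the small blocks” looks like structurally. Square n x n column-major matrices already on the device, n divisible by the block size nb, and the legacy CUBLAS API (cublasInit() already called) are all my assumptions; their tuned version obviously does more than this naive triple loop.

```c
#include <cublas.h>

/* C += A * B, done block by block with CUBLAS on each nb x nb tile. */
void blocked_sgemm(const float *d_A, const float *d_B, float *d_C, int n, int nb)
{
    for (int j = 0; j < n; j += nb)              /* block column of C  */
        for (int i = 0; i < n; i += nb)          /* block row of C     */
            for (int k = 0; k < n; k += nb)      /* inner block index  */
                cublasSgemm('N', 'N', nb, nb, nb,
                            1.0f,
                            d_A + k * n + i, n,  /* block A(i,k)       */
                            d_B + j * n + k, n,  /* block B(k,j)       */
                            1.0f,                /* accumulate into C  */
                            d_C + j * n + i, n); /* block C(i,j)       */
}
```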

Umm, there was another very good talk on the details of floating-point precision and techniques for using mixed precision in GPU kernels to keep all the accuracy with little or no performance penalty.

Another dense talk, which I didn’t understand much of at all, discussed optimizing a 3D FFT calculation. I think it was 3x faster than CUFFT, but my memory could be off. Later in that talk, they discussed the details of a huge GPU cluster they are building (100’s and 100’s of Tesla GPUs) using Windows compute server 2008 (ick!). Of all the GPU clusters I’ve seen mentioned, this is certainly the biggest.

What am I missing… Someone from NVIDIA gave a talk on computer vision using CUDA. The biggest thing I remember about that talk was that automotive companies are looking into putting many cameras on all sides of the vehicle and using CUDA to process the data. One of the things they can do with it is warn the driver of an impending collision.

I think I’m still missing a talk or two.

Let’s see, the Tokyo Institute of Technology talk (which had the fast 3D FFT subtalk) was also interesting in that their supercomputer (Tsubame) gets most of its flop speed from ClearSpeed accelerators. The speaker pretty much said that GPUs crush those two-year-old accelerators in both performance and price, by more than an order of magnitude each. It gave me the impression that world-class top-500 supercomputers may be designed as custom nodes with loads of bandwidth and RAM, but then use GPU “boosters” inside each node.

Vijay Pande (Folding @ Home) was great… clearly a really smart guy, good speaker. He gave a short talk before a roundtable. All of the 2,000,000+ (?) CPU Folding@home nodes are SUPER impressive. But now they’re really quite marginal. He didn’t outright say it, but his slides showed just how dominant the GPU and Cell nodes were, not in number of users, but in total work done.

From memory, about 50% of the F@H work is done on GPU now, 35% on Cell (PS3), and 15% on CPUs. And he was especially excited about GPU not because it was better than Cell, but because everyone had a graphics card already and he was expecting to tap into those millions of cards.

Again from memory, he thought that if everyone who was using a CPU for F@H now just used a mid-level GPU, they’d pass 1000 petaflops. (hmm, my memory is vague, it was something about expecting 1000 petaflops.)

Other trivia: apparently one bottleneck on x86 CPUs was the speed of transcendentals… like pow() and erfc()! Apparently this comes up a lot in electron orbital potential calculations, and the GPUs were much better at it than CPUs. It wasn’t a trivial effect; it was something like a net 2x speedup just from transcendental speed. This also explains how F@H seems to get an especially good boost from the GPU (~100x faster than the CPU): it’s not just the high-flop GPU performance, but also that an erfc() flop is cheaper on the GPU than an x86 erfc() flop. Interesting how you can get such specific bottlenecks.

Well, thanks! Always good to hear someone finds Robert’s and my work useful in some sense :)

I’ve failed ridiculously at PDF’ing the slides, so the result looks partially crap and is no longer printable, but: an extended set of slides from our talk, with 60 minutes’ worth of content (many blanked-out slides included), is available here:

http://www.mathematik.tu-dortmund.de/~goed…dprecision.html

dom

Yep, I’ll definitely be looking into implementing some mixed precision in my application. The test systems I’m running now don’t need it, but it is conceivable that some systems could result in the accumulation of large numbers added to small ones.

Robert also came to my poster and we had a good discussion on precision in chaotic systems like molecular dynamics.

The point that didn’t come across in our talk is that an experimental evaluation of mixed precision is ridiculously trivial: in templated C++, this is for free; in plain C (or Fortran, which I have been using historically), a bit of admittedly hacky preprocessor magic does the trick.
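As a toy illustration of the “free in templated C++” point (my own minimal sketch, not code from the talk): the working precision and the accumulation precision are just template parameters, so switching variants is a one-line change at the call site.

```cpp
// Working type and accumulation type are independent template parameters.
template <typename WorkT, typename AccT>
AccT dot(const WorkT *x, const WorkT *y, int n)
{
    AccT sum = AccT(0);
    for (int i = 0; i < n; ++i)
        sum += AccT(x[i]) * AccT(y[i]);   // accumulate in the (possibly wider) type
    return sum;
}

// float data, double accumulation:  dot<float, double>(x, y, n);
// pure single precision:            dot<float, float>(x, y, n);
```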

I don’t know what Robert said in detail, but: vector accumulations are not the only factor here. My multigrid solver uses NO norm/dot calculation at all under laboratory conditions.

Did Vijay talk about quantum chemistry calculations or classical things? Didn’t know F@H could do quantum stuff…

Actually, Vijay talked a little about F@H, a little about molecular simulation, and a little about CUDA. There was an Nvidia engineer there (sorry, I didn’t catch the name) who did much of the F@H CUDA work; he was more specific about the bottlenecks.

Vijay probably was talking about OpenMM even more than F@H. The transcendental comments came from some discussion in one of his slides, where he showed that throughput was about 400 Mflops on a 280GTX, “but if you want to compare that to x86 flops, it’d be 2.4 Tflops”… a nomenclature convention because the transcendentals are 20x slower on x86 than a multiply. (Sorry if I get those numbers wrong; this is from memory again, since I was just there to learn a little and it’s not my usual topic.)

Unfortunately I missed you, haven’t had the pleasure to meet you.

About the debugger: apparently NVIDIA has had a debugger for a while now, but it goes down too deep, so they have/had to adjust it to stop at the PTX level. Also, another company or two have been hired by NVIDIA to make a visual UI, since the NVIDIA one is just gdb-based. That is why it is taking longer than they always ‘promise’.

I personally did not go to the roundtables, but went to the talk about the raytracer. In short: it uses something like a hundred or more rays per pixel and a BVH as the overall structure, with the contents of the BVH being either BVHs or kd-trees, although the demo is BVH-only.

They use short-stack traversal, and there is an SDK coming so people can use the raytracer.

The guys from Jacket also gave a talk and demo, and they look to be heading in a very good direction: an easy transition from matlab to matlab on the GPU, extensive (realtime) visualization options, and access to ALL of OpenGL from matlab. And they are working on writing C code like matlab code (so no indexing funniness needed). They definitely have something potentially very useful, as a lot of matlab programmers are not very good C programmers (and the matlab-to-C interface is not so easy/flexible).

Well, jetlagged as I am right now, I will have to check my notes next week to see what I forgot about. I definitely liked NVISION; I liked the fact that it was not really set up as a bang-ourselves-on-the-chest show from NVIDIA, but was much broader, with lots of other companies that do nice things.

Also I finally understand how a GPU works after the demo by the Mythbusters ™ ;)

When going through my pictures I noticed that nvcc --multicore is supposed to output multi-threaded C code as an intermediate. That would be very nice if people want to tweak that code afterwards.

I don’t think it’s really intended for human consumption, but sure, in theory, you could certainly tweak it.
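Purely my own speculation about what that intermediate code might look like conceptually (nothing from the slides, and the names below are made up): the kernel body becomes an inner loop over a block’s threads, and the blocks get farmed out over the CPU cores, e.g. with OpenMP.

```c
/* Hypothetical shape of "multi-threaded C as an intermediate" for a trivial
 * vector-add kernel. Compile with an OpenMP flag (e.g. -fopenmp) to get the
 * multi-core behaviour. */
static void vec_add_body(const float *a, const float *b, float *c, int n,
                         int blockIdx_x, int threadIdx_x, int blockDim_x)
{
    int i = blockIdx_x * blockDim_x + threadIdx_x;   /* the original kernel body */
    if (i < n)
        c[i] = a[i] + b[i];
}

void vec_add_multicore(const float *a, const float *b, float *c, int n,
                       int gridDim_x, int blockDim_x)
{
    #pragma omp parallel for                         /* one block per iteration, spread over cores */
    for (int bx = 0; bx < gridDim_x; ++bx)
        for (int tx = 0; tx < blockDim_x; ++tx)      /* serialize the block's threads */
            vec_add_body(a, b, c, n, bx, tx, blockDim_x);
}
```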

What part of C++ is missing from CUDA kernels? Meaty stuff like member functions, operator overloading, and templates are already there. I know exception handling is a big missing piece for getting Boost working, and some template features may also be missing. A big hurdle is that pointer memory-space inference gets tripped up (“don’t know what pointer points to, assuming global mem”). Overall, though, the current level of C++ is pretty workable.

EDIT: Oh right, I guess new and delete might be useful :">
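For a concrete sense of what already compiles in device code today (my own toy, not from any talk or SDK sample): a struct with a constructor and operator overloading, used from a kernel templated on the element type.

```cpp
// Small value type with device-side constructor and operator overloading.
struct Complexf {
    float re, im;
    __device__ Complexf(float r = 0.0f, float i = 0.0f) : re(r), im(i) {}
    __device__ Complexf operator+(const Complexf &o) const {
        return Complexf(re + o.re, im + o.im);
    }
};

// Kernel templated on the element type; '+' resolves to the overload for class types.
template <typename T>
__global__ void add_arrays(const T *a, const T *b, T *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

// Same kernel works for built-in and class element types:
//   add_arrays<float>    <<<grid, block>>>(d_fa, d_fb, d_fc, n);
//   add_arrays<Complexf> <<<grid, block>>>(d_ca, d_cb, d_cc, n);
```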

That sounds reasonable (assuming two GPUs per card becomes standard, plus more circuits dedicated to flops, as in the ATI chips). But GPUs will be among the first to feel the end of Moore’s Law in ~2013, so that’ll sort of be as good as it gets.