NVISION highlights?

It was per GPU as far as I know. I can see it happening for the compute part, but 1+ TB/s of memory bandwidth? I had the impression that bandwidth is far more difficult to scale up.

Are you talking about “CUDA kernels” or “host CPU functions of CUDA”?

I think someone already did some nice stuff for the host CPU functions… like “devptr>>hostptr” would copy from device to host and so on…

That part is ok.

I was thinking about C++ inside CUDA kernels… (although my earlier comment is not exactly suggestive of that…)

Even if it’s not officially supported, C++ has worked pretty well inside kernels ever since CUDA 1.0.

Only the “dynamic” parts don’t work (virtual functions, dynamic allocation, etc.), just as recursion, function pointers, setjmp/longjmp, and malloc aren’t supported in device C.

There are just a few issues, like use of shared memory in member functions, template member functions of non-template classes, and incompatibilities with CPU emulation.

All of these can be worked around for now, and should be fairly easy for NVIDIA to fix when it provides official support.

The most serious one is, as Alex mentioned, the lack of an explicit global/shared/local qualifier for pointers; the compiler often fails to infer the memory space by itself.
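Here is a contrived example of the kind of code where the inference breaks down (the function names are made up just for illustration):

// The same __device__ helper receives a pointer into shared memory at one
// call site and into global memory at another. The parameter carries no
// address-space qualifier, so the compiler has to work out on its own where
// "p" points, and with both call sites present it may warn or generate
// suboptimal code.
__device__ void accumulate(float *p, float x)
{
    *p += x;
}

__global__ void qualifier_demo(float *g_data)
{
    __shared__ float s_buf[256];
    s_buf[threadIdx.x] = 0.0f;

    accumulate(&s_buf[threadIdx.x], 1.0f);    // pointer into shared memory
    accumulate(&g_data[threadIdx.x], 1.0f);   // pointer into global memory
}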

I was able to port a part of Boost (the Interval library), and I’m currently working on a CUDA-specific C++ library to handle things like reductions, matrix block transpose, transparent coalesced reads/writes, data structures like stacks or queues in shared and global memory, etc. Both rely heavily on template metaprogramming to generate optimized device code.
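As a trivial illustration of the kind of compile-time computation this relies on (not actual library code), here is a metafunction that evaluates log2(N) during compilation, so sizes and loop counts derived from it cost nothing at run time:

template<int N>
struct StaticLog2
{
    static const int value = 1 + StaticLog2<N / 2>::value;
};

template<>
struct StaticLog2<1>       // recursion terminator
{
    static const int value = 0;
};

__global__ void log2_demo(int *out)
{
    // StaticLog2<256>::value is the constant 8, folded at compile time.
    out[threadIdx.x] = StaticLog2<256>::value;
}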

Lack of exception handling in host code is a problem too (you can put the code that uses exceptions in a separate C++ file, but then you can’t use templates.)

Sylvain

So, does inheritance and all that work? I’m just curious… Can you post some sample CUDA code in C++?

Thanks.

Oh! I just remembered one more juicy bit. Someone from NVIDIA (who will remain unnamed so he doesn’t get a flood of PMs begging for info) mentioned that he was fairly certain that the CUDA documentation was being rewritten/reorganized.

I don’t suppose he gave any dates?

Inheritance works, but not runtime polymorphism (it would require virtual functions, which are essentially function pointers).

But I don’t think this kind of dynamic OO code is so useful for HPC.

The kind of C++ feature I like better is the ability to perform a lot of work at compile-time using template metaprogramming.
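As a toy illustration of what I mean (my own example, not library code): with the curiously recurring template pattern you keep inheritance and a “polymorphic” call syntax, but the dispatch is resolved entirely at compile time, so no virtual function is ever needed in device code.

template<typename Derived>
struct UnaryOp                      // CRTP base class: no virtual functions
{
    __device__ float operator()(float x) const
    {
        // Static dispatch to the derived class, resolved by the compiler.
        return static_cast<const Derived &>(*this).apply(x);
    }
};

struct Square : UnaryOp<Square>
{
    __device__ float apply(float x) const { return x * x; }
};

struct Negate : UnaryOp<Negate>
{
    __device__ float apply(float x) const { return -x; }
};

template<typename Op>
__global__ void map_kernel(const float *in, float *out)
{
    Op op;                                      // the operation is fixed at compile time
    out[threadIdx.x] = op(in[threadIdx.x]);     // inlined, no indirect call
}

// e.g. map_kernel<Square><<<1, 256>>>(d_in, d_out);
//      map_kernel<Negate><<<1, 256>>>(d_in, d_out);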

Here is the (slightly simplified) code performing reductions in my library.

It’s not self-contained, but should be enough to show the kind of thing you can do.

Edit: I cannot format it properly inside [code] tags, so it is given as an attachment…

Here is how it looks from the client side (much better :) ):

__global__ void reduction_test(float * data_in, float * data_out)
{
    float x = data_in[threadIdx.x];

    // Compute the minimum of the 256 values read
    float y = reduce<256, root_shared_stack_frame>(x, minimum<float>(), threadIdx.x);

    if(threadIdx.x == 0) {
        data_out[0] = y;
    }
}

It uses static recursion with templates to generate the reduction code. All the control logic is computed at compile time, so you end up with only the code that actually does the computation and nothing else.
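Roughly, the recursion looks like this (a heavily simplified sketch of the general technique, not the attached code; ReduceStep and block_reduce are names I made up, N is assumed to be a power of two, and Op is any functor like the minimum<float> above):

template<int N, typename T, typename Op>
struct ReduceStep
{
    __device__ static void run(T *buf, unsigned int tid, Op op)
    {
        if (tid < N / 2)
            buf[tid] = op(buf[tid], buf[tid + N / 2]);
        __syncthreads();
        ReduceStep<N / 2, T, Op>::run(buf, tid, op);   // expanded at compile time
    }
};

template<typename T, typename Op>
struct ReduceStep<1, T, Op>            // terminator: a single value remains
{
    __device__ static void run(T *, unsigned int, Op) {}
};

template<int N, typename T, typename Op>
__device__ T block_reduce(T x, Op op, unsigned int tid)
{
    __shared__ T buf[N];
    buf[tid] = x;
    __syncthreads();
    ReduceStep<N, T, Op>::run(buf, tid, op);
    return buf[0];                     // every step ends with a __syncthreads()
}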

I checked the generated code using Decuda, and it looks almost optimal. Constant propagation and dead code removal were correctly performed after template expansion.

I had to write a few classes not shown here to handle shared memory allocation, since I’m not satisfied with what CUDA offers out of the box (no shared-memory reuse is possible).
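The basic idea behind “reuse” is to carve offsets out of a single dynamically sized buffer by hand, roughly like this (a simplified sketch of the general idea, not the actual allocator classes; two_phase_kernel is just a made-up example):

extern __shared__ unsigned char smem_pool[];

__global__ void two_phase_kernel(const float *in, int *count_out)
{
    // Phase 1: the pool is used as a float staging buffer.
    float *stage = reinterpret_cast<float *>(smem_pool);
    stage[threadIdx.x] = in[blockIdx.x * blockDim.x + threadIdx.x];
    __syncthreads();
    float v = stage[threadIdx.x] * 0.5f;      // some per-thread work
    __syncthreads();                          // phase 1 is done with the pool

    // Phase 2: the very same bytes now hold per-thread integer flags,
    // instead of declaring a second __shared__ array.
    int *flags = reinterpret_cast<int *>(smem_pool);
    flags[threadIdx.x] = (v > 1.0f) ? 1 : 0;
    __syncthreads();

    if (threadIdx.x == 0) {
        int count = 0;
        for (unsigned int i = 0; i < blockDim.x; ++i)
            count += flags[i];
        count_out[blockIdx.x] = count;
    }
}

// Launched with blockDim.x * sizeof(float) bytes of dynamic shared memory,
// e.g. two_phase_kernel<<<grid, 256, 256 * sizeof(float)>>>(d_in, d_count);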

Of course, C++ in device code is not officially supported by NVIDIA, so if you run into any problem, you’re on your own…
cuda_cpp_sample.cpp (2.71 KB)

VOOOWW!!

I did not know we could do all this stuff!! But that said, I am a zero in C++…

But it’s interesting to know about such possibilities!!

Thank you so much for sharing this code!

Also interesting to know that NVIDIA does not officially support this…

Nope, sorry.

Well, GDDR5 in a 4870 puts out 115 GB/s on a 256-bit bus. If you move that up to 512-bit you have 230 GB/s. That’s feasible today. I think we can say we’ll double memory frequency in the next five years. That’s 460 GB/s. Then we can say NVIDIA is fudging the numbers and summing two GPUs, or we can deduce NVIDIA thinks we’ll move to 1024-bit buses by that time. A 1024-bit bus isn’t infeasible by any means. It’s just a matter of cost. It requires a large chip package with a boatload of pins (this isn’t unreasonable) and a PCB with double the layers (not higher technology, just more PCBs made and glued together).
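For what it’s worth, the arithmetic is just bus width times effective data rate; a quick host-side sanity check (the 7.2 GT/s figure is purely illustrative for the “doubled frequency” case):

// peak GB/s = (bus width in bits / 8) * effective data rate in GT/s
// (GDDR5 at 900 MHz is quad-pumped to 3.6 GT/s)
#include <cstdio>

static double peak_gbps(int bus_bits, double gtps) { return bus_bits / 8.0 * gtps; }

int main()
{
    std::printf("256-bit  @ 3.6 GT/s : %6.1f GB/s\n", peak_gbps(256, 3.6));   // ~HD 4870 today
    std::printf("512-bit  @ 3.6 GT/s : %6.1f GB/s\n", peak_gbps(512, 3.6));   // wider bus
    std::printf("512-bit  @ 7.2 GT/s : %6.1f GB/s\n", peak_gbps(512, 7.2));   // doubled frequency
    std::printf("1024-bit @ 7.2 GT/s : %6.1f GB/s\n", peak_gbps(1024, 7.2));  // approaching 1 TB/s
    return 0;
}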

But it’ll be a whole other story pushing the bus frequency above 7 GHz or the bus width past 1024 bits. Perhaps optical links will come to the rescue, as has been prophesied.

That’s all reasonable progression. But there could be architecture changes too…

imagine adding a new class of “fast” memory, similar to an L3 cache, probably off-die but in the same chip package, that gives a super-fast (say) 16 MB of storage. That’d be awesomely useful… faster and lower latency than global (board) memory, and much larger than current on-die shared memory. (This new hypothetical memory wouldn’t have to be an L3 cache, but it would probably be of similar size and speed.)

If such a memory layer were added, you could quite reasonably add its bandwidth to the count. L3 caches today do about 75 GB/s, but a GPU version could be faster since it wouldn’t be a hardware-managed cache. So, making up numbers, I could see a new class of 250 GB/s off-die memory added to the hardware… certainly helping get to that 1 TB/s projection.

All speculation of course. It’s fun to speculate. But boy, thinking about such a feature is nice; current shared memory is very constraining!

Yup yup. No, what you’d actually do by 2013 is stack that memory on top of the GPU. So you’re not using PCB traces to connect them (which is the whole problem), you’re using the surface area of the chips (almost as if it’s one piece of silicon). That could give you [pulls large number out of his ass] 10 TB/s of bandwidth. Basically you would move shared memory, the caches, maybe even the registers to that 2nd die. And indeed, 16 MB would be sweet. Then again, someone will come along and say, “but I wanna use that 2nd die for logic”, and so we’re back to square one.

I wouldn’t want to be the engineer thinking about thermal design in a 2-layer die… ;)

… but I would happily volunteer to try out a piece of HW that tears the memory wall apart and still clocks in reasonably well in green computing ;)

Speculation is indeed fun! To drive this to the max: What would quantum GPUs do in terms of the memory wall problem? Use an infinite stack of memory? :)

Speaking of which, has any attendee gotten their Quadro yet? They did say it’d be about a month, so it should be soon.

No, not yet, but I had the impression the questionnaire also went out a bit later than expected.

Looks like the first reviews of the new Quadro have been released: the FX4800, the 55 nm GT200B 192-SP 1.5 GB version.
http://www.3dprofessor.org/Reviews%20Folder%20Pages/FX4800/FX4800P1.htm

So, to bump this thread… does this mean that NVISION attendees will get their Quadros soon?

I received a tracking number last Friday. It hasn’t left the US yet, but it is on the move, so I guess they’ve started to ship them to the attendees. :)

Just received it :)

Denis, congrats! What is it, an FX4800?

No word about mine yet… not even an email. I’ll be patient for a little longer. :-)