CUDA 2.2 feature wishlist!

Hi, I have downloaded the CUDA 2.2 beta and I have some questions:

First about the runtime:

CORRECTION: support for interop with OpenGL Texture Objects is coming
in a future release. Instead, this release includes improved interop
performance for OpenGL buffer objects in the following situations:

  • Single adapter MultiMon
  • multiple GPUs when GL and CUDA are running on the same GPU
  • multiple GPUs when GL and CUDA are running on different GPUs

Does this mean 2.2 final or the next release (say 2.3)?
Also, texture objects would avoid some of the map/register and unmap/unregister overhead by sharing (not copying) these objects between CUDA and OpenGL.

Now, is the current CUDPP release expected to build with the 2.2 toolkit? (In 2.1 there were issues with templates, I think, which prevented compiling it.)
Also, I was expecting an updated CUDPP with the new fast sort implementation from Garland et al., “Designing efficient sorting algorithms for manycore GPUs”.

Now about the SDK:
The SDK seems a lot like 2.1, except for the added deviceQueryDrv and SobolQRNG.
I was expecting to find the sparse matrix-vector implementation from Garland and Bell, “Efficient sparse matrix-vector multiplication on CUDA”.

It would be impressive if NVIDIA added as sample projects some other recent work from its employees, like:
“Real Time 3D Fluid and Particle Simulation and Rendering” from Sarah Tariq et al. in CUDA Zone
“3D Finite Difference Computation on GPUs using CUDA” by Paulius Micikevicius
a port of Linpack for CUDA in “Accelerating Linpack with CUDA on heterogenous clusters” by Massimiliano Fatica

Also, I hope the big feature to be included in CUDA 2.2 final and not in the beta is one of these two:
1. CUDA multicore


Textures: I think this is 2.3.
CUDPP: ask Mark Harris, he’s the guy that knows all about CUDPP.
SDK: there are more things coming for 2.2 final that weren’t quite finished. The 3D FDTD and Linpack apps aren’t exactly suitable for SDK samples…
And no, the last feature is neither of those two.

Also, man, we put all these features in and it’s not good enough :(

I’ll soon install 2.2 but regarding the last sentence, I just had to comment…

I think CUDA (and of course the underlying hardware) is one of the best third-party APIs/products/hardware I’ve seen in my 12 years of experience. Of course nothing is perfect, but I think you’ve done a great job in all aspects: hardware, CUDA, specs, SDK examples and of course these great newsgroups…

just had to share my thoughts and respect toward CUDA… great job.

my 1 cent :)


Hey, here is a very happy camper (gdb-64, the profiler improvements (if I can find it ;)), zero-copy). But I am curious about multicore, since that was sort of expected for the last release already.

Oh and we want NVISION ;)

If I understand zero-copy correctly, it means a running GPU program can establish live communication with a running CPU program, which opens all sorts of possibilities! With the right protocol, a host program could conceivably provide a lot of host services to GPU programs: printf, fopen (!), attaching to and debugging live GPU programs… Oh, the possibilities!

As far as I can see, that is indeed possible if you only have one kernel running. Then you can set a switch to let the GPU know it can start processing, and the GPU could set a switch to let the host know it is finished (with the switches in host memory). But I don’t know if that really has a benefit compared to starting a kernel and waiting until it finishes.
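The "switch in host memory" idea can be sketched roughly as follows. This is only an illustration, not code from anyone in this thread: the `Mailbox` struct, the `produce` kernel and the `run_handshake` helper are hypothetical names, and `cudaDeviceSynchronize` and `__threadfence_system` are runtime calls from toolkits newer than the 2.2 beta discussed here (as far as I know), so treat it as a sketch under those assumptions. It covers only the "GPU tells the host it is finished" half, with the flag living in mapped, pinned host memory.

```cuda
#include <cuda_runtime.h>

// Hypothetical mailbox shared by host and GPU (mapped, pinned host memory).
struct Mailbox {
    volatile int done;    // set to 1 by the GPU once the result is written
    volatile int result;  // payload written by the GPU
};

// A single thread writes a result, then raises the "finished" switch.
__global__ void produce(Mailbox *box, int input)
{
    box->result = input * 2;
    __threadfence_system();  // order the payload write before the flag write
    box->done = 1;
}

// Returns the value written by the GPU, or -1 if any CUDA call fails.
int run_handshake(int input)
{
    Mailbox *host_box = nullptr;
    Mailbox *dev_box = nullptr;

    // Older toolkits need this before the context exists; newer ones map by
    // default and may report an error here, which is cleared and ignored.
    cudaSetDeviceFlags(cudaDeviceMapHost);
    cudaGetLastError();

    if (cudaHostAlloc((void **)&host_box, sizeof(Mailbox),
                      cudaHostAllocMapped) != cudaSuccess)
        return -1;
    if (cudaHostGetDevicePointer((void **)&dev_box, host_box, 0) != cudaSuccess) {
        cudaFreeHost(host_box);
        return -1;
    }

    host_box->done = 0;
    host_box->result = 0;

    produce<<<1, 1>>>(dev_box, input);  // launch returns immediately

    // Poll the switch while the kernel runs; bounded so it cannot hang forever.
    for (long spins = 0; spins < 100000000L && host_box->done == 0; ++spins) {
    }

    // Synchronize anyway, so the result does not depend on the polling alone.
    cudaError_t err = cudaDeviceSynchronize();
    int value = (err == cudaSuccess && host_box->done == 1) ? host_box->result : -1;

    cudaFreeHost(host_box);
    return value;
}
```

Calling `run_handshake(21)` from a `main` should hand back the doubled input on a device that supports mapped host memory. Whether polling like this actually beats launching a kernel and waiting for it is exactly the open question above; the final synchronize is there because the ordering of writes across PCIe is not something this sketch tries to prove.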

If you want the 3DFD code send me a message.


Oh sorry, I didn’t want to come across as such a demanding guy. I think everyone at NVIDIA can be very proud of their work… I think it’s my fault for visiting CUDA Zone too often! :rolleyes:

And thank you for clarifying the questions!

Also, I think I’ve found another feature of CUDA 2.2 that has not been announced and could be the missing one…

I have found that the GeForce drivers for CUDA 2.2 on Windows install an nvcuvenc.dll, similar in name to the existing nvcuvid.dll. The NVCUVID library was introduced in 2.0 and basically allows offloading video decoding to the GPU and making the decoded frames available to CUDA programs…

Judging by the name, this library performs video encoding on the GPU with CUDA (from inspecting the DLL, it seems to cover MPEG1, MPEG2, MPEG4 and H264)… but there is no header for using it and no sample in the SDK…

Assuming this is not the big feature, can we expect this library to be usable by all developers, or is it not for general developer usage and only available to some video product partners (Cyberlink, Nero, etc.)?


Hey, nice info! Any official comment on that? Plus, why has nvcuvid.lib been removed from CUDA SDK 2.1/2.2?

My official guess for the surprise feature is PTX 1.4 + debugging on the device. If you look at the PTX 1.3 manual, there are some instructions reserved for use with DWARF, but it says that they are not yet implemented. If it’s not in CUDA 2.3, I would think that is still going to appear in some future release.

But that is already in the beta…

Heh :">

I haven’t signed up yet so I wouldn’t know…looks like I am behind the times.

Debugging was already in 2.1 ;) New in 2.2beta is debugging on 64bit linux.

Hi. As you indicated in the forum, I am sending you a message requesting for the 3DFD code. Thank you very much. My email is mpeyvandi09@yahoo.com.


except you have PCIe communication problems (it’s been mentioned somewhere in the forums) … it’s probably worse than “not sequentially consistent”

he said on the device. cuda-gdb is host-only, no?

why would something called cuda-gdb not touch the device? it’s cuda-gdb for a reason–of course it lets you do debugging on the device

Speaking of which…tmurray, have you guys looked at the Beta version of Visual Studio 2010? The new parallel task debugging window looks awesome! I’d love to have something like that for CUDA…


I don’t spend much time in Windows these days, and when I do I’m generally not running Visual Studio. But I do know that A. VS2010 is coming out and B. I would like to support it a lot faster than it took us to support VS2008. Not sure how that kind of thing would work for CUDA, though, since it’s not like deadlock and traditional parallel pitfalls (well, race conditions) really happen in a valid CUDA kernel. Per-CTA tracking would probably be useful, though.