CUDA 2.2 beta features

tmurray · March 18, 2009, 8:31pm

The CUDA 2.2 beta is available to registered developers. If you want to become a registered developer, sign up here.

A brief overview of CUDA 2.2 beta features:

Zero-copy support (see this thread for more information)
Asynchronous memcpy on Vista/Server 2008
Texturing from pitch-linear memory
cuda-gdb for 64-bit Linux (it is pretty great)
OGL interop performance improvements
CUDA profiler supports a lot more counters on GT200. I think this includes memory bandwidth counters (counters for each transaction size) and instruction count. In other words, you can very easily determine if you’re bandwidth limited or compute limited, which makes it far more useful than it used to be.
CUDA profiler works on Vista
4GB of pinned memory in a single allocation (except in Vista, where the limit is still 256MB per allocation, but I think this is going to be raised between now and the final release)
Blocking sync for all platforms. I’m not entirely sure whether this made it into the headers for the beta; I’ve heard conflicting reports and need to check this afternoon. Basically, it’s a context creation flag: instead of spinlocking or spinlocking+yielding while a thread waits for the GPU, the thread sleeps and the driver wakes it up when the event has completed. It’s not the default mode because you’re at the mercy of the OS thread scheduler, which will sometimes increase latency, but if you want to minimize CPU utilization, it’s very nice.
Officially supports Ubuntu 8.10, RHEL 5.3, Fedora 10

There’s one last feature that didn’t make it in the beta that I think is the best feature in 2.2 (even compared to the dramatically improved profiler, zero-copy and the 64-bit debugger), but I don’t want to spoil it…

Edit: Here’s the 2.2 beta programming guide.

edit 2: I am bad at not revealing surprises. There’s still a second surprise in the final release for Windows users, though.

edit 3: Surprise 2: a test version of /MD CUDART. I revealed it because I want feedback on it and whether anyone has objections to moving everything over to /MD going forward.

tmurray · March 18, 2009, 9:43pm

Some more features:

There are a number of new device functions:

__brev(), __brevll() 32-bit and 64-bit bit reversal
__frcp_r{n,z,u,d}() single-precision reciprocal with IEEE rounding
__fsqrt_r{n,z,u,d}() single-precision square root with IEEE rounding
__fdiv_r{n,z,u,d}() single-precision division with IEEE rounding
__fadd_r{u,d}() single-precision addition with directed rounding
__fmul_r{u,d}() single-precision multiplication with directed rounding
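For readers unfamiliar with bit reversal, the operation that __brev() and __brevll() perform can be sketched in portable C++ roughly as below. This is a host-side illustration only, not NVIDIA's implementation; the names reverse_bits32 and reverse_bits64 are hypothetical and do not appear in the toolkit.

```cpp
#include <cstdint>

// Hypothetical helper: reverse the bit order of a 32-bit word by swapping
// progressively larger groups (single bits, pairs, nibbles, bytes, halves).
inline std::uint32_t reverse_bits32(std::uint32_t x) {
    x = ((x >> 1) & 0x55555555u) | ((x & 0x55555555u) << 1);  // swap adjacent bits
    x = ((x >> 2) & 0x33333333u) | ((x & 0x33333333u) << 2);  // swap bit pairs
    x = ((x >> 4) & 0x0F0F0F0Fu) | ((x & 0x0F0F0F0Fu) << 4);  // swap nibbles
    x = ((x >> 8) & 0x00FF00FFu) | ((x & 0x00FF00FFu) << 8);  // swap bytes
    return (x >> 16) | (x << 16);                             // swap 16-bit halves
}

// Hypothetical helper: reverse a 64-bit word by reversing each 32-bit half
// and exchanging the two halves.
inline std::uint64_t reverse_bits64(std::uint64_t x) {
    std::uint64_t lo = reverse_bits32(static_cast<std::uint32_t>(x));
    std::uint64_t hi = reverse_bits32(static_cast<std::uint32_t>(x >> 32));
    return (lo << 32) | hi;
}
```

With these definitions, reverse_bits32(1u) gives 0x80000000u, and applying either function twice returns the original value.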

__threadfence(): I’m not sure if there are docs for this yet. It’s kind of hard to explain, and I forget its exact behavior, so I’m not going to comment too much about it here.

Context creation flags can now be set in CUDART.

jack · March 18, 2009, 9:52pm

One other function that might be neat to have would be a byte-order reversal method. Though CUDA only runs on little-endian systems, there are times when certain file types store their information in big-endian format; converting values like integers between endianness on the CPU could be a big bottleneck in those cases, but it is also something the GPU could do in massively parallel fashion. Perhaps versions for 2-byte, 4-byte, and 8-byte values (which would obviously work for both floating-point and integer types).
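The byte-order reversal requested here can be sketched in plain C++ with shifts and masks. This is an illustration of the idea, not an existing CUDA function; the names byteswap16, byteswap32, byteswap64, and float_from_big_endian_bits are hypothetical, and the float helper assumes a little-endian host with 32-bit IEEE floats.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helpers: reverse the byte order of 2-, 4-, and 8-byte values.
inline std::uint16_t byteswap16(std::uint16_t x) {
    return static_cast<std::uint16_t>((x >> 8) | (x << 8));
}

inline std::uint32_t byteswap32(std::uint32_t x) {
    return (x >> 24) | ((x >> 8) & 0x0000FF00u) |
           ((x << 8) & 0x00FF0000u) | (x << 24);
}

inline std::uint64_t byteswap64(std::uint64_t x) {
    std::uint64_t lo = byteswap32(static_cast<std::uint32_t>(x));
    std::uint64_t hi = byteswap32(static_cast<std::uint32_t>(x >> 32));
    return (lo << 32) | hi;
}

// Floating-point data is handled by swapping the raw bit pattern first and
// only then reinterpreting it as a float. 'raw' is the 32-bit word exactly as
// read from a big-endian file on a little-endian machine.
inline float float_from_big_endian_bits(std::uint32_t raw) {
    std::uint32_t bits = byteswap32(raw);
    float value;
    std::memcpy(&value, &bits, sizeof value);  // well-defined type pun
    return value;
}
```

In a kernel, each thread would presumably apply one of these to its own element, which is what makes the conversion a natural fit for the GPU.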

Also, will there ever be support for non-nVidia chipsets using the zero-copy methods (even if it’s not for another few releases)? As I wrote in one of the other threads, I’m looking at building a new development machine later this year (when PCIe 3.0 and SATA 6Gbps are available), and I’d like to get something that is supported.

AndreiB · March 18, 2009, 10:11pm

Are there any limitations on device compute capability for using __brev() and __brevll()?

Yes, there is something about it in the 2.2 Programming Guide.

tmurray · March 18, 2009, 10:42pm

Yeah, I haven’t looked at the docs for 2.2 yet…

There’s no limitation on device capability for those two functions. There’s another function left out of the earlier post:

__fmaf_r{n,z,u,d}() // single-precision fused multiply-add with IEEE rounding

These are all done in software, so they’re primarily for convenience, not speed.
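As background on what "fused" means for the multiply-add above: the product is not rounded before the addition, so only one rounding happens in total. A minimal host-side C++ sketch of that difference, using std::fma from the standard library rather than the CUDA intrinsic, and covering only round-to-nearest (the directed-rounding variants are not shown); the function names are hypothetical:

```cpp
#include <cmath>

// Computes a*a - t with two roundings: the product is rounded to float
// before the subtraction. 'volatile' forces the intermediate to be stored
// as a float so the compiler cannot contract the expression into an FMA.
inline float residual_unfused(float a, float t) {
    volatile float product = a * a;
    return product - t;
}

// Computes a*a - t with a single rounding via fused multiply-add.
inline float residual_fused(float a, float t) {
    return std::fma(a, a, -t);
}
```

For example, with a = 1 + 2^-12 and t = 1 + 2^-11, the exact product is 1 + 2^-11 + 2^-24. The unfused version rounds that product to 1 + 2^-11 and returns 0, while the fused version keeps the low-order term and returns 2^-24.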

E.D_Riedijk · March 18, 2009, 11:10pm

Since you are in such a talkative mode (apart from the surprise still in store for us ;))
The PTX ISA has been raised to 1.4, while the compute capability is still at 1.3. As far as I know, the two used to move in sync. Does this mean that next-generation hardware will be compute capability 2.0?

Apart from that I can’t wait to upgrade my dev box from FC8 to FC10 so I can install the 2.2 beta (profiler and debugger, here I come). Anyone have any tips for upgrading 8 → 10?

tmurray · March 18, 2009, 11:37pm

I can’t comment on what the future holds, sorry. (Unless, of course, you want to know about CUDA 2.2…)

Sarnath · March 19, 2009, 5:22am

Add a 2 :D

SPWorley · March 19, 2009, 7:37am

Ohh these are awesome. Simple little trivial functions but handy!

These can be really useful in FFTs, and also in random number generation and seeding… I even posted to the wishlist thread!

Are they supported natively by all hardware? One clock ops?

Sarnath · March 19, 2009, 7:46am

I would suspect 2 clocks if there is only one register involved: one clock to open the gates in parallel to another register, and another to copy that register back normally.

If it involves 2 registers, then it will take one clock.

Just my crude guesses…

tmurray · March 19, 2009, 7:48am

No, they’re done in software, not hardware. Feature requests from the forums, basically.

pvonkaenel · March 19, 2009, 12:13pm

Whoa, are these basically GPU intrinsics? I’m very new to CUDA development, and would love more information on calls like these, but must have missed them in the docs. Could someone please point me to where I can learn more about GPU intrinsics?

Thanks,

Peter

SPWorley · March 19, 2009, 12:31pm

(re new intrinsics for bit reversal, etc)

Out of curiosity, are these kinds of library functions done at some lower level of coding that’s more efficient? (Some microcode kind of access?) Or are they just sort of wrappers around the kinds of calls we could theoretically do ourselves in PTX?

There are so many layers of abstraction in any architecture, but in CUDA there are even more than most, and I’m just curious if the layer of abstraction below intrinsic functions is something powerful and promising for future (pleasant) surprises like this.

(BTW, please give those low level hackers a thumbs up from us all…)

MisterAnderson42 · March 19, 2009, 12:45pm

Appendix B of the CUDA Programming Guide.

Sarnath · March 19, 2009, 1:34pm

I would expect it to be a “C” Macro probably using some “hardware” feature to do the bit reversing.

Because the argument that you pass could be a “Shared memory”, “local memory”, “local variable” or anything… That should not matter.

That would work only with a macro-like thing…

YDD · March 19, 2009, 1:38pm

Does zero-copy get support on the new MacBooks? I’d like to make a business case for spending someone else’s money, but I can’t find official word on which motherboard the MacBook uses (merely plenty of rumours).

tmurray · March 19, 2009, 4:26pm

The new MacBooks and MacBook Pros support zero-copy.

I don’t know that there’s anything magic about how we’re doing these intrinsics. I think the answer is probably not. They’re really just there for convenience.

YDD · March 19, 2009, 5:22pm

On both GPUs for the MacBook Pro? It could be very interesting to investigate the effect of the PCIe bus on transfer latency & bandwidth.

tmurray · March 19, 2009, 5:44pm

The 9400M supports zero-copy (and copy elimination), the 9600M supports neither.

tmurray · March 19, 2009, 6:07pm

Even more new features:
