CUDA 2.2 beta features

The CUDA 2.2 beta is available to registered developers. If you want to become a registered developer, sign up here.

A brief overview of CUDA 2.2 beta features:

  • Zero-copy support (see this thread for more information)
  • Asynchronous memcpy on Vista/Server 2008
  • Texturing from pitch-linear memory
  • cuda-gdb for 64-bit Linux (it is pretty great)
  • OGL interop performance improvements
  • CUDA profiler supports a lot more counters on GT200. I think this includes memory bandwidth counters (counters for each transaction size) and instruction count. In other words, you can very easily determine if you’re bandwidth limited or compute limited, which makes it far more useful than it used to be.
  • CUDA profiler works on Vista
  • 4GB of pinned memory in a single allocation (except on Vista, where the limit is still 256MB per allocation, but I think this is going to be raised between now and the final release)
  • Blocking sync for all platforms. I’m not entirely sure whether this made it into the headers for the beta; I’ve heard conflicting reports and need to check this afternoon. Basically, it’s a context creation flag: instead of spinlocking (or spinlocking and yielding) while a thread waits for the GPU, the thread sleeps and the driver wakes it when the event has completed. It’s not the default mode because you’re at the mercy of the OS thread scheduler, which will sometimes increase latency, but if you want to minimize CPU utilization, it’s very nice.
  • Officially supports Ubuntu 8.10, RHEL 5.3, Fedora 10

There’s one last feature that didn’t make it in the beta that I think is the best feature in 2.2 (even compared to the dramatically improved profiler, zero-copy and the 64-bit debugger), but I don’t want to spoil it…

Edit: Here’s the 2.2 beta programming guide.

edit 2: I am bad at not revealing surprises. There’s still a second surprise in the final release for Windows users, though.

edit 3: Surprise 2: a test version of /MD CUDART. I revealed it because I want feedback on it and whether anyone has objections to moving everything over to /MD going forward.

Some more features:

  • There are a number of new device functions:

__brev(), __brevll() 32-bit and 64-bit bit reversal
__frcp_r{n,z,u,d}() single-precision reciprocal with IEEE rounding
__fsqrt_r{n,z,u,d}() single-precision square root with IEEE rounding
__fdiv_r{n,z,u,d}() single-precision division with IEEE rounding
__fadd_r{u,d}() single-precision addition with directed rounding
__fmul_r{u,d}() single-precision multiplication with directed rounding

__threadfence(): I’m not sure if there are docs for this yet. It’s kind of hard to explain, and I forget its exact behavior, so I’m not going to comment too much about it here.

  • Context creation flags can now be set in CUDART.

One other function that might be neat to have would be a byte-order reversal method. Though CUDA only runs on little-endian systems, there are times when certain file types store their information in big-endian format; converting values like integers between endianness on the CPU could be a big bottleneck in those cases, but it is also something the GPU could do in massively parallel fashion. Perhaps versions for 2-byte, 4-byte, and 8-byte values (which would obviously work for both floating-point and integer types).
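The byte-order reversal requested above can already be written by hand with shifts and masks. Here is a minimal sketch in plain C, assuming a little-endian host (as the post notes, CUDA only runs on little-endian systems). The names `bswap16_sw`, `bswap32_sw`, `bswap64_sw`, and `float_from_be_bits` are hypothetical helpers made up for illustration, not part of CUDA; the same bodies should also work as `__device__` functions, though that is untested here.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: reverse the two bytes of a 16-bit value. */
static inline uint16_t bswap16_sw(uint16_t x) {
    return (uint16_t)((x << 8) | (x >> 8));
}

/* Hypothetical helper: reverse the four bytes of a 32-bit value. */
static inline uint32_t bswap32_sw(uint32_t x) {
    return (x << 24)
         | ((x & 0x0000FF00u) << 8)
         | ((x >> 8) & 0x0000FF00u)
         | (x >> 24);
}

/* Hypothetical helper: reverse the eight bytes of a 64-bit value by
 * swapping each 32-bit half and exchanging the halves. */
static inline uint64_t bswap64_sw(uint64_t x) {
    return ((uint64_t)bswap32_sw((uint32_t)x) << 32)
         | (uint64_t)bswap32_sw((uint32_t)(x >> 32));
}

/* Hypothetical helper: interpret a 32-bit word that was read from a
 * big-endian file as a float. The swap is done on the integer bits and
 * the result is copied into a float, so no float ever holds the
 * unswapped pattern. Assumes a little-endian host. */
static inline float float_from_be_bits(uint32_t be_bits) {
    uint32_t bits = bswap32_sw(be_bits);
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

Doing the swap on the integer bit pattern is what lets one routine per width serve both integer and floating-point data, as the post suggests.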

Also, will there ever be support for non-nVidia chipsets using the zero-copy methods (even if it’s not for another few releases)? As I wrote in one of the other threads, I’m looking at building a new development machine later this year (when PCIe 3.0 and SATA 6Gbps are available), and I’d like to get something that is supported.

Are there any limitations on device compute capability for using __brev() and __brevll()?

Yes, there is something about it in 2.2 Programming Guide.

Yeah, I haven’t looked at the docs for 2.2 yet…

There’s no limitation on device capability for those two functions. There’s another function left out of the earlier post:

__fmaf_r{n,z,u,d}() // single-precision fused multiply-add with IEEE rounding

These are all done in software, so they’re primarily for convenience, not speed.
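For a sense of what a software bit reversal can look like, here is a minimal sketch in plain C using the usual swap-adjacent-groups trick. This is not NVIDIA's implementation, which the thread does not show; `brev32_sw` and `brev64_sw` are hypothetical names chosen so they don't clash with the real `__brev()` and `__brevll()`.

```c
#include <stdint.h>

/* Hypothetical software equivalent of a 32-bit bit reversal.
 * Each step swaps neighbouring groups of bits, doubling the group
 * size: single bits, pairs, nibbles, bytes, then 16-bit halves. */
static inline uint32_t brev32_sw(uint32_t x) {
    x = ((x >> 1) & 0x55555555u) | ((x & 0x55555555u) << 1);
    x = ((x >> 2) & 0x33333333u) | ((x & 0x33333333u) << 2);
    x = ((x >> 4) & 0x0F0F0F0Fu) | ((x & 0x0F0F0F0Fu) << 4);
    x = ((x >> 8) & 0x00FF00FFu) | ((x & 0x00FF00FFu) << 8);
    return (x >> 16) | (x << 16);
}

/* Hypothetical 64-bit version: reverse each 32-bit half, then
 * exchange the halves. */
static inline uint64_t brev64_sw(uint64_t x) {
    return ((uint64_t)brev32_sw((uint32_t)x) << 32)
         | (uint64_t)brev32_sw((uint32_t)(x >> 32));
}
```

The 32-bit version takes five shift-and-mask steps, which fits the point that these functions are a convenience and not a single fast instruction.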

Since you are in such a talkative mode (apart from the surprise still in store for us ;))
The PTX ISA has been raised to 1.4, while the compute capability is still at 1.3. As far as I know, the two used to move in sync. Does this mean that next-generation hardware will be compute capability 2.0?

Apart from that I can’t wait to upgrade my dev box from FC8 to FC10 so I can install the 2.2 beta (profiler and debugger, here I come). Anyone have any tips for upgrading 8 → 10?

I can’t comment on what the future holds, sorry. (Unless, of course, you want to know about CUDA 2.2…)

Add a 2 :D

Ohh these are awesome. Simple little trivial functions but handy!

These can be really useful in FFTs, and also in random number generation and seeding… I even posted to the wishlist thread!
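To illustrate the FFT use mentioned above: a radix-2 FFT typically reorders its input so that element i ends up at the index whose log2(n)-bit pattern is i reversed. The following is a minimal host-side sketch in plain C under that assumption. `brev32_sw` and `bit_reverse_permute` are hypothetical names; in a kernel, the software reversal would presumably be replaced by a call to `__brev()`.

```c
#include <stdint.h>

/* Hypothetical software stand-in for a 32-bit bit reversal. */
static inline uint32_t brev32_sw(uint32_t x) {
    x = ((x >> 1) & 0x55555555u) | ((x & 0x55555555u) << 1);
    x = ((x >> 2) & 0x33333333u) | ((x & 0x33333333u) << 2);
    x = ((x >> 4) & 0x0F0F0F0Fu) | ((x & 0x0F0F0F0Fu) << 4);
    x = ((x >> 8) & 0x00FF00FFu) | ((x & 0x00FF00FFu) << 8);
    return (x >> 16) | (x << 16);
}

/* Reorder data[0 .. n-1] into bit-reversed index order, where
 * n = 1 << log2n. Requires 1 <= log2n <= 31 so that both shifts
 * below are well defined. */
static inline void bit_reverse_permute(float *data, unsigned log2n) {
    uint32_t n = 1u << log2n;
    for (uint32_t i = 0; i < n; ++i) {
        /* Reverse all 32 bits, then keep only the top log2n of them. */
        uint32_t j = brev32_sw(i) >> (32u - log2n);
        if (i < j) {            /* swap each pair exactly once */
            float tmp = data[i];
            data[i] = data[j];
            data[j] = tmp;
        }
    }
}
```

For n = 8, the indices 0..7 map to 0, 4, 2, 6, 1, 5, 3, 7.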

Are they supported natively by all hardware? One clock ops?

I would suspect 2 clocks if there is only one register involved: one clock to open gates in parallel to another register and another to copy that register back normally.

If it involves 2 registers then it will take one clock.

Just my crude guesses…

No, they’re done in software, not hardware. Feature requests from the forums, basically.

Whoa, are these basically GPU intrinsics? I’m very new to CUDA development, and would love more information on calls like these, but must have missed them in the docs. Could someone please point me to where I can learn more about GPU intrinsics?

Thanks,

Peter

(re new intrinsics for bit reversal, etc)

Out of curiosity, are these kinds of library functions done at some lower level of coding that’s more efficient? (Some microcode kind of access?) Or are they just sort of wrappers around the kinds of calls we could theoretically do ourselves in PTX?

There are so many layers of abstraction in any architecture, but in CUDA there are even more than most, and I’m just curious if the layer of abstraction below intrinsic functions is something powerful and promising for future (pleasant) surprises like this.

(BTW, please give those low level hackers a thumbs up from us all…)

Appendix B of the CUDA programming guide.

I would expect it to be a “C” Macro probably using some “hardware” feature to do the bit reversing.

Because the argument that you pass could be a “Shared memory”, “local memory”, “local variable” or anything… That should not matter.

That would work only with a macro-like thing…

Does zero-copy get support on the new MacBooks? I’d like to make a business case for spending someone else’s money, but I can’t find official word on which motherboard the MacBook uses (merely plenty of rumours).

The new MacBooks and MacBook Pros support zero-copy.

I don’t know that there’s anything magic about how we’re doing these intrinsics. I think the answer is probably not. They’re really just there for convenience.

On both GPUs for the MacBook Pro? It could be very interesting to investigate the effect of the PCIe bus on transfer latency & bandwidth.

The 9400M supports zero-copy (and copy elimination), the 9600M supports neither.

Even more new features: