CUDA Toolkit and SDK 2.3 betas available to registered developers

The CUDA Toolkit and SDK 2.3 betas are now available to registered developers. They include the following features:

    The CUFFT Library now supports double-precision transforms and includes significant performance improvements for single-precision transforms as well. See the CUDA Toolkit release notes for details.

    The CUDA-GDB hardware debugger is now available for all supported Linux platforms and is included in the CUDA Toolkit installer.

    GPUs in an SLI group are now enumerated individually, so you can achieve multi-GPU performance even when SLI is enabled for graphics.

    New support for fp16 <-> fp32 conversion intrinsics allows storage of data in fp16 format with computation in fp32. The fp16 format is ideal for applications that require a wider numerical range than 16-bit integers but less precision than fp32, and it reduces memory space and bandwidth consumption.

    The CUDA SDK has been updated to include:

      A new pitchLinearTexture code sample that shows how to efficiently texture from pitch linear memory.

      A new PTXJIT code sample illustrating how to use cuModuleLoadDataEx() to load PTX source from memory instead of loading a file.

      Two new code samples for Windows, showing how to use the NVCUVID library to decode MPEG-2, VC-1, and H.264 content and pass frames to OpenGL or Direct3D for display.

      Updated code samples showing how to properly align CUDA kernel function parameters so the same code works on both x32 and x64 systems.

    The Visual Profiler (packaged separately) includes several enhancements:

      All memory transfer API calls are now reported.

      Support for profiling multiple contexts per GPU.

      Synchronized clocks for requested start time on the CPU and start/end times on the GPU for all kernel launches and memory transfers.

      Global memory load and store efficiency metrics for GPUs with compute capability 1.2 and higher.

    The CUDA Driver for MacOS is now packaged separately from the CUDA Toolkit.

    Support for major Linux distros, MacOS X, and Windows:

      Fedora 10, RHEL 4.7 & 5.3, SLED 10.2 & 11.0, OpenSUSE 11.1, and Ubuntu 8.10 & 9.04

      Windows XP/Vista/7 with Visual Studio 8 (VC2005) and 9 (VC2008)

      MacOS X 10.5.6 and later (32-bit)

#3 on that list is probably the most important for most people here (hi GTX 295 users)…

the feature I really like didn’t quite make it into the beta, so you have something to look forward to for release as well.

No Vista driver for the new toolkit?

It’s a nice change to have a current Ubuntu distribution supported while it’s still the current distribution. Things seem to be moving along pretty quickly now. Has the team gotten bigger?

Ugh, CUDA 2.3 won’t support Ubuntu 8.04LTS, will it? That is bad news for me.

Yes, and I complained a lot about the lack of 9.04 support in the original 2.3 plan… :)

sergeyn, there should be a new Vista driver as well. If it’s missing, I’ll try to find out over the weekend (but it probably won’t be resolved until Monday).

I don’t know what the status of 8.04 is in final, but I imagine it will be dropped.


That file won’t load for me. Says page cannot be displayed. Is it just me, or… ?


Great, thanks !!

Does that also mean texture read/writes are counted in the GB/s throughput??



Sounds very cool, waiting for the release version!

I’m looking forward to that fp16 conversion!

Is that fp16<->fp32 done in hardware (so it’s likely a one-clock op?) or is it a new library intrinsic that’s convenient but not much faster than if we had written our own functions?

I’ve submitted an application to be a registered developer. Hope it will be accepted before CUDA 2.3 is released officially :’(

Edited: Thank heavens, my wish has been heard. Will try it on Kubuntu 9.04 shortly.

That is correct. Ubuntu-8.04 will not be supported after CUDA_2.2.

It would be nice if support could be maintained for the LTS series - if there are no functionality/performance issues - at least until the next release.

Is there something missing in 8.04 (2.6.24/gcc 4.2.4) that would hold back CUDA? I don’t use X at all, so that is not an issue.

I agree with this comment. LTS releases are preferable on servers. It will be inconvenient to upgrade the OS on servers just for letting CUDA work.

Ubuntu 8.04 LTS will be supported until mid 2011, and some people do not upgrade OS before that time. I think CUDA team should still work on it until at least mid 2010 when Ubuntu 10.04LTS is released.

Yes, the hardware includes support for fp16 conversions, so the new intrinsics should be faster than using your own code.

Doesn’t work for me either, and neither do the 32-bit file or the Win7 drivers…

I have submitted a bug report.


Some questions:

What about CUDA interop with OpenGL texture objects? Is it coming in release 2.3?

Will CUDA 2.3 include a PTX version 1.5 codegen that is compatible with the OpenCL LLVM codegen, so that PTX sources compiled by either backend can be exchanged?

The new fp16 <-> fp32 conversion intrinsics, I suspect, are these:

extern __device__ unsigned short __float2half_rn(float);

extern __device__ float __half2float(unsigned short);

Quoting an NVIDIA employee

“Yes, the hardware includes support for fp16 conversions, so the new intrinsics should be faster than using your own code.”

Is that real hardware support, or is the following the CPU emulation path?

__device_func__(unsigned short __float2half_rn(float f))
{
  unsigned int x = __float_as_int (f);
  unsigned int u = (x & 0x7fffffff), remainder, shift, lsb, lsb_s1, lsb_m1;
  unsigned int sign, exponent, mantissa;

  /* Get rid of +NaN/-NaN case first. */
  if (u > 0x7f800000) {
    return 0x7fff;
  }

  sign = ((x >> 16) & 0x8000);

  /* Get rid of +Inf/-Inf, +0/-0. */
  if (u > 0x477fefff) {
    return sign | 0x7c00;
  }
  if (u < 0x33000001) {
    return sign | 0x0000;
  }

  exponent = ((u >> 23) & 0xff);
  mantissa = (u & 0x7fffff);

  if (exponent > 0x70) {
    shift = 13;
    exponent -= 0x70;
  } else {
    shift = 0x7e - exponent;
    exponent = 0;
    mantissa |= 0x800000;
  }
  lsb = (1 << shift);
  lsb_s1 = (lsb >> 1);
  lsb_m1 = (lsb - 1);

  /* Round to nearest even. */
  remainder = (mantissa & lsb_m1);
  mantissa >>= shift;
  if (remainder > lsb_s1 || (remainder == lsb_s1 && (mantissa & 0x1))) {
    ++mantissa;
    if (!(mantissa & 0x3ff)) {
      ++exponent;
      mantissa = 0;
    }
  }

  return sign | (exponent << 10) | mantissa;
}


__device_func__(float __half2float(unsigned short h))
{
  unsigned int sign = ((h >> 15) & 1);
  unsigned int exponent = ((h >> 10) & 0x1f);
  unsigned int mantissa = ((h & 0x3ff) << 13);

  if (exponent == 0x1f) { /* NaN or Inf */
    mantissa = (mantissa
                ? (sign = 0, 0x7fffff)
                : 0);
    exponent = 0xff;
  } else if (!exponent) { /* Denorm or Zero */
    if (mantissa) {
      unsigned int msb;
      exponent = 0x71;
      do {
        msb = (mantissa & 0x400000);
        mantissa <<= 1;  /* normalize */
        --exponent;
      } while (!msb);
      mantissa &= 0x7fffff;  /* 1.mantissa is implicit */
    }
  } else {
    exponent += 0x70;
  }

  return __int_as_float ((sign << 31) | (exponent << 23) | mantissa);
}


Looking further, I also found:

__device_func__(void __synchronous_start(int s))
{
  /* TODO */
}

__device_func__(void __synchronous_end(void))
{
  /* TODO */
}

What are __synchronous_start and __synchronous_end supposed to be?

I think they could be global sync points, which currently require separate kernel launches…

Will they be ready for 2.3?


1. Also, I don’t want to worry anyone, but what about CUDA multicore? Is it dead?

Related to NVCUVID, two wishes:

  1. Port the CUVID OpenGL sample to Linux as a CUDA video decoder sample using VDPAU…

  2. The next step seems to be an NVCUVENC (CUDA video encoder library) sample among the CUDA samples…

4. Also include a 64-bit CUDA library for MacOS


sigh, yeah, something broke with the Windows drivers. we’re working on it now.

I was wondering about this myself, just last night. I haven’t heard a mention of it at all in some time now. Though, I would think that someone (nVidia or otherwise) will eventually release a straight OpenCL -> CPU compiler, since some of the programming methods used in OpenCL could be translated to a multicore CPU + SIMD instructions (like a Pentium 4 with SSE2).

I assume that CUFFT has finally been updated to include vvolkov’s FFT code? What about CUBLAS? Are we going to see a new release of it for the 2.3 product cycle?