CUDA Toolkit and SDK 2.3 betas available to registered developers

tmurray · June 20, 2009, 6:17am

The CUDA Toolkit and SDK 2.3 betas are now available to registered developers. They include the following features:

- The CUFFT Library now supports double-precision transforms and includes significant performance improvements for single-precision transforms as well. See the CUDA Toolkit release notes for details.
- The CUDA-GDB hardware debugger is now available for all supported Linux platforms and is included in the CUDA Toolkit installer.
- GPUs in an SLI group are now enumerated individually, so you can achieve multi-GPU performance even when SLI is enabled for graphics.
- New support for fp16 <-> fp32 conversion intrinsics allows storage of data in fp16 format with computation in fp32. The fp16 format suits applications that need a wider numerical range than 16-bit integers but less precision than fp32, and it reduces memory footprint and bandwidth consumption.
- The CUDA SDK has been updated to include:
  - A new pitchLinearTexure code sample that shows how to efficiently texture from pitch linear memory.
  - A new PTXJIT code sample illustrating how to use cuModuleLoadDataEx() to load PTX source from memory instead of loading a file.
  - Two new code samples for Windows, showing how to use the NVCUVID library to decode MPEG-2, VC-1, and H.264 content and pass frames to OpenGL or Direct3D for display.
  - Updated code samples showing how to properly align CUDA kernel function parameters so the same code works on both 32-bit and 64-bit systems.
- The Visual Profiler (packaged separately) includes several enhancements:
  - All memory transfer API calls are now reported.
  - Support for profiling multiple contexts per GPU.
  - Synchronized clocks for requested start time on the CPU and start/end times on the GPU for all kernel launches and memory transfers.
  - Global memory load and store efficiency metrics for GPUs with compute capability 1.2 and higher.
- The CUDA Driver for MacOS is now packaged separately from the CUDA Toolkit.
- Support for major Linux distros, MacOS X, and Windows:
  - Fedora 10, RHEL 4.7 & 5.3, SLED 10.2 & 11.0, OpenSUSE 11.1, and Ubuntu 8.10 & 9.04
  - Windows XP/Vista/7 with Visual Studio 8 (VC2005) and 9 (VC2008)
  - MacOS X 10.5.6 and later (32-bit)

tmurray · June 20, 2009, 6:19am

#3 on that list is probably the most important for most people here (hi GTX 295 users)…

the feature I really like didn’t quite make it into the beta, so you have something to look forward to for release as well.

sergeyn · June 20, 2009, 6:28am

No Vista driver for the new toolkit?

StickGuy · June 20, 2009, 6:41am

It’s a nice change to have a current Ubuntu distribution supported while it’s still the current distribution. Things seem to be moving along pretty quickly now. Has the team gotten bigger?

cvnguyen · June 20, 2009, 7:00am

Ugh, CUDA 2.3 won’t support Ubuntu 8.04LTS, will it? That is bad news for me.

tmurray · June 20, 2009, 7:18am

Yes, and I complained a lot about the lack of 9.04 support in the original 2.3 plan… :)

sergeyn, there should be a new Vista driver as well. If it’s missing, I’ll try to find out over the weekend (but it probably won’t be resolved until Monday).

I don’t know what the status of 8.04 will be in the final release, but I imagine it will be dropped.

sergeyn · June 20, 2009, 7:21am

Thanks.

cbuchner1 · June 20, 2009, 10:44am

cudadriver_2.3-beta_winxp_64_190.15_general.zip

That file won’t load for me. Says page cannot be displayed. Is it just me, or… ?

Christian

eyalhir74 · June 20, 2009, 7:23pm

Great, thanks !!

Does that also mean texture read/writes are counted in the GB/s throughput??

thanks

eyal

_Big_Mac · June 20, 2009, 8:08pm

Sounds very cool, waiting for the release version!

SPWorley · June 21, 2009, 3:30am

I’m looking forward to that fp16 conversion!

Is that fp16<->fp32 done in hardware (so it’s likely a one-clock op?) or is it a new library intrinsic that’s convenient but not much faster than if we had written our own functions?

cvnguyen · June 21, 2009, 3:14pm

I’ve submitted an application to become a registered developer. I hope it is accepted before CUDA 2.3 is officially released :’(

Edited: Thank heavens, my wish has been heard. Will try it on Kubuntu 9.04 shortly.

netllama · June 21, 2009, 4:24pm

That is correct. Ubuntu-8.04 will not be supported after CUDA_2.2.

ldpaniak · June 21, 2009, 4:54pm

It would be nice if support could be maintained for the LTS series - if there are no functionality/performance issues - at least until the next release.

Is there something missing in 8.04 (2.6.24/gcc 4.2.4) that would hold back CUDA? I don’t use X at all, so that is not an issue.

cvnguyen · June 22, 2009, 2:02am

I agree with this comment. LTS releases are preferable on servers. It will be inconvenient to upgrade the OS on servers just to get CUDA working.

Ubuntu 8.04 LTS will be supported until mid-2011, and some people will not upgrade their OS before then. I think the CUDA team should keep supporting it at least until mid-2010, when Ubuntu 10.04 LTS is released.

Simon_Green · June 22, 2009, 11:46am

Yes, the hardware includes support for fp16 conversions, so the new intrinsics should be faster than using your own code.

elhefe38 · June 22, 2009, 12:53pm

It doesn’t work for me either, and neither do the 32-bit file or the Win7 drivers…

I have submitted a bug report.

a++

oscarb · June 22, 2009, 4:15pm

Some questions:

Is CUDA interop with OpenGL texture objects coming in release 2.3?

Will CUDA 2.3 include a PTX version 1.5 code generator that is compatible with the OpenCL LLVM code generator, so that PTX sources compiled by either backend can be exchanged?

“New support for fp16 ↔ fp32 conversion intrinsics”

I suspect these are:

extern __device__ unsigned short __float2half_rn(float);
extern __device__ float __half2float(unsigned short);

Quoting an NVIDIA employee

“Yes, the hardware includes support for fp16 conversions, so the new intrinsics should be faster than using your own code.”

Is that hardware support, or is this the CPU emulation path?

__device_func__(unsigned short __float2half_rn(float f))
{
    unsigned int x = __float_as_int (f);
    unsigned int u = (x & 0x7fffffff), remainder, shift, lsb, lsb_s1, lsb_m1;
    unsigned int sign, exponent, mantissa;

    /* Get rid of +NaN/-NaN case first. */
    if (u > 0x7f800000) {
        return 0x7fff;
    }

    sign = ((x >> 16) & 0x8000);

    /* Get rid of +Inf/-Inf, +0/-0. */
    if (u > 0x477fefff) {
        return sign | 0x7c00;
    }
    if (u < 0x33000001) {
        return sign | 0x0000;
    }

    exponent = ((u >> 23) & 0xff);
    mantissa = (u & 0x7fffff);

    if (exponent > 0x70) {
        shift = 13;
        exponent -= 0x70;
    } else {
        shift = 0x7e - exponent;
        exponent = 0;
        mantissa |= 0x800000;
    }
    lsb = (1 << shift);
    lsb_s1 = (lsb >> 1);
    lsb_m1 = (lsb - 1);

    /* Round to nearest even. */
    remainder = (mantissa & lsb_m1);
    mantissa >>= shift;
    if (remainder > lsb_s1 || (remainder == lsb_s1 && (mantissa & 0x1))) {
        ++mantissa;
        if (!(mantissa & 0x3ff)) {
            ++exponent;
            mantissa = 0;
        }
    }

    return sign | (exponent << 10) | mantissa;
}

__device_func__(float __half2float(unsigned short h))
{
    unsigned int sign = ((h >> 15) & 1);
    unsigned int exponent = ((h >> 10) & 0x1f);
    unsigned int mantissa = ((h & 0x3ff) << 13);

    if (exponent == 0x1f) { /* NaN or Inf */
        mantissa = (mantissa
                    ? (sign = 0, 0x7fffff)
                    : 0);
        exponent = 0xff;
    } else if (!exponent) { /* Denorm or Zero */
        if (mantissa) {
            unsigned int msb;
            exponent = 0x71;
            do {
                msb = (mantissa & 0x400000);
                mantissa <<= 1;  /* normalize */
                --exponent;
            } while (!msb);
            mantissa &= 0x7fffff;  /* 1.mantissa is implicit */
        }
    } else {
        exponent += 0x70;
    }

    return __int_as_float ((sign << 31) | (exponent << 23) | mantissa);
}

Looking further, I also found:

__device_func__(void __synchronous_start(int s))
{
    /* TODO */
}

__device_func__(void __synchronous_end(void))
{
    /* TODO */
}

What are __synchronous_start and __synchronous_end supposed to be?

I think they could be global sync points, which currently require separate kernel launches…

Will they be ready for 2.3?

Wishes:

1. I don’t want to worry anyone, but what about CUDA multicore? Is it dead?

Related to NVCUVID, two wishes:

2. Port the CUVID OpenGL sample to Linux as a CUDA video decoder sample using VDPAU…
3. The next step seems to be an NVCUVENC (CUDA video encoder library) sample among the CUDA samples…

4. Also include a 64-bit CUDA library for MacOS.

Thanks…

tmurray · June 22, 2009, 5:51pm

sigh, yeah, something broke with the Windows drivers. we’re working on it now.

jack · June 22, 2009, 6:01pm

I was wondering about this myself, just last night. I haven’t heard a mention of it at all in some time now. Though, I would think that someone (nVidia or otherwise) will eventually release a straight OpenCL → CPU compiler, since some of the programming methods used in OpenCL could be translated to a multicore CPU + SIMD instructions (like a Pentium 4 with SSE2).

I assume that CUFFT has finally been updated to include vvolkov’s FFT code? What about CUBLAS? Are we going to see a new release of it for the 2.3 product cycle?
