CUDA Toolkit 3.0 Beta now available for public download

The CUDA Toolkit 3.0 Beta is now available.

Highlights for this release include:

    CUDA Driver / Runtime Buffer Interoperability, which allows applications using the CUDA Driver API to also use libraries implemented using the CUDA C Runtime.

    A new, separate version of the CUDA C Runtime (CUDART) for debugging in emulation-mode.

    C++ Class Inheritance and Template Inheritance support for increased programmer productivity

    A new unified interoperability API for Direct3D and OpenGL, with support for:

      OpenGL texture interop

      Direct3D 11 interop

    cuda-gdb hardware debugging support for applications that use the CUDA Driver API

    New CUDA Memory Checker reports misalignment and out of bounds errors, available as a debugging mode within cuda-gdb and also as a stand-alone utility.

    CUDA Toolkit libraries are now versioned, enabling applications to require a specific version, support multiple versions explicitly, etc.

    CUDA C/C++ kernels are now compiled to standard ELF format

    Support for all the OpenCL features in the latest R195.39 beta driver:

      Double Precision

      OpenGL Interoperability, for interactive high performance visualization

      Query for Compute Capability, so you can target optimizations for GPU architectures (cl_nv_device_attribute_query)

      Ability to control compiler optimization settings, etc. via support for NVIDIA Compiler Flags (cl_nv_compiler_options)

      OpenCL Images support, for better/faster image filtering

      32-bit Atomics for fast, convenient data manipulation

      Byte Addressable Stores, for faster video/image processing and compression algorithms

      Support for the latest OpenCL spec revision 48 and latest official Khronos OpenCL headers as of 11/1/2009

    Early support for the Fermi architecture, including:

      Native 64-bit GPU support

      Multiple Copy Engine support

      ECC reporting

      Concurrent Kernel Execution

      Fermi HW debugging support in cuda-gdb

For more information on the general-purpose computing features of the Fermi architecture, see:

Windows developers should be sure to sign up for the Nexus (codename) beta program, and test drive the integrated support for GPU hardware debugging, profiling, and platform trace/analysis features at:

Please review the errata document for important notes about using this beta release.

Special Notice for Mac OS X Developers

    Use the cudadriver_3.0.0-beta1_macos.pkg driver with all NVIDIA GPUs except the Quadro FX 4800 and GeForce GTX 285 on Mac OS X 10.5.6 and later (pre-Snow Leopard).

    Use the cudadriver_3.0.1-beta1_macos.pkg driver with the Quadro FX 4800 and GeForce GTX 285 on Mac OS X 10.5.6 and later (pre-Snow Leopard).

    Use the cudadriver_3.0.1-beta1_macos.pkg driver with all NVIDIA GPUs on Mac OS X 10.6 Snow Leopard and later.


Getting Started - Linux

Getting Started - OS X

Getting Started - Windows

XP32 195.39

XP64 195.39

Vista/Win7 32 195.39

Vista/Win7 64 195.39

Notebook XP32 195.39

Notebook XP64 195.39

Notebook Vista/Win7 32 195.39

Notebook Vista/Win7 64 195.39

Linux 32 195.17

Linux 64 195.17

3.0.0 for Non-GT200 Leopard

3.0.1 for GT200 Leopard and Snow Leopard

CUDA Toolkit for Fedora 10 32-bit

CUDA Toolkit for RHEL 4.8 32-bit

CUDA Toolkit for RHEL 5.3 32-bit

CUDA Toolkit for SLED 11.0 32-bit

CUDA Toolkit for SuSE 11.1 32-bit

CUDA Toolkit for Ubuntu 9.04 32-bit

CUDA Toolkit for Fedora 10 64-bit

CUDA Toolkit for RHEL 4.8 64-bit

CUDA Toolkit for RHEL 5.3 64-bit

CUDA Toolkit for SLED 11.0 64-bit

CUDA Toolkit for SuSE 11.1 64-bit

CUDA Toolkit for Ubuntu 9.04 64-bit

CUDA Toolkit for OS X

CUDA Toolkit for Windows 32-bit

CUDA Toolkit for Windows 64-bit

CUDA Profiler 3.0 Beta Readme

CUDA Profiler 3.0 Beta Release Notes for Linux

CUDA Profiler 3.0 Beta Release Notes for OS X

CUDA Profiler 3.0 Beta Release Notes for Windows



CUDA-GDB User Manual

CUDA Reference Manual

CUDA Toolkit Release Notes for Linux

CUDA Toolkit Release Notes for OS X

CUDA Toolkit Release Notes for Windows

CUDA Programming Guide

CUDA Best Practices Guide

Online Documentation

GPU Computing SDK for Linux

GPU Computing SDK for OS X

GPU Computing SDK for Win32

GPU Computing SDK for Win64

CUDA SDK Release Notes

DirectCompute Release Notes

OpenCL Release Notes

GPU Computing EULA

Documentation updates:

Fermi Compatibility Guide
Fermi Tuning Guide
Preview: CUDA Programming Guide for CUDA Toolkit 3.0

Is it possible to have the “CUDA 3 Beta Programming Guide” available for separate download?

This would provide a way for me to learn more about the release without downloading/installing the entire SDK.

It just bluescreened when I tried to run my code under the profiler. I have a minidump if anyone from NVIDIA would like to take a look.

The toolkit beta teases you with new docs labelled “CUDA 3 Beta Programming Guide” but they are just the 2.3 docs.

That’s the first thing I wanted to look at!

The best practices guide, nvcc docs, and PTX spec are also all 2.3 versions.

I checked both Linux and Windows.

There are new 3.0 CUBLAS and CUFFT beta docs, though.

Tim, are we allowed to openly discuss the 3.0 toolkit beta here? The rules were relaxed for the 2.3 beta, and it was nice to have forum discussion of it.

There are some promising new features in 3.0, even ignoring the Fermi support!

yeah, feel free to discuss as per usual.

I’ll investigate the documentation packaging issue tomorrow…

Nice! But my code now runs 3 times slower than with the 2.3 SDK! The generated PTX code is the same, so is it the driver or the toolkit? I know this is a beta release.

I downloaded the driver, toolkit, and SDK last night (openSUSE 11.1, 64-bit), compiled them, and ran first tests. nbody and my own test examples (sgemm/dgemm and others) run just as fast as before. The setup is 2x GTX 260. I have one code which runs on dual GPUs using data partitioning and QThreads (from PyQt/Qt), and it does fine. Since my dual-GPU code reports both the total time spent in the kernel and the elapsed time for the whole thread (data is already on the GPU), I did notice that the CUDA runtime initialization is much faster. It used to add an extra 0.6 to 1.5 seconds(!) on top of the pure kernel run time. The two times are now almost the same:

[codebox]+++< GPU count: 2 GPUs >+++

All GPUs in parallel - one thread per GPU

GPU-0 : GeForce GTX 260 (1.3) - sum : 0.689 sec [404.1 GFlops]

GPU-1 : GeForce GTX 260 (1.3) - sum : 0.692 sec [402.8 GFlops]

GPU-0 : GeForce GTX 260 (1.3) - run : 2.105 sec [132.3 GFlops]

GPU-1 : GeForce GTX 260 (1.3) - run : 2.108 sec [132.1 GFlops]

All GPUs - run : 2.185 sec [254.9 GFlops]

Last two elements GPU: 1.019e-03 3.220e-04

+++< CPU count: 2 CPUs >+++

All CPUs in parallel - one thread per CPU

Thread-0 - sum : 708.240 sec [ 3.4 GFlops]

Thread-0 - run : 708.271 sec [ 3.4 GFlops]

Thread-1 - sum : 717.896 sec [ 3.4 GFlops]

Thread-1 - run : 717.946 sec [ 3.4 GFlops]

All CPU - run : 717.947 sec [ 6.8 GFlops]

Last two elements CPU: 1.019e-03 3.220e-04

Speedup: 328.5

max error at : 3929812 12 49 98 6.52e-01 6.53e-01 2.93e-04

L2 rel error : 2.5e-04

max rel error : 4.5e-04
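The per-thread reporting pattern above (kernel time as "sum", whole-thread elapsed time as "run") can be sketched in plain Python with one thread per GPU. Real GPU work and runtime initialization are replaced by time.sleep() stand-ins, so the numbers are purely illustrative; the gap between "run" and "sum" is where runtime/context initialization overhead shows up:

```python
import threading
import time

# Results collected per "GPU"; the GIL and distinct keys make this safe.
results = {}

def worker(gpu_id, kernel_seconds, init_seconds):
    t_run = time.perf_counter()
    time.sleep(init_seconds)       # stand-in for CUDA runtime/context init
    t_sum = time.perf_counter()
    time.sleep(kernel_seconds)     # stand-in for the kernel itself
    now = time.perf_counter()
    results[gpu_id] = {"sum": now - t_sum,   # pure "kernel" time
                       "run": now - t_run}   # whole-thread elapsed time

# One thread per GPU, launched in parallel, as in the post.
threads = [threading.Thread(target=worker, args=(i, 0.05, 0.01))
           for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

for gpu_id, r in sorted(results.items()):
    print(f"GPU-{gpu_id} - sum : {r['sum']:.3f} sec, run : {r['run']:.3f} sec")
```

The original observation is that "run" minus "sum" shrank from 0.6–1.5 s with the old runtime to almost nothing with 3.0 beta.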


The speedups are based on times, not flops values. Flops are defined differently on the CPU and GPU because the computation involves trig functions (one trig call is counted as 4 flops on the GPU and 140 flops on the CPU). It is now (no CPU part, as I do not want to wait 12 minutes for the CPU):

[codebox]+++< GPU count: 2 GPUs >+++

All GPUs in parallel - one thread per GPU

GPU-0 : GeForce GTX 260 (1.3) - sum : 0.698 sec [399.2 GFlops]

GPU-0 : GeForce GTX 260 (1.3) - run : 0.841 sec [331.2 GFlops]

GPU-1 : GeForce GTX 260 (1.3) - sum : 0.700 sec [397.8 GFlops]

GPU-1 : GeForce GTX 260 (1.3) - run : 0.844 sec [329.9 GFlops]

All GPUs - run : 1.026 sec [543.1 GFlops]

Last two elements GPU: 1.019e-03 3.220e-04
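For reference, GFlops figures like these are reconstructed from weighted op counts and wall-clock times. A minimal sketch, where the per-trig weights (4 on the GPU, 140 on the CPU) come from the description above but the operation counts are made-up illustrative numbers, not the actual workload:

```python
def gflops(plain_ops, trig_ops, seconds, flops_per_trig):
    """Weighted flop count divided by elapsed time, in GFlops."""
    total_flops = plain_ops + trig_ops * flops_per_trig
    return total_flops / seconds / 1e9

# Hypothetical workload: 2e11 plain flops plus 2e10 trig calls.
plain_ops, trig_ops = 2.0e11, 2.0e10

gpu = gflops(plain_ops, trig_ops, seconds=0.698, flops_per_trig=4)
cpu = gflops(plain_ops, trig_ops, seconds=708.240, flops_per_trig=140)

# The speedup uses times only ("All CPUs run" / "All GPUs run" from the
# first listing), so the different flop weightings drop out.
speedup = 717.947 / 2.185

print(f"GPU: {gpu:.1f} GFlops, CPU: {cpu:.1f} GFlops, "
      f"speedup: {speedup:.1f}x")
```

With the real op counts, the same arithmetic yields the ~400 GFlops per GPU and the 328.5x speedup reported above.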


Now that OpenCL is included with the driver, I have an issue with it. Last week I wrote ctypes-based Python bindings for OpenCL, as I did for CUDA itself previously (python-cuda). With driver 190.29 a device query gave me




I now get (I changed the format to hex to see what's going on):


CL_DEVICE_MAX_WORK_ITEM_SIZES : 0x200 0x7F6800000200 0x7F6800000040


CL_DEVICE_MAX_WORK_ITEM_SIZES is queried by passing a pointer to size_t[3], i.e. a ctypes array of three 64-bit longs. The first element is correct; the second and third contain "garbage" (which looks like a device address?) in the high-order bits and the correct value in the low 32 bits. size_t is 64-bit here, and passing a pointer to an array of three 32-bit ints just gives all zeros, so 64-bit, as per the OpenCL spec, should be correct.
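The effect can be reproduced without touching the driver at all. This minimal ctypes sketch (no OpenCL calls; little-endian x86 assumed) simulates what the returned buffer looks like if the library writes 32-bit values into a size_t[3] buffer and leaves the high halves of each slot untouched:

```python
import ctypes

# Caller-side buffer, as the OpenCL spec mandates: size_t[3] (64-bit here).
sizes = (ctypes.c_uint64 * 3)()

# Pretend the buffer starts out with leftover "garbage" in memory; the
# first slot just happens to be clean.
sizes[0] = 0
sizes[1] = 0x7F68_0000_0000
sizes[2] = 0x7F68_0000_0000

# Simulated buggy fill: store each value as a 32-bit write, covering only
# the low half of every 64-bit slot (index 2*i on little-endian).
correct = [0x200, 0x200, 0x40]
as_u32 = ctypes.cast(sizes, ctypes.POINTER(ctypes.c_uint32))
for i, v in enumerate(correct):
    as_u32[2 * i] = v

for s in sizes:
    # High-order bits keep their garbage; low 32 bits hold the real value.
    print(hex(s), "low 32 bits:", hex(s & 0xFFFFFFFF))
```

This reproduces the observed 0x200 0x7F6800000200 0x7F6800000040 output exactly, which would be consistent with the driver storing 32-bit values into what the caller declares, per the spec, as 64-bit size_t slots.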

What happened to OpenCL between 190.29 and 195.17? In addition, I get a double-free error thrown by glibc on one machine and a segmentation fault on another, after the code has run successfully. Just to be clear: I am calling OpenCL from Python. The vector_add example and a simple bandwidth test work just fine, and all data returned by the device query is basically correct.

The segmentation fault was apparently caused by a bug in my code; after a rewrite it goes away. However, the MAX_WORK_ITEM_SIZES issue still exists. The C++ SDK sample gives the correct result, and Python::OpenCL (I don't recall its exact name right now), based on Cython, also gives a correct answer. PyOpenCL, based on Boost.Python, also gets one value wrong (the last dimension). Despite cl_khr_fp64 being available, both my ctypes OpenCL bindings and PyOpenCL report the preferred vector width for double as 0, while the SDK reports 1.

As far as speed is concerned, I see no slow-down with 3.0b1, tested on a 9650M GT, dual GTX 260, and dual GTX 280. I do notice that X11 windows seem to pop up significantly faster than with the previous driver version.

I wonder why the 'sdk' folder within the CUDA SDK is an exact copy of the SDK itself!? Every example (CUDA and OpenCL), library, etc. exists twice.


I get 1403 "unused parameter X" warnings from nvcc when compiling my programs with "--compiler-options -Wall,-Wextra" in the following header files:









There are a lot of "declared 'static' but never defined" warnings, too. I'm using gcc 4.3.2 on a 64-bit Linux machine. In spite of the warnings everything seems to work fine, but in fact a little slower than with 2.3.

Would love a repro case…

I've tried the 190.42 and 195.17 beta drivers on Ubuntu 9.10 64-bit, using CUDA SDK 2.3 and the 3.0 beta, with gcc 4.3.
I'm using two GTX 285 devices, and my code is set to use both (SLI is off). I also use a third card (an 8400 GS) for display, not for CUDA.

190.42 + SDK 2.3 = 13 seconds
195.17 + SDK 2.3 or 195.17 + SDK 3.0 beta = 34 seconds!

I've checked that the 8400 is not being used at any time.

So I suppose it's a driver problem (I really hope so).
However, the nbody demo performs better with 195.17 + SDK 3.0 (up to 500 GFLOPS), but smokeParticles also shows degraded performance :(

I have gcc 4.4. With the 2.3 CUDA SDK/toolkit I use the '--compiler-bindir' option to choose gcc 4.3.
With nvcc in the 3.0 beta, this option seems to be parsed incorrectly:
with "--compiler-bindir=/usr/bin/gcc-4.3"
I get the error: unsupported compiler '/usr/bin/gcc-4'

My solution (a hack) is to unlink all the /usr/bin/{gcc,g++,cpp,...} links that point to 4.4 and make links to 4.3 instead.

How do I become a registered developer?

Sign up as a “GPU Computing Developer” here:…er_program.html

As reported in this thread, I was having some problems with CUDA and Windows 7:

I installed the new 3.0 beta drivers, toolkit, and SDK and tried running some of the examples, and I'm still having the same problem (kernels take several seconds before executing, and the entire system freezes during that time).

I went into the NVIDIA Control Panel, disabled my 2nd, 3rd, and 4th monitors, and enabled multi-GPU acceleration; now the examples run just fine, but when I run the deviceQueryDrv example, it only shows a single device. Since I'm not running displays on the other three GPUs (I have 2x GTX 295s), why don't they show up? Also, the device query on the device that does show up says there is no time limit on kernel execution.

EDIT: Does anyone know if the PTX version will increase to version 1.5 for this release of the CUDA driver? The 3.0-beta toolkit includes the PTX 1.4 specification.

In my experience those are caused by gcc when it is invoked from nvcc. You should be able to silence them by telling gcc that those are system directories, so it does not warn about issues inside those files. (By the way, this is a lot easier if you use CMake to drive the compilation.)

I noticed there is now a --multicore (and even a --multicore-llvm) switch in the compiler; however, the headers disable compilation if this switch is used with any compiler but MSVC. Is multicore support on Linux planned for the 3.0 final release?

Are there any problems with gcc 4.4 (from Ubuntu 9.10) and the CUDA SDK 3.0 beta?

If not, I would like to register as a developer. Can I do this as a hobby programmer, just for simple tests?

They do ask for company size, job position, area of work and such things. You can always specify a company size of “1” ;)


This is no way to get beta feedback.

I do not want to report all these things just to test a beta version.