CUDA Toolkit and SDK v2.2 released

The CUDA Toolkit and SDK v2.2 is now released and available to all developers.

A brief overview of the features (there are a lot):

  • Officially adds support for Windows 7, Server 2003, Server 2008, Ubuntu 8.10, RHEL 5.3, and Fedora 10
  • Includes cuda-gdb (a hardware GPU debugger) for RHEL 5 32-bit and 64-bit (officially supported and tested), though it may work on more platforms than just those
  • Exclusive device mode in Linux: set some GPUs as exclusive-compute (each can own only a single CUDA context) and some as non-compute (no CUDA contexts allowed) for easier management of clusters/MPI applications. See the nvidia-smi manpage for how to set this, and cudaSetValidDevices in the reference manual for how best to use it from CUDART (a sketch follows this list).
  • Zero-copy support: read certain system memory directly from a kernel on GT200 or MCP79 systems, with no explicit memcpy. See this post for more information on how it works; a second sketch follows this list.
  • Asynchronous memcpy support on Vista/Server 2008/Win7
  • Texture from pitch-linear memory: use this to avoid an additional memcpy in some scenarios.
  • 4GB of pinned memory in a single allocation on most OSes
  • Maximum pinned memory per allocation increased to ~1.5GB on Vista
  • Pinned memory can be shared between contexts
  • Multi-device OpenGL interop performance between a Quadro display card and a separate compute card is dramatically improved.
  • Visual Profiler works on Vista
  • Visual Profiler supports additional counters for GT200 to measure number of memory transactions of a given size, instruction throughput, etc.
  • Blocking sync support for all platforms: allows the host thread to sleep and be awoken by the driver when the GPU operation it is waiting on completes.
  • Quite a few additional math functions added due to forum requests (feel free to keep posting requests, we do pay attention)
  • __threadfence(): ensures that a thread’s pending memory writes are visible to all threads before the calling thread continues. It is explicitly not a global sync, despite how it may appear.
  • Lots of bugfixes, of course; most importantly, killing a CUDA app should behave much, much better than it ever has before, especially when you’re on a dedicated compute card
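
For the exclusive device mode item above, here is a minimal sketch of how a per-node process (an MPI rank, say) might pick up a free exclusive-compute GPU via cudaSetValidDevices. The device ordering and array size here are illustrative only, not prescriptive:

[codebox]#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);

    /* Build a preference-ordered list of device IDs (here just 0..count-1). */
    int devices[16];
    int n = (count < 16) ? count : 16;
    for (int i = 0; i < n; ++i)
        devices[i] = i;

    /* Tell CUDART it may only create its context on one of these devices.
       With exclusive-compute mode set via nvidia-smi, each process ends up
       bound to a free device instead of all piling onto device 0. */
    cudaSetValidDevices(devices, n);

    /* The first call that needs a context binds this process to a device. */
    cudaFree(0);

    int dev = -1;
    cudaGetDevice(&dev);
    printf("This process is using device %d\n", dev);
    return 0;
}[/codebox]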
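And for zero-copy, a minimal sketch of the mapped pinned-memory flow (assuming a GT200 or MCP79 part; the kernel and variable names are made up for illustration):

[codebox]#include <cuda_runtime.h>
#include <stdio.h>

/* Illustrative kernel: operates directly on mapped host memory. */
__global__ void scale(float *data, float factor, int n)
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    if (idx < n)
        data[idx] *= factor;
}

int main(void)
{
    const int n = 1 << 20;

    /* Mapping must be enabled before the CUDA context is created. */
    cudaSetDeviceFlags(cudaDeviceMapHost);

    /* Pinned, mapped host allocation; the GPU can read/write it directly. */
    float *h_data = NULL;
    cudaHostAlloc((void **)&h_data, n*sizeof(float), cudaHostAllocMapped);
    for (int i = 0; i < n; ++i)
        h_data[i] = (float)i;

    /* Device-side alias of the same memory; no cudaMemcpy anywhere. */
    float *d_data = NULL;
    cudaHostGetDevicePointer((void **)&d_data, h_data, 0);

    scale<<<(n + 255)/256, 256>>>(d_data, 2.0f, n);
    cudaThreadSynchronize();   /* results are visible in h_data afterwards */

    printf("h_data[10] = %f\n", h_data[10]);
    cudaFreeHost(h_data);
    return 0;
}[/codebox]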

Downloads

Driver
185.85 for WinXP 32 Desktops/Workstations
185.85 for WinXP 32 Notebooks
185.85 for WinXP 64 Desktops/Workstations
185.85 for WinXP 64 Notebooks
185.85 for Vista 32 Desktops/Workstations
185.85 for Vista 32 Notebooks
185.85 for Vista 64 Desktops/Workstations
185.85 for Vista 64 Notebooks
185.85 for Win7 32 Desktops/Workstations
185.85 for Win7 32 Notebooks
185.85 for Win7 64 Desktops/Workstations
185.85 for Win7 64 Notebooks

185.18.08 for 32-bit Linux
185.18.08 for 64-bit Linux

Toolkit
CUDA Toolkit 2.2 for Fedora 10 32-bit
CUDA Toolkit 2.2 for Fedora 9 32-bit
CUDA Toolkit 2.2 for RHEL 4.7 32-bit
CUDA Toolkit 2.2 for RHEL 5.3 32-bit
CUDA Toolkit 2.2 for SLED 10 SP2 32-bit
CUDA Toolkit 2.2 for SUSE 11 32-bit
CUDA Toolkit 2.2 for SUSE 11.1 32-bit
CUDA Toolkit 2.2 for Ubuntu 8.04 32-bit
CUDA Toolkit 2.2 for Ubuntu 8.10 32-bit

CUDA Toolkit 2.2 for Fedora 10 64-bit
CUDA Toolkit 2.2 for Fedora 9 64-bit
CUDA Toolkit 2.2 for RHEL 4.7 64-bit
CUDA Toolkit 2.2 for RHEL 5.3 64-bit
CUDA Toolkit 2.2 for SLED 10 SP2 64-bit
CUDA Toolkit 2.2 for SUSE 11 64-bit
CUDA Toolkit 2.2 for SUSE 11.1 64-bit
CUDA Toolkit 2.2 for Ubuntu 8.04 64-bit
CUDA Toolkit 2.2 for Ubuntu 8.10 64-bit

CUDA Toolkit 2.2 for MacOS 10.5

CUDA Toolkit 2.2 for 32-bit Windows
CUDA Toolkit 2.2 for 64-bit Windows

Debugger and Profiler
(Note that the Visual Profiler is included with the toolkit on all platforms except Mac OS.)
CUDA-GDB for RHEL 5.3 32-bit
CUDA-GDB for RHEL 5.3 64-bit
CUDA Visual Profiler for Mac OS 10.5

SDK
CUDA SDK 2.2 for Linux
CUDA SDK 2.2 for Mac
CUDA SDK 2.2 for 32-bit Windows
CUDA SDK 2.2 for 64-bit Windows

Documentation
CUDA-GDB 2.2 User’s Manual
CUDA 2.2: Getting Started in Linux
CUDA 2.2: Getting Started in Mac OS
CUDA 2.2: Getting Started in Windows
CUDA 2.2 API Reference Manual (now with links!)
CUDA 2.2 Programming Guide
CUDA Visual Profiler 1.2 Readme
CUDA Visual Profiler EULA
CUDA Toolkit 2.2 EULA
CUDA 2.2 Release Notes for Linux
CUDA 2.2 Release Notes for Mac OS
CUDA 2.2 Release Notes for Windows

Errata

/MD CUDART
We have a prerelease version of a Windows CUDART for VS2005 (cudart_md_vc8.zip) and VS2008 (cudart_md.zip) compiled with /MD, so you can probably use it from managed-C++ apps and .NET and all of those things that I know very little about.

We haven’t tested it extremely thoroughly, so it may have some slightly weird behaviors (although we encountered zero problems with it). We’re considering moving away from /MT DLLs and moving exclusively to /MD, so if you feel strongly about that either way you should probably let us know as well (and test this too).

Instructions on how to set it up are in the zip file (nvcc.profile now contains the correct paths).

Missing transpose white paper
The white paper for the transposeNew SDK sample was left out of the Windows SDK packages. Until we can update the SDK package to fix this, we’ve posted the white paper on the forum in this thread.

Erroneous release note on the Visual Profiler under Vista
The release notes still claim that the Visual Profiler does not function under Vista. This is a mistake: the Visual Profiler does work under Vista as of 2.2.

Exclusive mode settings keep being reset
This happens when you aren’t running X. The workaround for now is to leave nvidia-smi running in the background:

nvidia-smi --loop-continuously --interval=60 --filename=/var/log/nvidia-smi.log &

or something to that effect. Once you do that, you can set exclusive mode, and the settings will persist so long as nvidia-smi remains running in the background.

Wow! You’re moving fast.

I prefer the /MT DLLs since that’s how I tend to build all of my apps. I think having both available would be best.

Thanks for making this release available!

Peter

I do like /MT better as well. Other than some reasons I mentioned in a different thread, I’d have to go into a gazillion custom build steps and project properties pages to change all /MT into /MD.

Alex

The problem with /MT is that its functionality is a subset of what /MD gives you.

We might do both since we’re looking at other CUDART improvements in the next version, but we’d really rather not…

Quick question: if we update to this driver now, will it be compatible with code built with CUDA 2.0 and CUDA 2.1? We’d like to do a phased update, and updating the driver (on Linux, btw) would be the least stressful. Thanks!

Yes, you can use an old toolkit with a new driver, but not vice versa.

Hear hear,

just in time, ready to clone 20 PCs tomorrow!

Tim:

Congrats to you and everyone on the NVIDIA team. 2.2 is a real advancement… for the zero-copy memory feature alone!

Please make sure the driver/toolkit/SDK/developer team knows how pleased we are with this great progress!

I managed to nearly double my throughput with the new driver (still using the 2.1 toolchain). Block configurations that used to return immediately now actually do their work, and on time, meeting RT deadlines as well.

Hugs and kisses to everyone involved!

Ack, the appendices are still not in the PDF table of contents for the programming guide!

The links in the reference manual are a huge and very welcome improvement.

I haven’t had a chance to test it yet, but if:

is true, then that's cause for celebration on my end!

Thanks for the next DOT ONE

Hi,

is it just me, or is the loop unroller in this new version somewhat broken?

In code that worked fine (including the unrolling) in 2.1, I now get errors such as:

Signal: Segmentation fault in Global Optimization – MainOpt emitter phase.
(0): Error: Signal Segmentation fault in phase Global Optimization – MainOpt emitter – processing aborted
*** Internal stack backtrace:
/usr/local/cuda/open64/lib//be [0x6a350f]
/usr/local/cuda/open64/lib//be [0x6a4159]
/usr/local/cuda/open64/lib//be [0x6a38ad]
/usr/local/cuda/open64/lib//be [0x6a4af6]
/lib64/libc.so.6 [0x31f40322a0]
/usr/local/cuda/open64/lib//be [0x41bd72]
/usr/local/cuda/open64/lib//be [0x4fe153]
/usr/local/cuda/open64/lib//be [0x4fe2e5]
/usr/local/cuda/open64/lib//be [0x4fe330]
/usr/local/cuda/open64/lib//be [0x420711]
/usr/local/cuda/open64/lib//be [0x47821d]
/usr/local/cuda/open64/lib//be [0x4043a2]
/usr/local/cuda/open64/lib//be [0x40502e]
/usr/local/cuda/open64/lib//be [0x406081]
/usr/local/cuda/open64/lib//be [0x4073ad]
/lib64/libc.so.6(__libc_start_main+0xfa) [0x31f401e32a]
/usr/local/cuda/open64/lib//be [0x4037ea]
nvopencc INTERNAL ERROR: /usr/local/cuda/open64/lib//be died due to signal 4

Is this a bug or a feature?

Pretty sure all compiler crashes would be a bug. Can you post a repro?

This is Fedora 9, x86-64, CUDA Toolkit 2.2. I stripped down the kernel so as to try and isolate the offending functions. Attached is code that crashes nvcc as pasted above. Disregard the actual semantic meaning of the code, as it was lost while stripping it down. The default nvcc parameters (i.e. no special switches) seem to be enough to crash it.
test.cu.tar.gz (641 Bytes)

I’m having problems with KDE4 on the new driver. I get the following error:

/usr/bin/Xorg: double free or corruption (fasttop): 0x0000000000eb0ad0

Any help would be much appreciated!

Hi all,

it seems there is a bug in cudaHostGetDevicePointer. I allocate a single memory area using cudaHostAlloc and then try to use it as a memory pool. However, when I pass in a host pointer that is offset into the pool, cudaHostGetDevicePointer returns a device pointer to the first element of the pool instead of to the appropriate offset.

[codebox]#include <stdio.h>
#include <stdlib.h>
#include <cutil_inline.h>
#include <cuda.h>
#include <pthread.h>

#define VECTOR_SIZE 1048576

/* Doubles each element of a in place; b and c are unused leftovers. */
__global__ void SimpleKernel(float *a, float *b, float *c, int N)
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    if (idx < N)
        a[idx] = a[idx] + a[idx];
}

/* Works on the first half of the pool (pool points to the base). */
void* Function1(void* pool)
{
    float *d_a;
    dim3 block(256);
    dim3 grid(256);

    cutilSafeCall(cudaHostGetDevicePointer((void **)&d_a, (void *)pool, 0));
    SimpleKernel<<<grid, block>>>(d_a, NULL, NULL, VECTOR_SIZE);
    cutilCheckMsg("Execution in thread 1 has failed\n");
    cudaThreadSynchronize();
    return NULL;
}

/* Should work on the second half of the pool (pool points VECTOR_SIZE
   elements into the allocation), but the device pointer that comes back
   aliases the base of the allocation instead. */
void* Function2(void* pool)
{
    float *d_a;
    dim3 block(256);
    dim3 grid(256);

    cutilSafeCall(cudaHostGetDevicePointer((void **)&d_a, (void *)((float *)pool), 0));
    /* Uncomment this line to get the desired behaviour: */
    /* d_a = d_a + VECTOR_SIZE; */
    SimpleKernel<<<grid, block>>>(d_a, NULL, NULL, VECTOR_SIZE);
    cudaThreadSynchronize();
    cutilCheckMsg("Execution in thread 2 has failed\n");
    return NULL;
}

int main(int argc, char **argv)
{
    int n, nelem;
    unsigned int flags;
    size_t bytes;
    float *a;   /* pinned, mapped memory allocated on the CPU */
    float *b;   /* pointer halfway into the same allocation   */

    cutilSafeCall(cudaSetDevice(0));
    cutilSafeCall(cudaSetDeviceFlags(cudaDeviceMapHost));

    /* Allocate mapped CPU memory (two vectors' worth). */
    nelem = VECTOR_SIZE;
    bytes = nelem*sizeof(float);
    printf("allocating %llu bytes\n", (unsigned long long)(2*bytes));
    flags = cudaHostAllocMapped | cudaHostAllocPortable;
    cutilSafeCall(cudaHostAlloc((void **)&a, 2*bytes, flags));

    /* Initialize both vectors. */
    for (n = 0; n < 2*nelem; n++)
        a[n] = n;

    printf("values %f %f\n", a[10], a[1048590]);
    b = a + VECTOR_SIZE;

    Function1(a);
    printf("values %f %f\n", a[10], a[1048590]);

    Function2(b);
    printf("Releasing CPU memory...\n");
    printf("values %f %f\n", a[10], a[1048590]);

    cutilSafeCall(cudaFreeHost(a));
    cudaThreadExit();
    cutilExit(argc, argv);
}[/codebox]

The results I get are:

values 10.000000 1048590.000000
values 20.000000 1048590.000000
values 40.000000 1048590.000000

If I manually set the offset to the device pointer, I get the correct results:

values 10.000000 1048590.000000
values 20.000000 1048590.000000
values 20.000000 2097180.000000

Am I missing something?

thank you,

calin

Sounds like a bug (actually it sounds like undefined/poorly defined behavior). I will ping the appropriate people to see if/how/when this works.
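
For reference, a minimal sketch of the workaround the commented-out line in the post above already hints at: query the device pointer once for the base of the pool, then apply the element offset to the device pointer yourself rather than passing an offset host pointer. This fragment reuses SimpleKernel and VECTOR_SIZE from the code above and assumes the base mapping is valid:

[codebox]/* Workaround sketch: map the base of the cudaHostAlloc'd pool once, then
   offset the returned device pointer instead of offsetting the host pointer. */
void RunBothHalves(float *a)   /* a = base pointer returned by cudaHostAlloc */
{
    float *d_base = NULL;
    dim3 block(256);
    dim3 grid(256);

    cutilSafeCall(cudaHostGetDevicePointer((void **)&d_base, (void *)a, 0));

    SimpleKernel<<<grid, block>>>(d_base,               NULL, NULL, VECTOR_SIZE);
    SimpleKernel<<<grid, block>>>(d_base + VECTOR_SIZE, NULL, NULL, VECTOR_SIZE);
    cudaThreadSynchronize();
}[/codebox]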