CUDA Toolkit 3.2 release candidate available to registered developers

As the topic says, it’s now available to registered devs, and it adds a lot of stuff. Things I like (well, things I did a lot of work for, which means I can remember them…):

  • TCC is a per-device property of Tesla cards now, so you can run TCC cards alongside WDDM cards. And there's one other TCC-related thing I can't talk about yet, I think…
  • cuStreamWaitEvent does exactly what it sounds like: inter-stream synchronization without using CPU resources (see the sketch after this list)
  • the driver API has been reworked in major ways to support 64-bit addressing on devices

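For the curious, here's a minimal sketch of that inter-stream sync, written against the runtime-API twin cudaStreamWaitEvent (cuStreamWaitEvent is the driver-API form); the kernels and sizes below are just placeholders:

    __global__ void produce(float* buf) { buf[threadIdx.x] = (float)threadIdx.x; }
    __global__ void consume(float* buf) { buf[threadIdx.x] *= 2.0f; }

    int main()
    {
        float* buf;
        cudaMalloc((void**)&buf, 256 * sizeof(float));

        cudaStream_t producer, consumer;
        cudaEvent_t done;
        cudaStreamCreate(&producer);
        cudaStreamCreate(&consumer);
        cudaEventCreate(&done);

        // Queue work in the producer stream, then an event marking its completion.
        produce<<<1, 256, 0, producer>>>(buf);
        cudaEventRecord(done, producer);

        // The consumer stream waits for the event on the GPU itself;
        // the CPU never blocks or polls to enforce the ordering.
        cudaStreamWaitEvent(consumer, done, 0);
        consume<<<1, 256, 0, consumer>>>(buf);

        cudaThreadSynchronize();
        cudaFree(buf);
        return 0;
    }
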
There are a ton of other new things, too. You should probably give it a try! Feel free to tell me that I ruined everything at GTC :)

A couple more highlights from the e-mail:

Cool, can’t wait to try this one out.

Whoah, a LOT of people will be happy about that.

You mean that cuda-memcheck didn’t work before on Fermi? Hmm…

TCC debugging in Nsight was the other TCC feature I referred to. It’s all pretty fancy.

I assume the malloc() and free() support on the device in CUDA 3.2 is the main prerequisite for C++ new and delete support in a future (maybe even next??) release?
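
For reference, device-side malloc()/free() in 3.2 looks roughly like this (a minimal sketch; the kernel and sizes are made up, and it needs a Fermi-class device built with -arch=sm_20):

    // Each thread grabs a small scratch buffer from the device heap.
    __global__ void scratch(int* out)
    {
        int* buf = (int*)malloc(16 * sizeof(int));
        if (!buf) return;                     // the device heap can run out
        for (int i = 0; i < 16; ++i)
            buf[i] = threadIdx.x + i;
        out[threadIdx.x] = buf[15];
        free(buf);
    }

    int main()
    {
        // Optionally grow the device heap before the first launch.
        cudaThreadSetLimit(cudaLimitMallocHeapSize, 8 * 1024 * 1024);

        int* out;
        cudaMalloc((void**)&out, 32 * sizeof(int));
        scratch<<<1, 32>>>(out);
        cudaThreadSynchronize();
        cudaFree(out);
        return 0;
    }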

CUSPARSE has my attention…

Two other little things:

  • a user can now access a subset of GPUs by having RW privileges on /dev/nvidiactl and on only a subset of the /dev/nvidia[0…n] nodes, instead of the CUDA driver throwing an error whenever any node is inaccessible; devices a user doesn't have permissions for simply won't be visible to the app (think CUDA_VISIBLE_DEVICES version 2.0; see the snippet after this list)
  • latency on streamed async copies on GF100+ is much improved

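A quick way to see which devices a process actually got is plain device enumeration (a sketch, nothing 3.2-specific about the calls themselves):

    #include <cstdio>

    int main()
    {
        // Only the /dev/nvidia* nodes this process can open show up here;
        // the rest are silently hidden rather than causing a driver error.
        int n = 0;
        cudaGetDeviceCount(&n);
        for (int i = 0; i < n; ++i) {
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, i);
            printf("device %d: %s\n", i, prop.name);
        }
        return 0;
    }
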
wow… I'm presenting my master's thesis on sparse matrix computations (among other things) on CUDA on 22 September (SpMV: 34 Gflop/s peak SP, 19 Gflop/s peak DP on a GTX 285!)… lol, CUSPARSE…

Maybe you can try overloading operator new and delete to use malloc…
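
If the 3.2 compiler accepts new/delete expressions in device code at all (it may not; this is untested speculation), the overloads might look something like this sketch:

    #include <cstddef>

    // Route device-side new/delete through the device heap (sm_20+).
    __device__ void* operator new(size_t size)
    {
        return malloc(size);
    }

    __device__ void operator delete(void* ptr)
    {
        free(ptr);
    }

    struct Vec3 {
        float x, y, z;
        __device__ Vec3() : x(0.f), y(0.f), z(0.f) {}
    };

    __global__ void demo()
    {
        Vec3* v = new Vec3;   // goes through the overload above
        v->x = (float)threadIdx.x;
        delete v;
    }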

Ahem. I tested the new 64-bit toolkit and code samples on two different Windows 7 x64 machines, with developer drivers 260.61 and 260.63.

On the old 8800 GTS cards (only present to get Nsight running) everything's fine, but on our GTX 480s most of the NVIDIA samples and all of my own programs terminate with "cudaErrorDevicesUnavailable (all CUDA-capable devices are busy or unavailable)".

Any clue on that?

I moved to 3.2, and when I tried to compile some of my older code I got the following error:

nvcc error : ‘cudafe’ died due to signal 11 (Invalid memory reference)
make: *** [obj/x86_64/release/main.cu_20.o] Error 11

The same code compiles without any issues under CUDA 3.1. Am I missing something?

Furthermore, I have no such problems when compiling the SDK examples!

Another thing I observed in my quick tests with 3.2: a big drop in OpenCL performance!

I think this has to do with the new dev driver?

Anyone else able to confirm this?