CUDA Toolkit 3.0 beta released now with public downloads

I did not find any new CUDA 3.0 documentation this week. Is there any way to get the new docs? Or is there online Doxygen documentation for 3.0 like the one available for 2.3?

I guess they are refreshing lots of stuff, so… just be patient. Anyway, I installed the 3.0 beta and did some digging in the headers. The interesting part is inside “sm_20_intrinsics.h”.

There are several new device functions: __threadfence_system() (which confuses me), __ballot(), __syncthreads_count(), __syncthreads_and(), __syncthreads_or()… so it looks like a more efficient and effective synchronization mechanism is implemented in Fermi. But where is the native C++ and virtual function support? And new/delete? And creating new threads inside a kernel? (Those are all claimed to be implemented in Fermi.) Also I didn’t see the cache switch (16KB/48KB)… I guess we have to wait until version 3.1 or 3.2, since there are sm_20, sm_21, sm_22, and sm_23 options in nvcc…
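For reference, the new declarations look roughly like this (signatures paraphrased from my reading of the header, not from any official docs; the comments are guesses at semantics, discussed further down the thread):

```cuda
// sm_20_intrinsics.h (CUDA 3.0 beta) -- paraphrased declarations.
// Only available when compiling for sm_20; semantics are speculation.
__device__ void         __threadfence_system(void);          // fence beyond the device?
__device__ unsigned int __ballot(int predicate);              // warp-wide vote mask?
__device__ int          __syncthreads_count(int predicate);   // barrier + reduction?
__device__ int          __syncthreads_and(int predicate);
__device__ int          __syncthreads_or(int predicate);
```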

(By the way, are we allowed to post such info here?)

Last time I tried, gcc 4.4 failed on cuda_runtime.h.

Should be no problem according to this comment:

@theMatix:

Thanks, all the warnings disappeared when using -isystem /usr/local/cuda/include instead of -I/usr/local/cuda/include, but now I get some errors (!) instead :-)

/usr/local/cuda/include/surface_functions.h: In function 'void surf1Dread(T*, surface<void, 1>, int, int, cudaSurfaceBounderyMode)':

/usr/local/cuda/include/surface_functions.h:99: error: there are no arguments to '__surf1Dreadc1' that depend on a template parameter, so a declaration of '__surf1Dreadc1' must be available

/usr/local/cuda/include/surface_functions.h:99: error: (if you use '-fpermissive', G++ will accept your code, but allowing the use of an undeclared name is deprecated)

/usr/local/cuda/include/surface_functions.h:100: error: there are no arguments to '__surf1Dreads1' that depend on a template parameter, so a declaration of '__surf1Dreads1' must be available

/usr/local/cuda/include/surface_functions.h:101: error: there are no arguments to '__surf1Dreadu1' that depend on a template parameter, so a declaration of '__surf1Dreadu1' must be available

... more similar errors ...

/usr/local/cuda/include/surface_functions.h: In function 'void surf2Dread(T*, surface<void, 2>, int, int, int, cudaSurfaceBounderyMode)':

/usr/local/cuda/include/surface_functions.h:459: error: there are no arguments to '__surf2Dreadc1' that depend on a template parameter, so a declaration of '__surf2Dreadc1' must be available

/usr/local/cuda/include/surface_functions.h:460: error: there are no arguments to '__surf2Dreads1' that depend on a template parameter, so a declaration of '__surf2Dreads1' must be available

... more similar errors ...

With -fpermissive the errors disappear, but I don’t want to use this flag. Furthermore, I’m a little confused why surface_functions.h gets processed at all; it doesn’t get included directly or indirectly. Any hint?

It is probably #included within the cpp file generated by nvcc and passed to gcc. You can use “nvcc -keep” to keep this file and have a look at it.

Better synchronization definitely sounds good! I’ve been quite frustrated trying to optimise around __syncthreads() recently!

I wonder what __threadfence_system() is. The rest I can make a guess at, but that seems confusing.

You’re right, common_functions.h gets included which indirectly includes surface_functions.h. But I have no idea how to get rid of the errors without -fpermissive. I suppose this is a bug in 3.0, because the same configuration works just fine with 2.3.

Publicly available download links are now available in the first post.

Uhh… Thanks ;) Must have been an overwhelming demand.

Disappointingly still only with the 2.3 programming guide.

Quick note…in the 2.3 programming guide, section 3.2.3 there is a code block with a line:

if (dev == 0) {

But I think it’s supposed to be:

if (device == 0) {

since there is no “dev” variable defined there. Not a big deal but it might be confusing to someone reading through the manual for the first time.
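For context, the surrounding code in section 3.2.3 is the device-enumeration loop. Reconstructed here from memory of the 2.3 guide (so details may be off), with the suggested fix applied it reads roughly:

```cuda
// Device enumeration loop from the 2.3 programming guide, sec. 3.2.3
// (reconstructed from memory; "device" replaces the guide's "dev").
int deviceCount;
cudaGetDeviceCount(&deviceCount);
int device;
for (device = 0; device < deviceCount; ++device) {
    cudaDeviceProp deviceProp;
    cudaGetDeviceProperties(&deviceProp, device);
    if (device == 0) {   // the guide prints this only for the first device
        if (deviceProp.major == 9999 && deviceProp.minor == 9999)
            printf("There is no device supporting CUDA.\n");
    }
}
```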

Isn’t this supposed to be in announcement/news forum ?

Hi,

For some reason, a clean install on Snow Leopard (i.e. previously deleting /usr/local/cuda and /Developer/GPU Computing) resulted in the CUDA toolkit being installed with no read/execute permissions for normal users, which caused building the SDK to fail with linker errors (i.e. -lcudart not found). “chmod -R a+rx /usr/local/cuda” fixed this.

Mu-Chi, thanks for starting to dig!

So what’s everyone’s speculation?

Here are just guesses to start the analysis.

__threadfence_system() does a thread fence wait that guarantees writes are resolved not just for my kernel, but for all kernels on the device. It would allow one kernel to send data to another and know that the data has been resolved and is now accessible. So you’d write out a “here’s the data for the job you need to do”, then call __threadfence_system(), then fire off an atomic or kernel event (or kernel launch!?) so the other kernel can work on it.

Alternate hypothesis: a manual flush of the L1 caches of all SMs, perhaps necessary when writing to system memory that other kernels (in different address space!???) will read.

__ballot(): I hope, oh so much, that this is a voting mechanism that lets each thread in a warp vote, after which every thread in the warp is given a bitmask of the 32 vote results. This would be awesome. I posted this idea to the wishlist thread a long time ago. It would be great to coordinate work within a warp. (Who has work? Ballot! Looks like 7 of us have work to do and thread #3 is the lowest bit set, so let’s all follow the job thread #3 has for us.)
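If that guess is right, the “follow the lowest-lane job” idea might look like this sketch (pure speculation about undocumented intrinsics; the kernel and variable names are mine, compiled for sm_20):

```cuda
// Hypothetical use of __ballot(), assuming it returns a 32-bit mask
// of per-lane votes to every thread in the warp (one block = one warp).
__global__ void ballotSketch(const int *work, int *leader)
{
    bool hasWork = work[threadIdx.x] != 0;      // this thread's vote
    unsigned int mask = __ballot(hasWork);      // bit i = lane i's vote
    // Lowest set bit = first lane that has work; __ffs() is 1-based,
    // returning 0 when the mask is empty, hence the -1.
    int first = __ffs(mask) - 1;                // -1 if nobody has work
    if (threadIdx.x == 0)
        *leader = first;
}
```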

__syncthreads_count(): acts like __syncthreads(), but takes an integer argument (each thread may pass in a different value). When the barrier returns, the sum of the arguments over all threads is returned. This is like doing a reduction in one statement. Even if it’s not any faster than doing it yourself, it’s really convenient! The _and and _or versions do the same for bitwise masks. Maybe this is Fermi-only since it’d use the L1 cache to transparently hold the reduction’s intermediate storage. The _count() suffix instead of _add() may mean that it counts booleans rather than summing arbitrary integers… but still very useful.
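Under that reading, the “one-line reduction” would look something like this (again, just a guess at the semantics; names are mine):

```cuda
// Hypothetical use of __syncthreads_count(), assuming it behaves as a
// barrier that also returns, to every thread, the count of threads
// that passed a nonzero predicate.
__global__ void countSketch(const float *data, int *result)
{
    int positive = data[threadIdx.x] > 0.0f;      // per-thread predicate
    int total = __syncthreads_count(positive);    // barrier + reduction
    if (threadIdx.x == 0)
        *result = total;                          // positives in the block
}
```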

The above is pure speculation based entirely on the names of the new intrinsics.

If my syncthreads_count() hypothesis is correct, it’d also be great to have the same kind of “one-line reduction” ability for a single warp. (Ha, I’m already starting with the feature requests…)

Maybe __threadfence_system() is a threadfence for zero-copy memory copy over the PCIe bus?

I bet you’re right! That’d be useful for sending data to and from the CPU, especially job and work assignments. This will be a lot more common with Fermi where we will likely have the option of persistent kernels (no timeouts).
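If the zero-copy guess holds, a kernel handing a result back over mapped host memory might look like this sketch (speculative; “flag” and “out” are hypothetical device pointers obtained via cudaHostGetDevicePointer() on cudaHostAlloc’ed mapped memory, with the CPU polling the flag):

```cuda
// Sketch of the zero-copy guess: publish a payload into mapped host
// memory, fence so the write is visible across the PCIe bus, then
// raise a "ready" flag the host is polling.
__global__ void publish(volatile int *flag, float *out, float value)
{
    if (threadIdx.x == 0) {
        *out = value;            // write the payload
        __threadfence_system();  // guess: flush so the host sees *out first
        *flag = 1;               // then publish the ready flag
    }
}
```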

This is also assuming we can persistently allocate just some of the SMs to the video display… seems feasible, and very, very useful, though perhaps tricky on the video driver side.

Hello,

There are lots of exciting new features in this release. I’m looking forward to taking it for a ride. Unfortunately the Windows XP 32 notebook drivers included in this thread do not install correctly on my system. I’m running Windows XP 32 bit under Bootcamp on a Macbook Pro laptop with a 9600M GT GPU.

The error message is “The NVIDIA setup program could not locate any drivers that are compatible with your current hardware. Setup will now exit.”

The released 190.38 Windows XP 32 notebook CUDA drivers do install correctly on this machine though, so I am able to run CUDA applications built against the 2.3 toolkit.

Cheers,
Mike

A common problem, unrelated to CUDA, really. Basically your OEM hardware is slightly custom and in theory you should get drivers from that manufacturer so they can customize it for your exact hardware.

In practice this is just an added INF file.

But you can do it yourself with non OEM drivers. Bookmark this site: http://www.laptopvideo2go.com/

Wow, thank you very much.
Downloaded, installed, and running in no time.
A few glitches while running samples with images: when the cursor is on the image and I try to shut down the application, it doesn’t release immediately.
One bug in compiling:
error g++ /include not found.
To solve it, I did a make in the subdirectory and then another make in the main directory. I could not repeat the problem.
It seems to be running even faster than the previous release.

Compiling string failed in CUDA 3.0 beta:
http://forums.nvidia.com/index.php?showtopic=150558