CUDA Toolkit 3.0 released

Changelist and downloads are available here.

Please post any questions or comments in this thread.

This is a very nice present:

Full BLAS support

Thank you,
Now you're talking.
And CUFFT with cufftPlanMany()!
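
For reference, a minimal sketch (mine, not from the release notes; the sizes and variable names are made up) of a batched 1D complex-to-complex plan using the new cufftPlanMany() call:

#include <cufft.h>
#include <stdio.h>

int main(void)
{
    cufftHandle plan;
    int n[1]  = { 256 };   /* size of each 1D transform         */
    int batch = 1024;      /* number of transforms in the batch */

    /* NULL for inembed/onembed means the data is tightly packed. */
    cufftResult res = cufftPlanMany(&plan, 1, n,
                                    NULL, 1, n[0],   /* input layout  */
                                    NULL, 1, n[0],   /* output layout */
                                    CUFFT_C2C, batch);
    if (res != CUFFT_SUCCESS) {
        printf("cufftPlanMany failed: %d\n", res);
        return 1;
    }

    /* ... cufftExecC2C(plan, d_in, d_out, CUFFT_FORWARD) on device data ... */
    cufftDestroy(plan);
    return 0;
}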

This message is new in Visual Studio 2008 (EmuDebug/x64):

    1>NOTE: device emulation mode is deprecated in this release
    1>      and will be removed in a future release.

The updated NVCC manual also states the feature is now deprecated.

I’m probably being dense, but is the future intent to have developers debug solely on-device (using Fermi’s debug support) and not via emulation?

-ASM

(EDIT: The answer is yes)

Yes, device emulation will be removed completely in 3.1.

Yay! In addition to L1 cache hit/miss counters on Fermi, they also added texture cache hit/miss counters for compute 1.x chips! Can’t wait to try this out.

From the 3.0 release Programming Guide (it was in the beta too, and I wondered about it then as well…):

In section 3.2.6.3, describing Fermi's new concurrent kernel feature, the guide says that kernels from different CUDA contexts cannot execute concurrently.

This means that even on GF100 Fermi hardware, other CUDA apps are still shut out from the GPU when you have a context running even a single kernel. But does it also imply that native graphics are also shut out while your kernel runs, just like now? I was under the impression that we would finally escape the “OS watchdog timer kills kernels” issue by allowing a portion of the GPU to continue handling OS graphics. This would also explain why Nexus needs two G200 GPUs but only one GF100 GPU to do interactive debugging.

Actually no… Don’t forget the incredibly awesome Ocelot! It’s very useful indeed, and under active development and extension.

I will still really miss the emulator, though. I abuse it a lot to make debugging tools… basically I make a class that is #ifdefed to be a no-op on the GPU, but in CPU emulation it can log statistics, dump data, assert conditions and consistency, etc… kind of an arbitrary diagnosis/trace/assert feature. So for example during a Monte Carlo simulation, I can dump out a list of the sample points which fulfill a certain criterion, then load those lists into tools like Octave to visualize them.
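
Something like this hypothetical helper (the class and method names are made up), which compiles to a no-op in a real device build but logs and checks under device emulation:

// Made-up illustration of the pattern: a diagnostic class that is a no-op
// when building real device code, but logs and asserts under -deviceemu.
#include <cstdio>

#ifdef __DEVICE_EMULATION__
struct Tracer {
    __device__ void record(float x) { printf("sample: %f\n", x); }
    __device__ void check(bool cond, const char *msg) {
        if (!cond) printf("consistency check failed: %s\n", msg);
    }
};
#else
struct Tracer {
    __device__ void record(float) {}
    __device__ void check(bool, const char *) {}
};
#endif

In device code you then call tracer.record(x) or tracer.check(x > 0.0f, "negative sample"), and it costs nothing in the real GPU build.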

Yes, a lot of this can be done without emulation mode, using cuPrintf and the GPU debugger, but it's not as configurable or easy.
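
For comparison, the cuPrintf route looks roughly like this (assuming NVIDIA's separately distributed cuPrintf.cu/cuPrintf.cuh utility is added to the project; the kernel name and threshold are made up):

#include <cuda_runtime.h>
#include "cuPrintf.cu"   // NVIDIA's cuPrintf utility, not part of the toolkit itself

__global__ void dumpBig(const float *data)
{
    float x = data[threadIdx.x];
    if (x > 1.0f)                        // only report "interesting" samples
        cuPrintf("sample %d = %f\n", threadIdx.x, x);
}

int main()
{
    float h[256], *d;
    for (int i = 0; i < 256; ++i) h[i] = i * 0.01f;
    cudaMalloc((void **)&d, sizeof(h));
    cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);

    cudaPrintfInit();                    // allocate the device-side print buffer
    dumpBig<<<1, 256>>>(d);
    cudaPrintfDisplay(stdout, true);     // copy back and print the buffered output
    cudaPrintfEnd();

    cudaFree(d);
    return 0;
}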

The other huge advantage is compile speed. During development, I always compile with the emulator first, even if I don't need to run it. Why? nvcc's slow speed. My app compiles in about 10 seconds in emulation mode, and in about 4 minutes when nvcc is generating GPU code. So when there's a simple typo in my code, I can find and fix it in seconds when the emulator build fails to compile, instead of facing a multi-minute turnaround with the full nvcc build. (Not always, of course, depending on the order of the modules that need to be compiled, but often enough that it's always better to compile the emulated version first.)

For me, the things still lacking versus DirectCompute 5.0 (which supports them on D3D11 hardware) are:

*3D writable textures and writable textures with x,y addressing, etc. (i.e. surface functions?). I hope these are not left for later than CUDA 3.1, since Fermi hardware must support them, and even the Tesla architecture already does, as it is used for RWTexture in DirectCompute and for image writes in the OpenCL driver (OpenCL seems to use cusurf functions).
*3D grid support (the OpenCL API supports it too, and I hope AMD uses it on 5xxx cards).
*Reporting of the dual DMA engine in the device attributes, or will it simply be present on all devices of compute capability >= 2.0?

It seems NVIDIA is avoiding writable textures in OpenGL 4.0 vs. D3D11 as well, although an EXT_image_load_store extension can be found in the binaries…

And that's not even considering the harder-to-implement features:
*function pointers (hey, also for OpenGL)
*memory alloc/free in kernels
*recursion
etc…

I see that the 3.0 compiler produces PTX 1.4… So are we not going to see a new version of PTX for Fermi?

Hi!

Where can I report a bug? Can it be here?

I believe I found a bug in this new nvcc/toolkit. Adding a texture inside a namespace causes errors in cudaMalloc, and probably in many other functions. In the 2.3 toolkit this works fine.

To reproduce the bug, just add in any .cu file:

namespace MyNamespace {
    texture<float, 1, cudaReadModeElementType> bugTest;
}

It's a runtime bug.
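
For context, a fuller repro might look like the sketch below (mine; the main() and error check are just illustrative). The texture declared inside the namespace is never even used, yet a later, unrelated cudaMalloc() fails at runtime under 3.0 while working under 2.3:

#include <cuda_runtime.h>
#include <cstdio>

namespace MyNamespace {
    texture<float, 1, cudaReadModeElementType> bugTest;
}

int main()
{
    float *d_ptr = NULL;
    cudaError_t err = cudaMalloc((void **)&d_ptr, 1024 * sizeof(float));
    printf("cudaMalloc: %s\n", cudaGetErrorString(err));
    if (err == cudaSuccess)
        cudaFree(d_ptr);
    return 0;
}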

I've tested it only on a GTS 360M (sm_12), driver 197.13, Visual Studio 2008, Windows 7 x64, but running CUDA as 32-bit (x86).

Artur Santos.

Why is CUFFT performance in CUDA 3.0 worse than in the CUDA 3.0 beta?
An FFT of size 256 runs at 154.7 Gflop/s under CUDA 3.0 vs. 185 Gflop/s under the 3.0 beta.
Here is a test result under CUDA 3.0:

Device: GeForce GTX 280, 1296 MHz clock, 1024 MB memory.
Compiled with CUDA 3000.
                  --------- CUFFT ---------   ----- This prototype -----   --- two-way ---
    N     Batch   Gflop/s    GB/s    error    Gflop/s    GB/s     error    Gflop/s  error
    8   1048576       8.7     9.3      1.7       81.8    87.2       1.6       81.5    2.2
   16    524288      19.6    15.7      2.0       93.3    74.6       1.5       92.6    1.8
   64    131072      85.8    45.8      1.6      185.0    98.7       2.4      184.9    2.9
  256     32768     154.7    61.9      1.7      159.3    63.7       1.9      160.1    2.9
  512     16384     214.9    76.4      2.1      239.4    85.1       2.5      239.6    3.7
 1024      8192     150.4    48.1      2.3      212.7    68.1       2.4      212.5    3.9
 2048      4096     162.2    47.2      2.6      159.3    46.3       3.0      158.7    4.5
 4096      2048     158.5    42.3      2.3      171.1    45.6       3.3      170.3    4.9
 8192      1024     134.0    33.0      2.3      185.2    45.6       3.4      183.2    5.2

Errors are supposed to be of order of 1 (ULPs).

Another test result, under the CUDA 3.0 beta, follows:

Device: GeForce GTX 280, 1296 MHz clock, 1024 MB memory.
Compiled with CUDA 3000.
                  --------- CUFFT ---------   ----- This prototype -----   --- two-way ---
    N     Batch   Gflop/s    GB/s    error    Gflop/s    GB/s     error    Gflop/s  error
    8   1048576       8.7     9.3      1.7       81.7    87.2       1.6       81.7    2.2
   16    524288      19.6    15.7      2.0       93.3    74.6       1.5       92.7    1.8
   64    131072      99.1    52.9      1.6      184.5    98.4       2.4      184.6    2.9
  256     32768     185.0    74.0      1.7      159.9    64.0       1.9      159.3    2.9
  512     16384     238.2    84.7      2.1      239.2    85.1       2.5      240.1    3.7
 1024      8192     212.0    67.8      2.3      212.7    68.1       2.4      213.1    3.9
 2048      4096     179.2    52.1      2.6      159.2    46.3       3.0      158.9    4.5
 4096      2048     164.6    43.9      2.3      171.1    45.6       3.3      170.7    4.9
 8192      1024     156.0    38.4      2.3      185.4    45.6       3.4      183.4    5.2

Errors are supposed to be of order of 1 (ULPs).

It seems the 3.0 release Visual Profiler has some errors in the code for computing memory throughput. While I would love to think that upgrading to 3.0 magically gave my GTX 275 close to 500 GB/s of global memory bandwidth, I very much doubt it… (see attached screenshot).

OS : Linux version 2.6.28-18-generic (buildd@yellow) (gcc version 4.3.3 (Ubuntu 4.3.3-5ubuntu4) ) #60-Ubuntu SMP Fri Mar 12 04:26:47 UTC 2010

Processor :  0

Model Name :  AMD Phenom(tm) II X4 945 Processor

CPU Speed :  800.000 MHz

Cache Size :  512 KB

Processor :  1

Model Name :  AMD Phenom(tm) II X4 945 Processor

CPU Speed :  800.000 MHz

Cache Size :  512 KB

Processor :  2

Model Name :  AMD Phenom(tm) II X4 945 Processor

CPU Speed :  800.000 MHz

Cache Size :  512 KB

Processor :  3

Model Name :  AMD Phenom(tm) II X4 945 Processor

CPU Speed :  800.000 MHz

Cache Size :  512 KB

Total RAM :  8077148 kB

Free RAM :  4748168 kB

NVRM version: NVIDIA UNIX x86_64 Kernel Module  195.36.15  Fri Mar 12 00:29:13 PST 2010

GCC version:  gcc version 4.3.3 (Ubuntu 4.3.3-5ubuntu4) 

Number of CUDA devices : 2

Device 0 : GeForce GTX 275

Device 1 : GeForce GTX 275

Multi-vendor OpenCL is still not there, I'm afraid. :(

Alright, so I can generate PTX 2.0 with -arch sm_20, but I still can't find any documentation for it. Any chance that this will be released soon?

Is there a difference in the developer drivers for Windows vs. the regular 197.13 drivers? If there is a difference, what aspects of development are negatively affected by using the regular driver?

Christian

Please post PTX 1.5 and 2.0 documents…

Mac related:

Is 64-bit on Mac supported?

This package works on Mac OS X running 32/64-bit.

       CUDA applications built as 32- or 64-bit (CUDA Driver API) are supported.

       CUDA applications built as 32-bit (CUDA Runtime API) are supported.

       (10.5.x Leopard and 10.6 Snow Leopard)

Note: x86_64 is not currently working for Leopard or Snow Leopard.

CUDA applications built with the CUDA Driver API can run as either 32- or 64-bit applications.

CUDA applications using the CUDA Runtime API can only be built as 32-bit applications.

Also, I'm summing up here the things promised soon by NVIDIA, so let's see how long it takes before we get:

*cuda-gdb support for hardware debugging of OpenCL kernels

*cuda-gdb GPU debugger for Mac (with OpenCL support also)

One more thing:

No 3.0 sources at:
ftp://download.nvidia.com/CUDAOpen64

No difference as far as I know.

Hi, this is a somewhat serious limitation in OpenCL:

The clGetGLContextInfoKHR function is not usable…

Using it, you get:

unresolved external symbol _clGetGLContextInfoKHR@20

so it is not in OpenCL.lib.

Anyway, I have loaded it using clGetExtensionFunctionAddress and get a non-null pointer, but calling it correctly returns an invalid value.

Below, pl[i] points to the unique NVIDIA platform, and a GL context has already been created:

// Function-pointer type for clGetGLContextInfoKHR.
typedef CL_API_ENTRY cl_int (CL_API_CALL *P1)(const cl_context_properties *properties,
                                              cl_gl_context_info param_name,
                                              size_t param_value_size,
                                              void *param_value,
                                              size_t *param_value_size_ret);

P1 myclGetGLContextInfoKHR = NULL;

myclGetGLContextInfoKHR = (P1)clGetExtensionFunctionAddress("clGetGLContextInfoKHR");

// Note: the properties list must be a zero-terminated array.
cl_context_properties props[] =
{
    CL_CONTEXT_PLATFORM,  (cl_context_properties)pl[i],
    CL_GL_CONTEXT_KHR,    (cl_context_properties)wglGetCurrentContext(),
    0
};

int N = 1000;
cl_device_id cdDeviceID[1000];
size_t size;

myclGetGLContextInfoKHR(props, CL_CURRENT_DEVICE_FOR_GL_CONTEXT_KHR,
                        N * sizeof(cl_device_id), cdDeviceID, &size);