CUDA Toolkit 3.0 released

Changelist and downloads are available here.

Please post any questions or comments in this thread.

This is a very nice present:

Full BLAS support

Thank you,
Now you're talking.
And CUFFT with cufftPlanMany()!
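
For reference, a minimal sketch (mine, not from the release notes; the sizes and variable names are made up) of a batched 1D complex-to-complex plan using the new cufftPlanMany() call:

#include <cufft.h>
#include <stdio.h>

int main(void)
{
    cufftHandle plan;
    int n[1]  = { 256 };   /* size of each 1D transform         */
    int batch = 1024;      /* number of transforms in the batch */

    /* NULL for inembed/onembed means the data is tightly packed. */
    cufftResult res = cufftPlanMany(&plan, 1, n,
                                    NULL, 1, n[0],   /* input layout  */
                                    NULL, 1, n[0],   /* output layout */
                                    CUFFT_C2C, batch);
    if (res != CUFFT_SUCCESS) {
        printf("cufftPlanMany failed: %d\n", res);
        return 1;
    }

    /* ... cufftExecC2C(plan, d_in, d_out, CUFFT_FORWARD) on device data ... */
    cufftDestroy(plan);
    return 0;
}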

This message is new in Visual Studio 2008 (EmuDebug/x64):

    1>NOTE: device emulation mode is deprecated in this release
    1>      and will be removed in a future release.

The updated NVCC manual also states the feature is now deprecated.

I’m probably being dense, but is the future intent to have developers debug solely on-device (using Fermi’s debug support) and not via emulation?

-ASM

(EDIT: The answer is yes)

Yes, device emulation will be removed completely in 3.1.

Yay! In addition to L1 cache hit/miss counters on Fermi, they also added texture cache hit/miss counters for compute 1.x chips! Can’t wait to try this out.

From the 3.0 release Programming Guide (it was in the beta too, and I wondered about it then as well…):

In section 3.2.6.3, describing Fermi's new concurrent kernel feature, the guide says that kernels from different CUDA contexts cannot execute concurrently.

This means that even on GF100 Fermi hardware, other CUDA apps are still shut out from the GPU when you have a context running even a single kernel. But does it also imply that native graphics are also shut out while your kernel runs, just like now? I was under the impression that we would finally escape the “OS watchdog timer kills kernels” issue by allowing a portion of the GPU to continue handling OS graphics. This would also explain why Nexus needs two G200 GPUs but only one GF100 GPU to do interactive debugging.

Actually no… Don’t forget the incredibly awesome Ocelot! It’s very useful indeed, and under active development and extension.

I will still really miss the emulator, though. I abuse it a lot to make debugging tools… basically I make a class that is #ifdefed to be a no-op on the GPU, but in CPU emulation it can log statistics, dump data, assert conditions and consistency, etc… kind of an arbitrary diagnosis/trace/assert feature. So for example during a Monte Carlo simulation, I can dump out a list of the sample points which fulfill a certain criterion, then load those lists into tools like Octave to visualize them.
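
Something like this hypothetical helper (the class and method names are made up), which compiles to a no-op in a real device build but logs and checks under device emulation:

// Made-up illustration of the pattern: a diagnostic class that is a no-op
// when building real device code, but logs and asserts under -deviceemu.
#include <cstdio>

#ifdef __DEVICE_EMULATION__
struct Tracer {
    __device__ void record(float x) { printf("sample: %f\n", x); }
    __device__ void check(bool cond, const char *msg) {
        if (!cond) printf("consistency check failed: %s\n", msg);
    }
};
#else
struct Tracer {
    __device__ void record(float) {}
    __device__ void check(bool, const char *) {}
};
#endif

In device code you then call tracer.record(x) or tracer.check(x > 0.0f, "negative sample"), and it costs nothing in the real GPU build.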

Yes, a lot of this can be done without emulation mode, using cuPrintf and the GPU debugger, but it's not as configurable or easy.
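
For comparison, the cuPrintf route looks roughly like this (assuming NVIDIA's separately distributed cuPrintf.cu/cuPrintf.cuh utility is added to the project; the kernel name and threshold are made up):

#include <cuda_runtime.h>
#include "cuPrintf.cu"   // NVIDIA's cuPrintf utility, not part of the toolkit itself

__global__ void dumpBig(const float *data)
{
    float x = data[threadIdx.x];
    if (x > 1.0f)                        // only report "interesting" samples
        cuPrintf("sample %d = %f\n", threadIdx.x, x);
}

int main()
{
    float h[256], *d;
    for (int i = 0; i < 256; ++i) h[i] = i * 0.01f;
    cudaMalloc((void **)&d, sizeof(h));
    cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);

    cudaPrintfInit();                    // allocate the device-side print buffer
    dumpBig<<<1, 256>>>(d);
    cudaPrintfDisplay(stdout, true);     // copy back and print the buffered output
    cudaPrintfEnd();

    cudaFree(d);
    return 0;
}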

The other huge advantage is compile speed. During development, I always compile with the emulator first, even if I don't need to run it. Why? nvcc's slow speed. My app compiles in about 10 seconds in emulation mode, and in about 4 minutes when nvcc is generating GPU code. So when there's a simple typo in my code, I can find and fix it in seconds when the emulator build fails to compile, instead of facing a multi-minute turnaround with the full nvcc build. (Not always, of course, depending on the order of the modules that need to be compiled, but often enough that it's always better to compile the emulated version first.)

For me, the things still lacking versus DirectCompute 5.0 (which supports them on D3D11 hardware) are:

*3D writable textures and writable textures with x,y addressing, etc. (i.e. surface functions?). I hope these are not left for later than CUDA 3.1, since Fermi hardware must support them, and even the Tesla architecture already does, as it is used for RWTexture in DirectCompute and for image writes in the OpenCL driver (OpenCL seems to use cusurf functions).
*3D grid support (the OpenCL API supports it too, and I hope AMD uses it on 5xxx cards).
*Reporting of the dual DMA engine in the device attributes, or will it simply be present on all devices of compute capability >= 2.0?

It seems NVIDIA is avoiding writable textures in OpenGL 4.0 vs. D3D11 as well, although an EXT_image_load_store extension can be found in the binaries…

And that's not even considering the harder-to-implement features:
*function pointers (hey, also for OpenGL)
*memory alloc/free in kernels
*recursion
etc…

I see that the 3.0 compiler produces PTX 1.4… So are we not going to see a new version of PTX for Fermi?

Hi!

Where can I report a bug? Can it be here?

I believe I found a bug in this new nvcc/toolkit. Adding a texture inside a namespace causes errors in cudaMalloc, and probably in many other functions. In the 2.3 toolkit this works fine.

To reproduce the bug, just add in any .cu file:

namespace MyNamespace {
    texture<float, 1, cudaReadModeElementType> bugTest;
}

It's a runtime bug.
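
For context, a fuller repro might look like the sketch below (mine; the main() and error check are just illustrative). The texture declared inside the namespace is never even used, yet a later, unrelated cudaMalloc() fails at runtime under 3.0 while working under 2.3:

#include <cuda_runtime.h>
#include <cstdio>

namespace MyNamespace {
    texture<float, 1, cudaReadModeElementType> bugTest;
}

int main()
{
    float *d_ptr = NULL;
    cudaError_t err = cudaMalloc((void **)&d_ptr, 1024 * sizeof(float));
    printf("cudaMalloc: %s\n", cudaGetErrorString(err));
    if (err == cudaSuccess)
        cudaFree(d_ptr);
    return 0;
}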

I've tested it only on a GTS 360M (sm_12), driver 197.13, Visual Studio 2008, Windows 7 x64, but running CUDA as 32-bit (x86).

Artur Santos.

Why is CUFFT performance in CUDA 3.0 worse than in the CUDA 3.0 beta?
An FFT of size 256 runs at 154.7 Gflop/s under CUDA 3.0 vs. 185 Gflop/s under the 3.0 beta.
Here is a test result under CUDA 3.0:

Device: GeForce GTX 280, 1296 MHz clock, 1024 MB memory.
Compiled with CUDA 3000.
                  --------- CUFFT ---------   ----- This prototype -----   --- two-way ---
    N     Batch   Gflop/s    GB/s    error    Gflop/s    GB/s     error    Gflop/s  error
    8   1048576       8.7     9.3      1.7       81.8    87.2       1.6       81.5    2.2
   16    524288      19.6    15.7      2.0       93.3    74.6       1.5       92.6    1.8
   64    131072      85.8    45.8      1.6      185.0    98.7       2.4      184.9    2.9
  256     32768     154.7    61.9      1.7      159.3    63.7       1.9      160.1    2.9
  512     16384     214.9    76.4      2.1      239.4    85.1       2.5      239.6    3.7
 1024      8192     150.4    48.1      2.3      212.7    68.1       2.4      212.5    3.9
 2048      4096     162.2    47.2      2.6      159.3    46.3       3.0      158.7    4.5
 4096      2048     158.5    42.3      2.3      171.1    45.6       3.3      170.3    4.9
 8192      1024     134.0    33.0      2.3      185.2    45.6       3.4      183.2    5.2

Errors are supposed to be of order of 1 (ULPs).

Another test result, under the CUDA 3.0 beta, follows:

Device: GeForce GTX 280, 1296 MHz clock, 1024 MB memory.
Compiled with CUDA 3000.
                  --------- CUFFT ---------   ----- This prototype -----   --- two-way ---
    N     Batch   Gflop/s    GB/s    error    Gflop/s    GB/s     error    Gflop/s  error
    8   1048576       8.7     9.3      1.7       81.7    87.2       1.6       81.7    2.2
   16    524288      19.6    15.7      2.0       93.3    74.6       1.5       92.7    1.8
   64    131072      99.1    52.9      1.6      184.5    98.4       2.4      184.6    2.9
  256     32768     185.0    74.0      1.7      159.9    64.0       1.9      159.3    2.9
  512     16384     238.2    84.7      2.1      239.2    85.1       2.5      240.1    3.7
 1024      8192     212.0    67.8      2.3      212.7    68.1       2.4      213.1    3.9
 2048      4096     179.2    52.1      2.6      159.2    46.3       3.0      158.9    4.5
 4096      2048     164.6    43.9      2.3      171.1    45.6       3.3      170.7    4.9
 8192      1024     156.0    38.4      2.3      185.4    45.6       3.4      183.4    5.2

Errors are supposed to be of order of 1 (ULPs).

It seems the 3.0 release Visual Profiler has some errors in the code for computing memory throughput. While I would love to think that upgrading to 3.0 magically gave my GTX 275 close to 500 GB/s of global memory bandwidth, I very much doubt it… (see attached screenshot).

OS : Linux version 2.6.28-18-generic (buildd@yellow) (gcc version 4.3.3 (Ubuntu 4.3.3-5ubuntu4) ) #60-Ubuntu SMP Fri Mar 12 04:26:47 UTC 2010

Processor :  0

Model Name :  AMD Phenom(tm) II X4 945 Processor

CPU Speed :  800.000 MHz

Cache Size :  512 KB

Processor :  1

Model Name :  AMD Phenom(tm) II X4 945 Processor

CPU Speed :  800.000 MHz

Cache Size :  512 KB

Processor :  2

Model Name :  AMD Phenom(tm) II X4 945 Processor

CPU Speed :  800.000 MHz

Cache Size :  512 KB

Processor :  3

Model Name :  AMD Phenom(tm) II X4 945 Processor

CPU Speed :  800.000 MHz

Cache Size :  512 KB

Total RAM :  8077148 kB

Free RAM :  4748168 kB

NVRM version: NVIDIA UNIX x86_64 Kernel Module  195.36.15  Fri Mar 12 00:29:13 PST 2010

GCC version:  gcc version 4.3.3 (Ubuntu 4.3.3-5ubuntu4) 

Number of CUDA devices : 2

Device 0 : GeForce GTX 275

Device 1 : GeForce GTX 275

Multi-vendor OpenCL is still not there, I'm afraid. :(

Alright, so I can generate PTX 2.0 with -arch sm_20, but I still can't find any documentation for it. Any chance that this will be released soon?

Is there a difference in the developer drivers for Windows vs. the regular 197.13 drivers? If there is a difference, what aspects of development are negatively affected by using the regular driver?

Christian

Please post PTX 1.5 and 2.0 documents…

Mac related:

Is 64-bit on Mac supported?

This package works on Mac OS X running 32/64-bit.

       CUDA applications built as 32- or 64-bit (CUDA Driver API) are supported.

       CUDA applications built as 32-bit (CUDA Runtime API) are supported.

       (10.5.x Leopard and 10.6 Snow Leopard)

Note: x86_64 is not currently working for Leopard or Snow Leopard.

CUDA applications built with the CUDA Driver API can run as either 32- or 64-bit applications.

CUDA applications using the CUDA Runtime API can only be built as 32-bit applications.

Also, I'm summing up here the things promised soon by NVIDIA, so let's see how long it takes before we get:

*cuda-gdb support for hardware debugging of OpenCL kernels

*cuda-gdb GPU debugger for Mac (with OpenCL support also)

One more thing:

No 3.0 sources at:
ftp://download.nvidia.com/CUDAOpen64

No difference as far as I know.

Hi, this is a somewhat serious limitation in OpenCL:

The clGetGLContextInfoKHR function is not usable…

Using it, you get:

unresolved external symbol _clGetGLContextInfoKHR@20

so it is not in OpenCL.lib.

Anyway, I have loaded it using clGetExtensionFunctionAddress and get a non-null pointer, but calling it correctly returns an invalid value.

Below, pl[i] points to the unique NVIDIA platform, and a GL context has already been created:

// Function-pointer type for clGetGLContextInfoKHR.
typedef CL_API_ENTRY cl_int (CL_API_CALL *P1)(const cl_context_properties *properties,
                                              cl_gl_context_info param_name,
                                              size_t param_value_size,
                                              void *param_value,
                                              size_t *param_value_size_ret);

P1 myclGetGLContextInfoKHR = NULL;

myclGetGLContextInfoKHR = (P1)clGetExtensionFunctionAddress("clGetGLContextInfoKHR");

// Note: the properties list must be a zero-terminated array.
cl_context_properties props[] =
{
    CL_CONTEXT_PLATFORM,  (cl_context_properties)pl[i],
    CL_GL_CONTEXT_KHR,    (cl_context_properties)wglGetCurrentContext(),
    0
};

int N = 1000;
cl_device_id cdDeviceID[1000];
size_t size;

myclGetGLContextInfoKHR(props, CL_CURRENT_DEVICE_FOR_GL_CONTEXT_KHR,
                        N * sizeof(cl_device_id), cdDeviceID, &size);