CUDA 1.0 FAQ (OBSOLETE) Frequently asked questions about CUDA

NVIDIA CUDA FAQ version 1.0

This document is intended to answer frequently asked questions seen on the CUDA forums.

PLEASE NOTE: this version of the FAQ is now obsolete; the most recent version is posted here:


General questions

[*] What is NVIDIA CUDA?

NVIDIA® CUDA™ technology is a fundamentally new computing architecture that enables the GPU to solve complex computational problems in consumer, business, and technical applications. CUDA (Compute Unified Device Architecture) technology gives computationally intensive applications access to the tremendous processing power of NVIDIA graphics processing units (GPUs) through a revolutionary new programming interface. Providing orders of magnitude more performance than current CPUs and simplifying software development by extending the standard C language, CUDA technology enables developers to create innovative solutions for data-intensive problems. For advanced research and language development, CUDA includes a low level assembly language layer and driver interface.

[*] What is NVIDIA Tesla™?

NVIDIA Tesla is a family of GPU computing solutions aimed at high performance computing (HPC) applications. The family is built around the NVIDIA Tesla GPU Computing Processor, a dedicated computing board. The family includes the NVIDIA Tesla Deskside Supercomputer, a scalable computing system that includes two NVIDIA Tesla GPUs and attaches to a PC or workstation through an industry-standard PCI-Express connection, and the NVIDIA Tesla GPU Computing Server, a 1U server housing up to eight NVIDIA Tesla GPUs.

[*] What kind of performance increase can I expect using GPU Computing over CPU-only code?

This depends on how well the problem maps onto the architecture. For data parallel applications, we have seen speedups anywhere from 10x to 200x.

[*] What operating systems does CUDA support?

CUDA supports Windows XP and Linux, both 32-bit and 64-bit versions. Windows Vista is not currently supported.

NVIDIA supports CUDA on the following Linux distributions:

    [*] Red Hat Enterprise Linux 3.8

    [*] Red Hat Enterprise Linux 4.3

    [*] Red Hat Enterprise Linux 4.4

    [*] Red Hat Enterprise Linux 5.0

    [*] SUSE Linux Enterprise Desktop 10.0

    [*] SUSE Linux 10.1

    [*] SUSE Linux 10.2

Users have reported that CUDA works on other Linux distributions including Fedora Core, SuSE, Ubuntu, and Gentoo, but these are not officially supported.

[*] What GPUs does CUDA run on?

CUDA 1.0 supports the following NVIDIA GPUs. The compute capability version is indicated in parentheses.

    [*] Tesla C870 (1.0)

    [*] Tesla D870 (1.0)

    [*] Tesla S870 (1.0)

    [*] Quadro FX 4600 (1.0)

    [*] Quadro FX 5600 (1.0)

    [*] GeForce 8800 (1.0)

    [*] GeForce 8500 (1.1)

    [*] GeForce 8600 (1.1)

    [*] GeForce 8600M (1.1)

    [*] GeForce 8400M (1.1)

GPU Computing is a standard feature in NVIDIA’s 8-Series and future GPUs. CUDA will be supported across a range of NVIDIA GPUs, although we recommend that the GPU have at least 256MB of graphics memory. System configurations with less than the recommended memory size may not have enough memory to properly support CUDA programs.

[*] Can I run CUDA under DOS?

CUDA will not work in full-screen DOS mode since the display driver is not loaded.

[*] What is the difference between compute capability 1.1 and 1.0?

Later GPUs in the G8x series support compute capability 1.1, which includes new instructions. Compute capability 1.1 supports atomic operations on global memory. See “What are atomic operations?” in the programming section below.

[*] What are the technical specifications of the NVIDIA Tesla C870 Processor ?

The Tesla C870 consists of 16 multiprocessors, each comprising 8 processors, for a total of 128 processors.

There is 16KB of shared memory per multiprocessor.

Each processor has a floating point unit which is capable of performing a single scalar multiply-add, plus a “superfunc” operation (such as rsqrt or sin/cos) per clock cycle.

The processors are clocked at 1.35 GHz. The peak computation rate accessible from CUDA is therefore around 346 GFLOPS (128 * 2 * 1.35). If you include the graphics functionality that is accessible from CUDA (such as texture interpolation), the FLOPs rate is much higher.

The card includes 1.5 GB of device memory. The maximum observed bandwidth between system and device memory is about 3GB/second.

Other products in the G8x series have the same basic architecture, but vary in the number of multiprocessors, clock speed, memory bus width and amount of memory.

See the programming guide for more details.

[*] Does CUDA support multiple graphics cards in one system?

Yes. There are motherboards available (for example the ASUS P5N32-E SLI) that support up to 3 cards. It is the responsibility of the application to distribute work across multiple GPUs.

The Tesla Deskside Supercomputer supports multiple GPUs in an external enclosure:

[*] Is it possible to DMA directly into GPU memory from another PCI-E device (for example a frame grabber or network card)?

Not currently, but we are investigating this.

[*] Where can I find a good introduction to parallel programming?

The course “ECE 498: Programming Massively Parallel Processors” at the University of Illinois, co-taught by Dr. David Kirk and Dr. Wen-mei Hwu is a good introduction:

The book “An Introduction to Parallel Computing: Design and Analysis of Algorithms” by Grama, Karypis et al. isn’t bad either (despite the reviews).

Although not directly applicable to CUDA, the book “Parallel Programming in C with MPI and OpenMP” by M.J. Quinn is also worth reading.

[*] How does CUDA structure computation?

CUDA broadly follows the data-parallel model of computation. Each of the SIMD processors executes the same instruction on different elements of the data in parallel.

The data is split up into a 1D or 2D grid of blocks. Each block can be 1D, 2D or 3D in shape, and can consist of up to 512 threads on current hardware. Threads within a thread block can cooperate via the shared memory.

Thread blocks are executed as smaller groups of threads known as “warps”. The warp size is 32 threads on G8x.
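The model can be sketched as follows (the kernel and all names here are illustrative, not from the SDK):

```cuda
// Illustrative sketch of the CUDA execution model: each thread
// computes one output element, indexed by its block and thread IDs.
__global__ void scaleKernel(float *out, const float *in, float s, int n)
{
    // Global index = block index * block size + thread index within block.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = s * in[i];
}

// Host side: launch a 1D grid of 1D blocks of 256 threads each.
void scale(float *d_out, const float *d_in, float s, int n)
{
    dim3 block(256);                        // up to 512 threads per block
    dim3 grid((n + block.x - 1) / block.x); // enough blocks to cover n
    scaleKernel<<<grid, block>>>(d_out, d_in, s, n);
}
```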

[*] Can I run CUDA remotely?

Under Linux it is possible to run CUDA programs via remote login. We currently recommend running with an X server.

CUDA does not work with Windows Remote Desktop, although it does work with VNC.

[*] How do I pronounce CUDA?


Programming questions


[*] What are the advantages of CUDA vs. graphics-based GPGPU?

CUDA is designed from the ground-up for efficient general purpose computation on GPUs. It uses a C-like programming language and does not require remapping algorithms to graphics concepts.

CUDA exposes several hardware features that are not available via the graphics API. The most significant of these is shared memory, which is a small (currently 16KB per multiprocessor) area of on-chip memory which can be accessed in parallel by blocks of threads. This allows caching of frequently used data and can provide large speedups over using textures to access data. Combined with a thread synchronization primitive, this allows cooperative parallel processing of on-chip data, greatly reducing the expensive off-chip bandwidth requirements of many parallel algorithms. This benefits a number of common applications such as linear algebra, Fast Fourier Transforms, and image processing filters.

Whereas fragment programs in the graphics API are limited to outputting 32 floats (RGBA * 8 render targets) at a pre-specified location, CUDA supports scattered writes - i.e. an unlimited number of stores to any address. This enables many new algorithms that were not possible to perform efficiently using graphics-based GPGPU.

The graphics API forces the user to store data in textures, which requires packing long arrays into 2D textures. This is cumbersome and imposes extra addressing math. CUDA can perform loads from any address.

CUDA also offers highly optimized data transfers to and from the GPU.

[*] How do I use multiple GPUs in one system?

See the “multiGPU” example in the SDK.

[*] Can the CPU and GPU run in parallel?

Kernel invocation in CUDA 1.0 is asynchronous, so the driver will return control to the application as soon as it has launched the kernel.

The “cudaThreadSynchronize()” API call should be used when measuring performance to ensure that all device operations have completed before stopping the timer.

CUDA functions that perform memory copies and that control graphics interoperability are synchronous, and implicitly wait for all kernels to complete.
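A minimal timing pattern along these lines (startTimer/stopTimer stand in for whatever timing facility you use; the kernel name is illustrative):

```cuda
// Because kernel launches are asynchronous, stop the timer only
// after cudaThreadSynchronize() returns.
startTimer();
myKernel<<<grid, block>>>(d_data);   // returns to the host immediately
cudaThreadSynchronize();             // blocks until the kernel finishes
stopTimer();
```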

[*] Can I download data and run a kernel in parallel (for streaming applications)?

Not currently. This functionality may be included in a future release.

[*] Is it possible to DMA directly into GPU memory from another PCI-E device?

Not currently, but we are investigating ways to enable this.

[*] Is it possible to write the results from a kernel directly to texture (for multi-pass algorithms)?

Not currently, but you can copy from global memory back to the array (texture). Device to device memory copies are very fast in CUDA 1.0.

[*] Can I write directly to the framebuffer?

No, you have to write to a mapped pixel buffer object in OpenGL, and then render from this. The copies are in video memory and very fast, however. See the “postProcessGL” sample in the SDK for more details.

[*] Can I texture directly from textures created in OpenGL/Direct3D?

Not currently. You can only texture from textures created in CUDA. It is possible to read OpenGL textures to buffer objects and then map these and do global reads in CUDA.

[*] How does CUDA/Direct3D interoperability work?

Direct3D interoperability is limited to vertex buffers.

[*] Does graphics interoperability work with multiple GPUs?

Not currently. Graphics interop is only supported on single GPU systems.

[*] How do I get the best performance when transferring data to and from OpenGL pixel buffer objects (PBOs)?

For optimal performance when copying data to and from PBOs, you should make sure that the format of the source data is compatible with the format of the destination. This will ensure that the driver doesn’t have to do any format conversion on the CPU and can do a direct copy in video memory. When copying 8-bit color data from the framebuffer using glReadPixels we recommend using the GL_BGRA format and ensuring that the framebuffer has an alpha channel (e.g. glutInitDisplayMode(GLUT_RGBA_ | GLUT_ALPHA) if you’re using GLUT).

[*] What texture features does CUDA support?

CUDA supports 1D and 2D textures, which can be accessed with normalized (0…1) or integer coordinates. Textures can also be bound to linear memory and accessed with the “tex1Dfetch” function.

The hardware only supports 1, 2 and 4-component textures, not 3-component textures.

3D textures, cube maps, texture arrays, compressed textures and mip-maps are not currently supported.
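The linear-memory path can be sketched as follows (the kernel and variable names are illustrative):

```cuda
// Sketch: binding linear memory to a texture reference and reading
// it with tex1Dfetch. Only the CUDA API names are real; the rest
// are made up for illustration.
texture<float, 1, cudaReadModeElementType> texRef;

__global__ void copyKernel(float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch(texRef, i);  // integer index into linear memory
}

// Host side, where d_in was allocated with cudaMalloc:
//   cudaBindTexture(0, texRef, d_in, n * sizeof(float));
```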

[*] What are the maximum texture sizes supported?

The maximum size for a 2D texture is 64K by 32K pixels, assuming sufficient device memory is available.

1D textures are limited to 8K elements.

1D “buffer” textures bound to linear memory are limited to 2^27 elements.

[*] What is the size of the texture cache?

The texture cache has an effective 16KB working set size per multiprocessor and is optimized for 2D locality.

[*] Are graphics operations such as z-buffering and alpha blending supported in CUDA?

No. Access to video memory in CUDA is done via the load/store mechanism, and doesn’t go through the normal graphics raster operations like blending. We don’t have any plans to expose blending or any other raster ops in CUDA.

[*] What are the peak transfer rates between the CPU and GPU?

The performance of memory transfers depends on many factors, including the

size of the transfer, and type of system motherboard used.

We recommend NVIDIA nForce motherboards for best transfer performance.

Example measured numbers for a Core 2 Duo processor, ASUS P5N32-SLI motherboard with 1GB memory and a GeForce 8800 GTX are:

                Pageable     Page-locked

Host - Device   1.7 GB/sec   3.1 GB/sec

Device - Host   1.7 GB/sec   3.1 GB/sec

Device - Device 70.7 GB/sec  70.7 GB/sec

You can measure it on your system using the bandwidthTest sample from the SDK.

Page-locked memory can be allocated using cuMemAllocHost(). However, allocating too much page-locked memory can significantly affect the overall performance of the system, so use it with care.

[*] What is the precision of mathematical operations in CUDA?

All compute-capable NVIDIA GPUs support 32-bit integer and single precision floating point arithmetic. They follow the IEEE-754 standard for single-precision binary floating-point arithmetic, with some minor differences - notably that denormalized numbers are not supported. See the programming guide for more details.

[*] Does CUDA support double precision arithmetic?

CUDA supports the C “double” data type. However on G8x series GPUs, these types will get demoted to 32-bit floats.

You should be careful to specify float constants in single precision (i.e. “1.0f”) so that they will not accidentally be compiled to double precision on future hardware.

NVIDIA GPUs supporting double precision in hardware will become available in late 2007.

[*] When should I use the __mul24 and __umul24 functions?

G8x hardware supports integer multiply with only 24-bit precision natively (add, subtract and logical operations are supported with 32 bit precision natively). 32-bit integer multiplies compile to multiple instruction sequences and take around 16 clock cycles.

You can use the __mul24 and __umul24 built-in functions to perform fast multiplies with 24-bit precision.

Be aware that future hardware may switch to 32-bit native integers, in which case __mul24 and __umul24 may actually be slower. For this reason we recommend using a macro so that the implementation can be switched easily.
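One way to follow the macro recommendation above (names are illustrative):

```cuda
// Keep the choice in one place: switch IMUL between __mul24 and a
// plain 32-bit multiply depending on the target hardware.
#ifdef USE_MUL24
#define IMUL(a, b) __mul24(a, b)   // fast on G8x; 24-bit operands only
#else
#define IMUL(a, b) ((a) * (b))     // full 32-bit multiply
#endif

__global__ void indexKernel(int *out, int pitch)
{
    // Safe here: block and thread indices easily fit in 24 bits.
    int i = IMUL(blockIdx.x, blockDim.x) + threadIdx.x;
    out[i] = IMUL(i, pitch);
}
```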

[*] Does CUDA support 16-bit (half) floats?

The driver API supports textures that contain 16-bit floats through the CU_AD_FORMAT_HALF array format. The values are automatically promoted to 32-bit during the texture read.

16-bit float textures are planned for a future release of CUDART.

Other support for 16-bit floats, such as enabling kernels to convert between 16- and 32-bit floats (to read/write float16 while processing float32), also is planned for a future release.

[*] Where can I find documentation on the PTX assembly language?

This is included in the CUDA 1.0 release.

[*] How do I see the PTX code generated by my program?

Add “-keep” to the nvcc command line (or custom build setup in Visual Studio) to keep the intermediate compilation files. Then look at the “.ptx” file. The “.cubin” file also includes useful information including the actual number of hardware registers used by the kernel.

[*] Is it possible to see PTX assembly interleaved with C code?

Yes! Add the option “--opencc-options -LIST:source=on” to the nvcc command line.

[*] Does CUDA support operations on vector types?

CUDA defines vector types such as float4, but doesn’t include any operators on them by default. However, you can define your own operators using the standard C++ style:

__device__ float4 operator+(const float4 &a, const float4 &b)
{
    return make_float4(a.x + b.x, a.y + b.y, a.z + b.z, a.w + b.w);
}

Note that since G80 is a scalar architecture there is no inherent performance advantage to using vector types for calculation.

[*] Does CUDA support swizzling?

CUDA does not support swizzling (e.g. “vector.wzyx”, as used in the Cg/HLSL shading languages), but you can access the individual components of vector types.

[*] Is it possible to run multiple CUDA applications and graphics applications at the same time?

CUDA is a client of the GPU in the same way as the OpenGL and Direct3D drivers are - it shares the GPU via time slicing. It is possible to run multiple graphics and CUDA applications at the same time, although currently CUDA only switches at the boundaries between kernel executions.

The cost of context switching between CUDA and the graphics API is roughly the same as switching graphics contexts. This isn’t something you’d want to do more than a few times a frame, but is certainly fast enough to make it practical for use in games.

[*] Can CUDA survive a mode switch?

If the display resolution is increased while a CUDA application is running, the CUDA application is not guaranteed to survive the mode switch. The driver may have to reclaim some of the memory owned by CUDA for the display.

[*] Is it possible to execute multiple kernels at the same time?

No, CUDA only executes a single kernel on the machine at once. In some cases it is possible to have a single kernel perform multiple tasks by branching in the kernel based on the thread id.
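For example, a single kernel can branch on the block index to cover two tasks (a sketch; the tasks and the split point are illustrative):

```cuda
// Sketch: one kernel performing two tasks by branching on the
// block index. Blocks below splitBlock do "task A", the rest "task B".
__global__ void fusedKernel(float *a, float *b, int splitBlock)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (blockIdx.x < splitBlock)
        a[i] = a[i] * 2.0f;                          // task A
    else
        b[i - splitBlock * blockDim.x] += 1.0f;      // task B
}
```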

[*] What is the maximum length of a CUDA kernel?

The maximum kernel size is 2MB of native instructions.

[*] How can I debug my CUDA code?

Use device emulation for debugging (breakpoints, stepping, printfs), but make sure to keep checking that your program runs correctly on the device too as you add more code. This is to catch issues, before they become buried in too much code, that are hard or impossible to catch in device emulation mode. The two most frequent ones are:

  • Dereferencing device pointers in host code or host pointers in device code

  • Missing __syncthreads().

See the section “Debugging using the Device Emulation Mode” in the programming guide for more details.

[*] How can I optimize my CUDA code?

Here are some basic tips:

    [*] Make as much use of shared memory as possible (it is much faster than global memory).

    [*] Make sure global memory reads and writes are coalesced where possible (see the programming guide).

    [*] Avoid large-scale bank conflicts in shared memory.

    [*] If your memory reads are hard to coalesce, try using texture fetches instead.

    [*] Use types like float4 to load 128 bits in a single load.

    [*] Avoid divergent branches within a warp where possible.
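As an illustration of the float4 tip above (a sketch; names are made up):

```cuda
// One 128-bit load per thread instead of four 32-bit loads.
// The float4 data must be 16-byte aligned (cudaMalloc guarantees this).
__global__ void sumComponents(float *out, const float4 *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float4 v = in[i];             // single 128-bit load
        out[i] = v.x + v.y + v.z + v.w;
    }
}
```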

[*] How do I choose the optimal number of threads per block?

For maximum utilization of the GPU you should carefully balance the number of threads per thread block, the amount of shared memory per block, and the number of registers used by the kernel.

You can use the CUDA occupancy calculator tool to compute the multiprocessor occupancy of a GPU by a given CUDA kernel:

[*] What is the maximum kernel execution time?

On Windows, individual GPU program launches have a maximum run time of around 5 seconds. Exceeding this time limit usually will cause a launch failure reported through the CUDA driver or the CUDA runtime, but in some cases can hang the entire machine, requiring a hard reset.

This is caused by the Windows “watchdog” timer that causes programs using the primary graphics adapter to time out if they run longer than the maximum allowed time.

For this reason it is recommended that CUDA is run on a GPU that is NOT attached to a display and does not have the Windows desktop extended onto it. In this case, the system must contain at least one NVIDIA GPU that serves as the primary graphics adapter.

[*] Why do I get the error message: “The application failed to initialize properly”?

This problem is associated with improper permissions on the DLLs (shared libraries) that are linked with your CUDA executables. All DLLs must be executable. The most likely problem is that you unzipped the CUDA distribution with cygwin’s “unzip”, which sets all permissions to non-executable. Make sure all DLLs are set to executable, particularly those in the CUDA_BIN directory, by running the cygwin command “chmod +x *.dll” in the CUDA_BIN directory. Alternatively, right-click on each DLL in the CUDA_BIN directory, select Properties, then the Security tab, and make sure “read & execute” is set. For more information see:

[*] What are atomic operations?

Atomic operations allow multiple threads to perform concurrent read-modify-write operations in memory without conflicts. The hardware serializes accesses to the same address so that the behaviour is always deterministic. The functions are atomic in the sense that they are guaranteed to be performed without interruption from other threads. Atomic operations must be associative (i.e. order independent).

Atomic operations are useful for sorting, reduction operations and building data structures in parallel.

Devices with compute capability 1.1 support atomic operations on 32-bit integers in global memory. This includes logical operations (and, or, xor), increment and decrement, min and max, exchange and compare and swap (CAS).

To compile code using atomics you must add the option “-arch sm_11” to the nvcc command line.

Atomic operations are not currently supported on floating point values, or on shared memory.

There is no radiation risk from atomic operations.
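As a small illustration, here is a global-memory histogram built with atomicAdd (a sketch; names are illustrative):

```cuda
// Sketch: concurrent histogram construction with atomicAdd.
// Requires compute capability 1.1 and the "-arch sm_11" nvcc option.
__global__ void histogram(unsigned int *bins, const unsigned char *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[data[i]], 1);  // serialized per address; total is deterministic
}
```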

[*] Does CUDA support function pointers?

No, current hardware does not support function pointers. If you don’t need to switch between functions at runtime, you can sometimes use C++ templates or macros to compile different versions of your kernels that can be switched between by the host code.
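For example, a compile-time switch using a template parameter (a sketch; the kernel and its two behaviours are illustrative):

```cuda
// Sketch: selecting behaviour at compile time with a template
// parameter instead of a function pointer. Two kernel versions are
// generated; host code picks one at launch time.
template <bool doubleIt>
__global__ void transform(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = doubleIt ? data[i] * 2.0f : data[i] + 1.0f;
}

// Host code:
//   if (wantDouble) transform<true><<<grid, block>>>(d_data, n);
//   else            transform<false><<<grid, block>>>(d_data, n);
```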



[*] What is CUFFT?

CUFFT is a Fast Fourier Transform (FFT) library for CUDA. See the CUFFT documentation for more information.

[*] What types of transforms does CUFFT support?

The current release supports complex to complex (C2C), real to complex (R2C) and complex to real (C2R).

[*] What is the maximum transform size?

For 1D transforms, the maximum transform size is 16M elements in the 1.0 release.



[*] What is CUBLAS?

CUBLAS is an implementation of BLAS (Basic Linear Algebra Subprograms) on top of the CUDA driver. It allows access to the computational resources of NVIDIA GPUs. The library is self contained at the API level, that is, no direct interaction with the CUDA driver is necessary.

See the documentation for more details.
