What is the difference between runtime and driver API?

I know that CUDA has two APIs, the runtime API and the driver API, but I’ve never really understood what the difference is.

When should I use the driver API, and when should I use the runtime API?

The two APIs exist largely for historical reasons. When it first came into existence, CUDA used what is now known as the driver API. However, it soon became apparent that this was a somewhat cumbersome interface, especially as far as the complexity of host code for kernel launches is concerned. This led to the invention of the runtime API, which provides a somewhat higher level of abstraction. I was the first user of the runtime API when it became available about 6.5 years ago, and I have not used the driver API since.

For a transitional period (lasting until about CUDA 3.0, if I recall correctly), there were some issues with limited interoperability between the driver and runtime APIs, and some functionality was available only via the driver API but not the runtime API. I believe these issues have been resolved, and at this time I would not recommend that people start any new development based on the driver API. The CUBLAS and CUFFT libraries, for example, are built entirely on top of the runtime API.
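As an illustration of the difference in host-code complexity, here is a minimal sketch of the same launch expressed both ways; the kernel scaleKernel, its signature, and the PTX file name are placeholders for illustration only, and cuInit()/context setup and error checking are omitted for brevity:

    #include <cuda.h>   // driver API header (cu* entry points)

    // Runtime API: the <<<...>>> launch syntax hides module management and
    // argument marshalling; nvcc generates the necessary host code.
    // extern "C" keeps the symbol name unmangled for cuModuleGetFunction below.
    extern "C" __global__ void scaleKernel(float *d_data, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d_data[i] *= 2.0f;
    }

    void launchWithRuntimeApi(float *d_data, int n)
    {
        scaleKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    }

    // Driver API: the equivalent launch requires explicit module loading,
    // function lookup, and an argument array passed to cuLaunchKernel.
    void launchWithDriverApi(CUdeviceptr d_data, int n)
    {
        CUmodule   module;
        CUfunction kernel;
        cuModuleLoad(&module, "scaleKernel.ptx");            // PTX/cubin built offline
        cuModuleGetFunction(&kernel, module, "scaleKernel");
        void *args[] = { &d_data, &n };
        cuLaunchKernel(kernel,
                       (n + 255) / 256, 1, 1,                // grid dimensions
                       256, 1, 1,                            // block dimensions
                       0, NULL,                              // shared memory bytes, stream
                       args, NULL);                          // kernel arguments, extra options
    }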

I’ve read/heard that it’s possible to use half-floats (16-bit floats) through the driver API. Is it also possible through the runtime API? I’m working on memory-demanding 4D image processing, and 16-bit floats would reduce the memory requirement by a factor of 2.

There is support for textures using half-floats, and to my knowledge this is not limited to the driver API. There are intrinsics __float2half_rn() and __half2float() for converting to and from 16-bit floating-point on the device; I believe texture access auto-converts to float on reads. Other than the support in textures, there is no native support for half-precision operations in the hardware, and thus there is no half-precision floating-point type exposed in CUDA. The basic approach is to use half-precision for storage and do computation in single precision.
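A minimal sketch of that storage-versus-computation split, using the conversion intrinsics mentioned above (in CUDA versions of that era the storage type is unsigned short; newer toolkits also expose a __half type in cuda_fp16.h):

    // Store data as 16-bit halves, but do the arithmetic in 32-bit floats.
    __global__ void scaleHalfArray(unsigned short *data, int n, float factor)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float x = __half2float(data[i]);   // half -> float on load
            x *= factor;                       // compute in single precision
            data[i] = __float2half_rn(x);      // float -> half (round to nearest) on store
        }
    }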

I have not used this functionality myself; please check the CUDA C Programming Guide and the API Reference Manual.

I have a question: can I get the same performance using the runtime API as with the driver API?

The runtime API is, for the most part, a very thin wrapper around the driver API, so performance differences should be negligible. However, I have no worked examples for a side-by-side performance comparison to prove this. I would be interested in hearing about real-life comparisons from any CUDA users who have ported their code from the driver API to the runtime API or vice versa.

I am resurrecting this thread because I am still trying to figure out the source of some weirdness in warp behavior when updating shared memory.

I have CUDA 8.0 RC and CUDA 7.5 installed, and when I run the CUDA 7.5 deviceQuery sample I see this output:

Device 0: "GeForce GTX TITAN X"
  CUDA Driver Version / Runtime Version          8.0 / 7.5
  CUDA Capability Major/Minor version number:    5.2

.....
Device 1: "GeForce GTX 1080"
  CUDA Driver Version / Runtime Version          8.0 / 7.5
  CUDA Capability Major/Minor version number:    6.1

Again, I am seeing a difference between the kernel behavior on other systems with older GPU drivers using CUDA 7.5 and the kernel behavior on this newer system, which has the latest drivers installed. CUDA 8.0 RC was not installed on the older machines, which run this particular kernel correctly every time, but on the new system there are some deviations.

On the newer machine I am building using CUDA 7.5 and running the code on a Maxwell Titan X.

My question is: if I am building against CUDA 7.5 and running on this system, does the fact that the ‘CUDA Driver Version’ is listed as 8.0 have any effect on the compilation of the code or on the behavior of the executed code?
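For reference, the driver/runtime version pair that deviceQuery reports can also be queried programmatically; a minimal sketch using cudaDriverGetVersion() and cudaRuntimeGetVersion() from the runtime API:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main(void)
    {
        int driverVersion = 0, runtimeVersion = 0;
        cudaDriverGetVersion(&driverVersion);    // installed driver's supported CUDA version, e.g. 8000 for 8.0
        cudaRuntimeGetVersion(&runtimeVersion);  // CUDA runtime the application was built against, e.g. 7050 for 7.5
        printf("CUDA Driver Version / Runtime Version  %d.%d / %d.%d\n",
               driverVersion / 1000, (driverVersion % 100) / 10,
               runtimeVersion / 1000, (runtimeVersion % 100) / 10);
        return 0;
    }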

No, I am not using the GTX 1080 for these tests, as I know it is not supported by CUDA 7.5; I am only using the Maxwell Titan X for such tests.

How could a newer driver make a difference for an existing CUDA application? Here is what I can think of off the top of my head:

[1] The application uses JIT compilation, either on purpose or inadvertently (missing SASS for the GPU architecture in use). The driver frequently contains a newer version of PTXAS than the one that ships with the offline compiler. (See the build-line sketch after this list.)

[2] There was a functional change to an existing CUDA API. This is extremely unlikely at this point and any occurrence would presumably be well communicated by NVIDIA.

[3] The application inadvertently relies on undefined behavior, and the driver artifacts in older drivers that made the code work by chance no longer exist. That is a fairly frequent scenario in my experience, as programmers have a tendency to consider whatever works with current software as specified and guaranteed behavior, even if the specification actually disagrees.

[4] A driver bug was introduced with the newer driver. It is likely that a couple of instances of this occur with every new driver release. Mostly these would be minor issues, or issues affecting infrequently occurring corner cases, as these are easiest to miss during QA.
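Regarding [1], here is a sketch of a build line that embeds both SASS for the Maxwell Titan X (sm_52) and PTX for forward compatibility; with machine code for the target architecture present in the fat binary, the driver's JIT compiler, and therefore its bundled PTXAS version, stays out of the picture for that GPU. The file names are placeholders:

    # Embed SASS for sm_52 plus PTX for compute_52 (file names are placeholders)
    nvcc -gencode arch=compute_52,code=sm_52 \
         -gencode arch=compute_52,code=compute_52 \
         -o app app.cu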

Thanks for the response.

I did find a way to guarantee the correct behavior when using either CUDA 7.5 or CUDA 8.0 RC, for both compute capability 5.2 and 6.1.

When I have the time, I will generate a reproducible test application, because there may be a problem with warp-synchronous updates to shared memory. If I did in fact make a mistake in my understanding of intra-warp behavior, that should flush it out.
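For context, here is a minimal sketch of the kind of warp-synchronous shared-memory pattern in question; this is a generic illustration, not the actual kernel from this thread. In the CUDA 7.5/8.0 era the usual safeguard is the volatile qualifier on the shared array, since omitting it allows the compiler to keep intermediate values in registers, and code can then work or fail depending on toolkit and driver version:

    // Intra-warp reduction that relies on implicit warp-synchronous execution.
    // Assumes a block size of 32 threads (one warp); generic illustration only.
    __global__ void warpSumSketch(float *out, const float *in)
    {
        __shared__ volatile float sdata[32];   // volatile forces real shared-memory traffic
        int lane = threadIdx.x & 31;
        sdata[lane] = in[blockIdx.x * 32 + lane];

        // No __syncthreads() between steps: correctness depends on all 32 lanes
        // executing in lockstep (the pre-Volta warp-synchronous idiom).
        if (lane < 16) sdata[lane] += sdata[lane + 16];
        if (lane <  8) sdata[lane] += sdata[lane +  8];
        if (lane <  4) sdata[lane] += sdata[lane +  4];
        if (lane <  2) sdata[lane] += sdata[lane +  2];
        if (lane <  1) sdata[lane] += sdata[lane +  1];

        if (lane == 0) out[blockIdx.x] = sdata[0];
    }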