New Features in CUDA 7.5

Originally published at:

Today I’m happy to announce that the CUDA Toolkit 7.5 Release Candidate is now available. The CUDA Toolkit 7.5 adds support for FP16 storage for up to 2x larger data sets and reduced memory bandwidth, cuSPARSE GEMVI routines, instruction-level profiling and more. Read on for full details. 16-bit Floating Point (FP16) Data CUDA 7.5 expands…

Thank you for adding Clang support on Linux! I've been looking forward to that feature for a while. Once Clang gains full MSVC ABI compatibility on Windows, is there a chance that CUDA will support compiling with Clang on Windows?

Clang on Linux has reached a level of maturity that a) made it feasible for us to support it as a host compiler and b) created demand for it as a host compiler. If/when Clang achieves that level of maturity on Windows, naturally NVIDIA will consider supporting it as a host compiler on Windows too.

Is FP16 supported in cuFFT? For example, can we use __half2 for complex data?

I get this message on my MacBook Pro:

nvcc fatal : The version ('60100') of the host compiler ('Apple clang') is not supported

A colleague recently recommended this post that covers the use of recursive neural networks (RNNs) in Natural Language Processing (NLP). Good stuff:

Has anyone tried CUDA 7.5 RC (64-bit version for Windows 8.1 64-bit) yet? I have encountered some serious problems with this version. Some computationally intensive code that works flawlessly with CUDA 6.0-7.0 is now broken with 7.5. For instance, my program hangs forever and never returns if it is built with CUDA 7.5, yet it works flawlessly with earlier CUDA versions (6.0, 6.5 and 7.0 are the versions I have tested so far). Nsight and cuda-memcheck cannot find any problems with the programs either (they simply hang forever without reporting any issues).

I tested it on two different systems: one is a laptop (Haswell with a GTX 970M) and the other is a 3-GPU, 2-socket Haswell-EP workstation with one K6000 and two Titan X cards installed. On both systems, the programs hang forever with CUDA 7.5 and return normally with CUDA 7.0.

I don't know if this is just me, or has anyone else experienced a similar problem?

In CUDA 7.5, has NVIDIA changed any default behaviors besides the null stream's behavior? Or could this just be a driver issue?

At the moment I don't have time to track down the problem, so I have simply gone back to 7.0.

I had the exact same problem with CUDA 7.0 on my laptop in debug mode under Visual Studio. The program hangs at the first line of code that allocates GPU memory. Actually, I found out that if I wait long enough it will eventually return, but the wait is painfully long. I reported the bug, but that bug report is still open as of today. I just installed CUDA 7.5 hoping that the problem had been fixed in the new version. Apparently, it is still there.

Hi wq, this is one of the reasons we do a public RC -- to find issues early! I'm sorry you are having a bad experience so far. Would you be able to create a repro example? If you are a registered developer (register here if you have not already:), you can log in, file a bug, and attach the repro. Alternatively you can email it to me at <first initial><last name>. If you file a bug, please post or send the bug ID#. Thanks!

Hi sally, I'm sorry to hear this. Can you post the bug ID#? Thanks!

Hi Mark,

Thanks for the response. The bug ID# is 1644368. I also sent a repro project to your team and received confirmation that it was reproducible "on a Dell XPS platform that configured with a GT640M GPU with CUDA 7.0 Production release. Looks like it’s a specific issue on notebook model GPUs which using another graphic(i.e. intel) as display output." I can forward you the email if you want. Thanks again.

Thanks for your reply. After a little investigation I am 99% certain that this problem is related to a function in an open-source CUDA library that my program calls. I don't know whether this is a CUDA 7.5 issue, but this library has not given me any problems when built with earlier CUDA versions. I will try to contact the author of the library first.

Hi Mark,

I found out more. The code generation setting under CUDA C/C++\Device on my project was set to "compute_20,sm_20;". When I added compute_30,sm_30 to the setting, the problem went away.

Hi Sally, this sounds like JIT compilation overhead. When you compile with the SM version of the GPU you are running on, the runtime doesn't need to JIT from PTX. But if you compile only for an older version, it has to JIT at startup. Usually it stores the JITted code in a cache, but there are cases where it gets flushed, or doesn't fit, or is on a remote share (in a cluster situation), all of which can make this overhead take longer than normal. I recommend you read this post:
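The rule described above can be sketched as a small host-only C++ model. This is an illustration, not the CUDA runtime's actual loader: the names `Capability`, `Target`, `LoadPath` and `chooseLoadPath` are hypothetical, and the model only encodes two documented compatibility rules (an `sm_XY` binary runs on devices with the same major version and an equal or higher minor version; `compute_XY` PTX can be JIT-compiled for any device of capability X.Y or higher).

```cpp
#include <vector>

// Hypothetical model types for illustration; none of these are CUDA API names.
struct Capability { int ccMajor; int ccMinor; };

enum class TargetKind { Binary, Ptx };  // sm_XY -> Binary, compute_XY -> Ptx

struct Target { TargetKind kind; Capability cc; };

enum class LoadPath { PrecompiledBinary, JitFromPtx, NoCompatibleCode };

// True if capability a is less than or equal to capability b.
static bool atMost(Capability a, Capability b) {
    return a.ccMajor < b.ccMajor ||
           (a.ccMajor == b.ccMajor && a.ccMinor <= b.ccMinor);
}

// Decide how code embedded in an executable could be loaded on a device.
LoadPath chooseLoadPath(const std::vector<Target>& embedded, Capability device) {
    bool hasUsablePtx = false;
    for (const Target& t : embedded) {
        if (t.kind == TargetKind::Binary) {
            // A binary is usable only within the same major version.
            if (t.cc.ccMajor == device.ccMajor && t.cc.ccMinor <= device.ccMinor) {
                return LoadPath::PrecompiledBinary;
            }
        } else if (atMost(t.cc, device)) {
            // PTX is forward compatible, but must be JIT-compiled first.
            hasUsablePtx = true;
        }
    }
    return hasUsablePtx ? LoadPath::JitFromPtx : LoadPath::NoCompatibleCode;
}
```

Assuming a "compute_20,sm_20" build embeds both an sm_20 binary and compute_20 PTX, the model returns `JitFromPtx` for a 3.0 device, and `PrecompiledBinary` once an sm_30 binary is added to the list, which matches the behavior reported in this thread.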

Please see my reply to sally below regarding JIT compilation overhead. Let's make sure you are not having the same problem. Perhaps the library is compiled for the default or an older architecture, and it has a lot of kernels that have to be JITted at startup? What library is it?


I just read the blog post you recommended. From what I have seen, it does look like JIT overhead. Since my project is pretty big and uses several libraries, it could have hit the cache size limit. In my testing, when I removed all the files from my project and left just a few lines of test code in the main function, the long startup time was gone. But as I added my files back, even though no code was calling them, I started to see the problem again.

What I don't understand is that my code generation setting was "compute_20,sm_20", and I was running debug code on my local machine, which has number of SMs = 2, so it should not need JIT. I don't have this problem if I switch the build target back to CUDA 6.5, and I don't see this long wait with CUDA 7 in release mode either. At least now I have the workaround of building "fat binaries" to avoid this problem.

Hi Sally: sm_20 refers to NVIDIA SM architecture version 2.0 (also known as Compute Capability 2.0), aka Fermi. It does not refer to the number of SMs on the GPU, but the capabilities (i.e. instruction set architecture) of the SM. So when you compile for a different SM version, you are targeting a different instruction set, not a different number of SMs. What GPU do you have on your local machine? I suspect it is a Kepler GPU, SM version 3.x.
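To make the naming concrete, here is a minimal host-only C++ sketch that decodes an architecture name such as "sm_20" into a compute capability. The helper `parseArchName` and the struct `ComputeCapability` are hypothetical names for this illustration, and the sketch assumes the last digit is the minor version and the digits before it are the major version. On a real system, the CUDA runtime's `cudaGetDeviceProperties` reports the capability (`major`, `minor`) and the SM count (`multiProcessorCount`) as separate fields of `cudaDeviceProp`.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical type for illustration; not part of the CUDA API.
struct ComputeCapability { int ccMajor; int ccMinor; };

// Parses "sm_XY" or "compute_XY" into major X and minor Y.
// Returns false if the name is not in that form.
bool parseArchName(const std::string& name, ComputeCapability& out) {
    const std::size_t underscore = name.find('_');
    if (underscore == std::string::npos) return false;

    const std::string prefix = name.substr(0, underscore);
    if (prefix != "sm" && prefix != "compute") return false;

    const std::string digits = name.substr(underscore + 1);
    if (digits.size() < 2 || digits.size() > 3) return false;

    int value = 0;
    for (char c : digits) {
        if (!std::isdigit(static_cast<unsigned char>(c))) return false;
        value = value * 10 + (c - '0');
    }

    // The last digit is the minor version; the rest is the major version.
    out.ccMajor = value / 10;
    out.ccMinor = value % 10;
    return true;
}
```

With this helper, "sm_20" decodes to capability 2.0 (Fermi) and "sm_30" to 3.0 (Kepler). Neither number says anything about how many SMs the GPU has.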

No, I always build my code with the specific compiler options that match the target CUDA device (in my case, cc/sm 3.5 and 5.2). The library involved is NVIDIA's CUB library, which is a C++ template library, so it is built with the same compiler options as my main programs. With CUDA 7.5, certain CUB library functions behave as if they encounter deadlocks: the GPU is busy but the actual load is very low, and it hangs basically forever. But that's just my wild guess.

I have already informed the author of CUB about this.

Oh, now I get it. Yes, my GPU is Kepler with COMPUTE_CAPABILITY_MAJOR = 3 and COMPUTE_CAPABILITY_MINOR = 0.

I will check with Duane (CUB author). But if you can provide a simple repro that would help.