New Features in CUDA 7.5

Originally published at: New Features in CUDA 7.5 | NVIDIA Technical Blog

Today I’m happy to announce that the CUDA Toolkit 7.5 Release Candidate is now available. The CUDA Toolkit 7.5 adds support for FP16 storage for up to 2x larger data sets and reduced memory bandwidth, cuSPARSE GEMVI routines, instruction-level profiling and more. Read on for full details. 16-bit Floating Point (FP16) Data CUDA 7.5 expands…

Thank you for adding Clang support on Linux! I've been looking forward to that feature for a while. Once Clang gains full MSVC ABI compatibility on Windows ( http://clang.llvm.org/docs/... ), is there a chance that CUDA will support compiling with Clang on Windows?

Clang on Linux has reached a level of maturity that (a) made it feasible for us to support it as a host compiler and (b) created demand for it as a host compiler. If/when Clang achieves that level of maturity on Windows, NVIDIA will naturally consider supporting it as a host compiler on Windows too.

Is FP16 supported in cuFFT? For example, can we use __half2 for complex values?

I get this message on my MacBook Pro:

nvcc fatal : The version ('60100') of the host compiler ('Apple clang') is not supported

A colleague recently recommended this post, which covers the use of recursive neural networks (RNNs) in Natural Language Processing (NLP). Good stuff:

http://colah.github.io/post...

Has anyone tried the CUDA 7.5 RC (64-bit version for Windows 8.1 64-bit) yet? I have encountered some serious problems with this version. Some computationally intensive code that works flawlessly with CUDA 6.0-7.0 is now broken with 7.5. For instance, my program can hang forever and never return when built with CUDA 7.5, yet works flawlessly with earlier CUDA versions (6.0, 6.5, and 7.0 are the versions I have tested so far), and Nsight/cuda-memcheck cannot find any problems with the programs either (they simply hang forever without reporting any issues).

I tested it on two different systems: one is a laptop (Haswell with a GTX 970M), and the other is a 3-GPU, 2-socket Haswell-EP workstation with one K6000 and two Titan X cards installed. On both systems, the programs hang forever with CUDA 7.5 and return normally with CUDA 7.0.

I don't know if this is just me, or whether anyone else has experienced a similar problem.

In CUDA 7.5, has NVIDIA changed any default behaviors besides the null stream's behavior? Or could this just be a driver issue?

At the moment I don't have time to track down the problem, so I'm simply going back to 7.0.

I had the exact same problem with CUDA 7.0 on my laptop in debug mode under Visual Studio. The program hung at the first line of code that allocates GPU memory. Actually, I found that if I wait long enough, it eventually returns, but the wait is painfully long. I reported the bug, but that bug report is still open as of today. I just installed CUDA 7.5 and was hoping they had fixed the problem in the new version. Apparently, the problem is still there.

Hi wq, this is one of the reasons we do a public RC -- to find issues early! I'm sorry you are having a bad experience so far. Would you be able to create a repro example? If you are a registered developer (register here if you have not already: https://developer.nvidia.co...), you can log in, file a bug, and attach the repro. Alternatively, you can email it to me at <first initial last name>@nvidia.com. If you file a bug, please post or send the bug ID#. Thanks!

Hi Sally, I'm sorry to hear this. Can you post the bug ID#? Thanks!

Hi Mark,

Thanks for the response. The bug ID# is 1644368. I also sent a repro project to your team and got confirmation that it was reproducible "on a Dell XPS platform that configured with a GT640M GPU with CUDA 7.0 Production release. Looks like it’s a specific issue on notebook model GPUs which using another graphic(i.e. intel) as display output." I can forward you the email if you want. Thanks again.

Thanks for your reply. After a little investigation, I am 99% certain that this problem is related to an open-source CUDA library function my program calls. I don't know if this is a CUDA 7.5 issue, but the library has never given me problems when built with earlier CUDA versions. I will try to contact the author of the library first.

Hi Mark,

I found out more. The code generation setting under CUDA C/C++\DEVICE in my project was set to "compute_20,sm_20;". When I added compute_30,sm_30 to the setting, the problem went away.

Hi Sally, this sounds like JIT compilation overhead. When you compile with the SM version of the GPU you are running on, the runtime doesn't need to JIT from PTX. But if you compile only for an older version, it has to JIT at startup. Usually it stores the JITted code in a cache, but there are cases where it gets flushed, or doesn't fit, or is on a remote share (in a cluster situation), all of which can make this overhead take longer than normal. I recommend you read this post: http://devblogs.nvidia.com/...
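For anyone hitting the same thing, the two usual mitigations from that post can be sketched as build settings. The architectures below are examples for Fermi/Kepler-era GPUs, and the cache-size value is an assumption for illustration, not a recommendation:

```shell
# Option 1: build a fat binary that embeds SASS for each architecture you
# deploy on (so no JIT is needed at startup), plus PTX for forward
# compatibility with future GPUs.
nvcc -gencode arch=compute_20,code=sm_20 \
     -gencode arch=compute_30,code=sm_30 \
     -gencode arch=compute_35,code=sm_35 \
     -gencode arch=compute_35,code=compute_35 \
     app.cu -o app

# Option 2: when PTX JIT is unavoidable, enlarge the JIT cache so large
# JITted binaries are not evicted between runs (value in bytes; 256 MB here).
export CUDA_CACHE_MAXSIZE=268435456
```

Option 1 trades a larger executable for predictable startup time; option 2 only helps on runs after the first, since the initial JIT still has to happen once.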

Please see my reply to sally below regarding JIT compilation overhead. Let's make sure you are not having the same problem. Perhaps the library is compiled for the default or an older architecture, and it has a lot of kernels that have to be JITted at startup? What library is it?

Mark,

Just read the blog post you recommended. From what I have seen, it does look like JIT overhead, and since my project is pretty big and uses several libraries, it could have hit the cache size limit. In my testing, when I removed all the files from my project and had only a few lines of test code in the main function, the long startup time was gone. But as I added my files back, even though I have no code calling them, I started to see the problem again.

What I don't understand is that my code generation setting was "compute_20,sm_20", and I was running debug code on my local machine, which has number of SMs = 2, so it should not need JIT. I don't have this problem if I switch the build target back to CUDA 6.5, and I don't see the long wait with CUDA 7.0 in release mode either. At least now I have a workaround of building "fat binaries" to avoid this problem.

Hi Sally: sm_20 refers to NVIDIA SM architecture version 2.0 (also known as Compute Capability 2.0), aka Fermi. It does not refer to the number of SMs on the GPU, but the capabilities (i.e. instruction set architecture) of the SM. So when you compile for a different SM version, you are targeting a different instruction set, not a different number of SMs. What GPU do you have on your local machine? I suspect it is a Kepler GPU, SM version 3.x.
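As a quick sanity check, the deviceQuery utility in the CUDA samples reports the compute capability of each installed GPU. The path below assumes a default Linux install of the CUDA 7.5 samples; adjust it for your setup:

```shell
# Build and run the deviceQuery sample shipped with the CUDA Toolkit samples.
cd ~/NVIDIA_CUDA-7.5_Samples/1_Utilities/deviceQuery
make
./deviceQuery
# The report includes a line like:
#   CUDA Capability Major/Minor version number:    3.0
# "3.0" here means SM version 3.0 (Kepler), the value to match in -gencode.
```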

No, I always build my code with compiler options that match the target CUDA devices (in my case cc/sm 3.5 and 5.2). The library involved is NVIDIA's CUB library, which is a C++ template library, so it is built with the same compiler options as my main program. With CUDA 7.5, certain CUB library functions behave as if they hit a deadlock: the GPU is busy but the actual load is very low, and it hangs basically forever. But that's just my wild guess.

I have already informed the author of CUB about this.

Oh, now I get it. Yes, my GPU is Kepler, with COMPUTE_CAPABILITY_MAJOR = 3 and COMPUTE_CAPABILITY_MINOR = 0.

I will check with Duane (the CUB author). But if you can provide a simple repro, that would help.