New features in CUDA 5.5

If you’re a registered developer and are interested in what is new in CUDA 5.5, here are the changes that jumped out at me:

  • Many documentation updates. For example, the Runtime and Driver APIs are now split into separate docs. Lots of NVVM details.
  • CUDA MPS: "CUDA MPS is a feature that allows multiple CUDA processes to share a single GPU context."
  • A list of changes at the front of the CUDA C Programming Guide.
  • Updated instruction throughput table in the Guide.
  • A new utility "nvdisasm". See CUDA Binary Utilities.
  • The PTX ISA 3.2 doc has undergone some heavy reformatting. I prefer the old doc's forced page break for each PTX opcode. The new format is harder to navigate and skim. Please do some more formatting on this doc!
  • The atomic and intrinsic sm_35 include files are basically empty and simply include... sm_32 include files. Not sure what the implications of "sm_32" are. :)
  • There are no new intrinsics but __ldg() instances for all the native types have been properly added. This will let you avoid some ugly casting.
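
For anyone who hasn't used __ldg() yet, here is a minimal sketch of what the typed overloads look like in practice. The kernel name scaleLdg and the host helper runScaleLdg are made up for illustration, and the sketch assumes a compute capability 3.5 (or higher) device and a build along the lines of nvcc -arch=sm_35.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical kernel: scales a read-only double array, loading each
// element through the read-only data cache with __ldg().
// Requires compute capability 3.5 or higher (e.g. nvcc -arch=sm_35).
__global__ void scaleLdg(const double* __restrict__ in, double* out,
                         double factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // The double overload takes the pointer as-is, so no cast to an
        // integer type and back is needed.
        out[i] = factor * __ldg(&in[i]);
    }
}

// Hypothetical host helper: fills n elements with `value`, runs the kernel,
// and returns element 0 of the result (or -1.0 if any CUDA call fails).
double runScaleLdg(double value, double factor, int n)
{
    std::vector<double> host(n, value);
    double* dIn = NULL;
    double* dOut = NULL;
    size_t bytes = n * sizeof(double);
    double result = -1.0;

    if (cudaMalloc(&dIn, bytes) == cudaSuccess &&
        cudaMalloc(&dOut, bytes) == cudaSuccess &&
        cudaMemcpy(dIn, &host[0], bytes, cudaMemcpyHostToDevice) == cudaSuccess) {
        int threads = 128;
        int blocks = (n + threads - 1) / threads;
        scaleLdg<<<blocks, threads>>>(dIn, dOut, factor, n);
        if (cudaGetLastError() == cudaSuccess &&
            cudaMemcpy(&host[0], dOut, bytes, cudaMemcpyDeviceToHost) == cudaSuccess) {
            result = host[0];
        }
    }
    cudaFree(dIn);
    cudaFree(dOut);
    return result;
}
```

Calling runScaleLdg(2.0, 3.0, 256) should hand back 6.0 on a suitable device. The same pattern applies to the other native element types.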

What did I miss?

There are new device functions __fsub_r{z,n,d,u} and __dsub_r{z,n,d,u}. While these do not expose any new functionality they were added for programmer convenience based on customer feedback.
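
As a rough illustration of the convenience aspect, the sketch below calls the four single-precision variants next to the older way of writing the same subtraction as an addition with a negated operand. The kernel name subRounded and the host helper checkSubRounded are hypothetical names chosen for this example.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: computes a - b in each of the four rounding modes.
__global__ void subRounded(float a, float b, float* out)
{
    out[0] = __fsub_rn(a, b);   // round to nearest even
    out[1] = __fsub_rz(a, b);   // round toward zero
    out[2] = __fsub_ru(a, b);   // round toward +infinity
    out[3] = __fsub_rd(a, b);   // round toward -infinity
    out[4] = __fadd_rn(a, -b);  // the same subtraction written as an addition
}

// Hypothetical host helper: returns true when __fsub_rn matches the
// __fadd_rn spelling and the round-down/round-up results bracket it.
bool checkSubRounded(float a, float b)
{
    float h[5];
    float* d = NULL;
    bool ok = false;

    if (cudaMalloc(&d, sizeof(h)) == cudaSuccess) {
        subRounded<<<1, 1>>>(a, b, d);
        if (cudaGetLastError() == cudaSuccess &&
            cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost) == cudaSuccess) {
            ok = (h[0] == h[4]) && (h[3] <= h[0]) && (h[0] <= h[2]);
        }
    }
    cudaFree(d);
    return ok;
}
```

The __dsub_r{z,n,d,u} functions follow the same pattern for double operands.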

I have forwarded the feedback on the PTX docs. In general, please report all issues with the release candidate (including documentation issues and feature requests) through the bug reporting form linked from the registered developer website, as this provides for accurate tracking of all issues.

The Graphviz’able “call graph” output from nvdisasm is interesting!

I’ve been conditioned/Pavlov’d into filing CUDA bugs and had dutifully filed my documentation complaint earlier today. :)

The next feature I’d like in nvdisasm is a “register pressure” running total of registers in use. Bug filed.

Early recompiles show that the “merged NOT+AND” now works again on sm_3x.

Also, I had some kernels that would spill a few registers (and shouldn’t have) on CUDA 5.0. They now compile cleanly to 63 registers. So that’s positive.

Thanks for filing bugs. It is the only reliable way of getting issues into the pipeline and resolved. The CUDA 5.0 issue, where a NOT was not merged into a dependent logical operation (any LOP, not just AND) even though all GPUs support this, should be addressed comprehensively in 5.5.
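
For readers following along, the source pattern in question looks roughly like the sketch below. The names andNot and runAndNot are hypothetical. One way to see what the compiler did with it is to dump the machine code with cuobjdump -sass and look at how the NOT and the AND were emitted. The exact instructions will depend on the toolkit version and the target architecture.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: a NOT feeding a dependent logical operation (AND).
__global__ void andNot(unsigned a, unsigned b, unsigned* out)
{
    *out = a & ~b;
}

// Hypothetical host helper: runs the kernel once and returns the result,
// or 0 if any CUDA call fails.
unsigned runAndNot(unsigned a, unsigned b)
{
    unsigned h = 0;
    unsigned* d = NULL;

    if (cudaMalloc(&d, sizeof(unsigned)) == cudaSuccess) {
        andNot<<<1, 1>>>(a, b, d);
        if (cudaGetLastError() != cudaSuccess ||
            cudaMemcpy(&h, d, sizeof(unsigned), cudaMemcpyDeviceToHost) != cudaSuccess) {
            h = 0;
        }
    }
    cudaFree(d);
    return h;
}
```

The operands are passed as kernel arguments so the compiler cannot fold the expression at compile time.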

From the PTX 3.2 manual: “The texture instruction supports reads from multi-sample and multisample array …”
So it seems multisample textures are also new. More information, please, on whether and how they can be used from CUDA C kernels, since there seem to be no new texture functions.
A sample showing how to create a multisample texture, either a new one or one from a graphics GL/DX multisample texture, would also be good.
OpenCL now has extensions for both depth textures and multisample textures, so does CUDA 5.5 support depth textures, i.e. textures created from OpenGL depth textures (with a GL_DEPTH format)?
NVIDIA should also add that to OpenCL, since Intel GPUs support it right now!

It seems that GPUDirect is now available on all Kepler GPUs. Does that mean that we can finally use

cuPointerGetAttribute(&return_data, CU_POINTER_ATTRIBUTE_P2P_TOKENS, ptr0);

on GTX 670 / 680 / 690? Or do we still need a Quadro or Tesla card?

A few more bullet points from the announcement:
• Multi-process MPI debugging & profiling
• Single GPU debugging for Linux
• Step-by-step guided performance analysis
• Static CUDA runtime library
• RPM/DEB packaging & new NVIDIA repo

Mac installation was error free, but I still had some problems with the driver not matching the runtime library version. But this is likely somehow my own user error, since reinstalling 5.0, then 5.5 a second time fixed it, even though I didn’t do anything special or different.

Another nice addition to 5.5: quite a few new SDK code sample projects!

Newly added:

    MC_EstimatePiInlineQ, MC_EstimatePiP, MC_EstimatePiQ, batchCUBLAS, boxFilterNPP, cdpAdvancedQuicksort, cdpBezierTessellation, cdpLUDecomposition, cdpQuadtree, cdpSimplePrint, cdpSimpleQuicksort, conjugateGradient, conjugateGradientPrecond, cppOverload, cudaOpenMP, freeImageInteropNPP, grabcutNPP, histEqualizationNPP, imageSegmentationNPP, interval, jpegNPP, simpleCUBLAS, simpleCUDA2GL, simpleCUFFT, simpleDevLibCUBLAS, simpleHyperQ, template_runtime
  • Native ARM support (?) [1]


Yes, in the Linux toolkit. Though it’s a CROSS compiler, allowing x86 to compile for ARM. At GTC '13 Donald Becker’s ARM talk mentioned that there was no native ARM compiler support yet, mostly due to the compiler chain itself needing 64 bit support.

thanks! I must have missed that talk at GTC. There were so many interesting ones to go to :)

You can check out the slides at least.

I was a little disappointed that the CUDA static RT library was only available for VC10 (Release build) on Windows. How soon before we get at least VC11 (Debug/Release) static libs? Thx

What about the software emulation support for true pointer-as-a-pointer between CPU and GPU?
IIRC it was supposed to be emulated in software for 5.5, with full hardware support in Maxwell?



I’m testing my app with CUDA 5.5. I’ve read that to use the shared runtime library, you have to set the compiler option -cudart=shared. What happens with the CUBLAS libraries (static and dynamic)? How does that work: do I use -lcublas, or is there an option like -cudart (which I cannot find in the compiler documentation)?