CUDA 7 Release Candidate Feature Overview: C++11, New Libraries, and More

I'm extremely interested in the nvrtc runtime compilation library. However, in the samples there are only 3 applications under 4_Finance using this library, and none of them makes use of templates. It's not apparent how you would pass in template parameters, especially parameters deduced by the host compiler. Is there a sample where runtime compilation of templated device code is used? Or should we wait for the "libnvrtc user guide" that was mentioned in the release notes?
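One workaround that comes to mind, pending the user guide (a sketch only, with made-up kernel names): stringify the template arguments on the host and splice an explicit instantiation into the source string before handing it to NVRTC, wrapping it in an extern "C" kernel so the name is predictable after compilation.

```cpp
#include <cassert>
#include <string>

// Hedged sketch: bake host-side (deduced) template arguments into the
// CUDA source string as text. The extern "C" wrapper gives the kernel
// an unmangled name, so after compiling with nvrtcCreateProgram /
// nvrtcCompileProgram and loading the PTX, it can be looked up with
// cuModuleGetFunction("axpy_float"). Kernel names here are invented.
std::string make_kernel_source(const std::string& t) {
    std::string src =
        "template <typename T>\n"
        "__device__ void axpy_impl(T a, T* x, T* y, int i) {\n"
        "    y[i] += a * x[i];\n"
        "}\n";
    src += "extern \"C\" __global__ void axpy_" + t +
           "(" + t + " a, " + t + "* x, " + t + "* y) {\n"
           "    axpy_impl<" + t + ">(a, x, y, threadIdx.x);\n"
           "}\n";
    return src;
}
```

In real use the returned string would go through nvrtcCreateProgram, nvrtcCompileProgram, and nvrtcGetPTX, then be loaded with the driver API; the string-splicing step is the part that stands in for template parameter passing.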

xyzw_frequency<<<1, 1>>>(d_count, d_text, strlen(text));

When I use one dimension, the result is not correct. How do I solve this?

This is just a trivial example meant to demonstrate the language features, not to handle all configurations. Specifically, as written, it requires there to be at least as many threads running as there are matches. I realize this is kind of a silly limitation, but again, I whipped it up quickly just to demonstrate language features. I'm sure there are better ways to parallelize this and make it more general.
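One common way to lift that limitation is a grid-stride loop, where each thread starts at its global index and advances by the total thread count, so any launch configuration covers any input length. Here is a CPU-side sketch of that indexing pattern (the nested block/thread loops stand in for the GPU's parallel launch):

```cpp
#include <cassert>

// CPU sketch of the grid-stride loop pattern that removes the
// "at least as many threads as matches" limitation discussed above:
// each (block, thread) pair starts at its global index and strides by
// the total number of threads, so every input element is visited
// exactly once regardless of the launch configuration.
int count_xyzw(const char* text, int n, int grid_dim, int block_dim) {
    int count = 0;
    int stride = grid_dim * block_dim;               // gridDim.x * blockDim.x
    for (int block = 0; block < grid_dim; ++block)
        for (int thread = 0; thread < block_dim; ++thread) {
            int first = block * block_dim + thread;  // global thread index
            for (int i = first; i < n; i += stride)
                for (const char* p = "xyzw"; *p; ++p)
                    if (text[i] == *p)
                        ++count;                     // atomicAdd on the GPU
        }
    return count;
}
```

On the GPU the two outer loops disappear (they become the launch grid) and the increment becomes an atomicAdd into device memory.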

Are there any cuSolverDN examples available? There is no sample in the path \CUDA Samples\v7.0, and Appendix C of the cuSOLVER PDF document seems to give only a mathematical description of the steps, without actual code. Also, what is the meaning of the trsm mentioned in step 3?

Is there an ETA for when CUDA will add support for C++14? It's great that CUDA now supports the use of modern C++ paradigms. But I already make heavy use of C++14 features in some of my projects, and I am very eager to add support for CUDA to them. I'd love to hear about what the plans are regarding C++14 support.

It would be helpful to hear from you and others which C++14 features are most important to you, and how you would like to use them in __device__ code. Thanks!

Thanks for the fast response! I'm primarily interested in being able to use C++14 features on the host side. As an example, one codebase I work on is an n-dimensional array library similar to Eigen, except that it allows loop optimizations (e.g. parallelization, tiling, unrolling, permutation) to be applied when the RHS of an expression is assigned. I am very interested in adding GPU "gridification" as an additional optimization to this library: this would allow the user to write concise expressions involving arrays (using C++14 equivalents of numpy's syntax), and choose to evaluate the expression either on the CPU or the GPU, without rewriting any code. Generic lambdas are used internally in the library to decompose loop nests and apply the loop optimizations. Variable templates are used to allow concise shorthand, e.g. "cr<0, 10, 2>" to express the constant range [0, 10] with stride 2. This lets me write things like "arr(cr<0, 10, 2>) = 5". To get this library to compile with nvcc, I would need auto return types, generic lambdas, and variable templates to work for host code.
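For reference, the cr<...> shorthand described above can be sketched with a C++14 variable template plus a generic lambda (a hypothetical reconstruction for illustration; the actual library's types and names surely differ):

```cpp
#include <cassert>

// Hypothetical sketch of the cr<Begin, End, Stride> shorthand: a
// compile-time constant range carried entirely in template parameters.
template <int Begin, int End, int Stride>
struct const_range {
    static constexpr int begin = Begin, end = End, stride = Stride;
    static constexpr int size() { return (End - Begin) / Stride + 1; }
};

// C++14 variable template: cr<0, 10, 2> is a value, not a type name,
// so it can appear directly in expressions like arr(cr<0, 10, 2>).
template <int Begin, int End, int Stride = 1>
constexpr const_range<Begin, End, Stride> cr{};

// C++14 generic lambda: accepts any range-like value with a size().
auto range_size = [](auto r) { return r.size(); };
```

Both features (variable templates and generic lambdas) require -std=c++14, which is exactly what the host-side support question above is about.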

I pasted your example code into a file, and got

nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2014 NVIDIA Corporation
Built on Tue_Dec__9_18:10:46_CST_2014
Cuda compilation tools, release 7.0, V7.0.17

nvcc --std=c++11 -o c++11_cuda
internal error: assertion failed: remove_from_variables_list: not found (/dvs/p4/build/sw/rel/gpu_drv/r346/r346_00/drivers/compiler/edg/EDG_4.9/src/il.c, line 13467)

1 catastrophic error detected in the compilation of "/tmp/tmpxft_0000ad8c_00000000-9_c++11_cuda.cpp1.ii".
Compilation aborted.
nvcc error : 'cudafe' died due to signal 6
nvcc error : 'cudafe' core dumped

Is there anything I can do about this?

Sorry about this, John -- this is a known bug in nvcc in the CUDA 7.0 RC, and it's already fixed in the version that will ship in the CUDA 7.0 official release (note that it doesn't affect Mac/Clang). The workaround is to declare the list of letters in its own variable -- in fact, I've updated the code in the post to use the workaround so it doesn't bite anyone else.

Your project sounds interesting, Aditya -- is it open source? What do you mean by "C++14 equivalents of numpy's syntax"?

The revised version compiles. Thanks!

Here's a link to a page in the GitHub repo describing the project: I haven't advertised the project publicly yet, as I am still implementing the last few features and writing the documentation. I expect to release it publicly in about two weeks. Being able to generate GPU code in C++ using the terse syntax described on that page is why I was very excited to learn that CUDA is beginning to support modern C++ features. Please let me know if you would like any more information.

Let me also thank Mark here for the recent updates to this article, which replaced the previous highly questionable performance benchmarks (CUDA 7.0 on GPU vs. MKL on CPU) with relevant and much more reasonable comparisons.

Kudos for the prompt action and keep up the good work!

Reminder: the cuSOLVER home page ( still needs fixing.

I guess the big question is "Does NVIDIA see a conflict with std::thread and std::atomic semantics and NVIDIA parallel constructs?"

Since time has passed and CUDA 7.5 is out, let me correct my comment above. CUDA 7.5 introduces a new experimental feature, "GPU lambdas," which allows you to define a lambda in host code with a __device__ annotation and pass it to a kernel. This effectively gives you the ability to "launch" a lambda. See the CUDA 7.5 features post:
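The shape of that pattern can be sketched in plain host C++ (a sketch only: with nvcc and the experimental --expt-extended-lambda flag, apply would be a __global__ template launched over a grid, and the lambda would carry a __device__ annotation):

```cpp
#include <cassert>

// Host-compilable sketch of the "pass a lambda to a kernel" pattern.
// On the GPU, apply() would be a __global__ template and each i would
// be handled by its own thread; here a loop stands in for the grid.
template <typename F>
void apply(int n, F f) {
    for (int i = 0; i < n; ++i)  // on the GPU: one thread per i
        f(i);
}

// Usage: the caller supplies the per-element work as a lambda.
int sum_squares(int n) {
    int sum = 0;
    apply(n, [&](int i) { sum += i * i; });
    return sum;
}
```

The key point is that the kernel is a template over the callable type, so the compiler can inline the lambda body into the (would-be) device code.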

Can you elaborate?

The CUDA 7 release notes (Chapter 3) clearly state: "The CUDA Toolkit no longer supports 32-bit Windows operating systems."

Having code like

#include <thread>

void task();          // some worker function
std::thread t(task);  // fork: task() runs on a new thread
// ... t.join();      // join

is an explicit task fork-join parallel model. Would a program that exposes its parallelism using C++11 threading be mappable to an NVIDIA GPU by nvcc?

I think you could support the use of std::atomic as pass-through code to CUDA atomics, or build it from them, while keeping GPU thread execution semantics. std::mutex is probably buildable as well.
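For instance, the semantics in question look like this in standard host C++ (under the hypothetical mapping suggested above, fetch_add would lower to CUDA's atomicAdd; this is a plain CPU example, not anything nvcc supports today):

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Several threads increment one shared counter without a data race.
// std::atomic<int>::fetch_add is the operation that maps most directly
// onto CUDA's atomicAdd in the pass-through scheme sketched above.
int parallel_count(int num_threads, int increments_each) {
    std::atomic<int> counter{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < num_threads; ++t)
        pool.emplace_back([&] {
            for (int i = 0; i < increments_each; ++i)
                counter.fetch_add(1, std::memory_order_relaxed);
        });
    for (auto& th : pool)
        th.join();                // explicit join of the forked threads
    return counter.load();
}
```

The atomics map fairly naturally; it is the fork-join thread creation around them that has no obvious counterpart in the SIMT launch model.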

Perhaps a subset of std::thread programs could be efficiently mapped to a SIMT hardware design, but it would take a very clever compiler to even discover that a person was trying to write SIMT code with std::thread. Also, std::thread supports a whole range of parallel programming styles that may never map efficiently onto a GPU architecture.

I just never hear anyone at NVIDIA even mention std::thread, even the ones that are very involved in the C++ standards development.

std::thread, like pthreads, is not currently compatible with GPU execution. These C++ features are something we consider a possibility for the future.

Why is the Intel compiler supported, and C++11 supported, but not the combination?

nvcc warning : The -c++11 flag is not supported with the configured host compiler. Flag will be ignored.