CUDA 9 Features Revealed: Volta, Cooperative Groups and More

Hello. Currently I'm having problems with cuBLAS: the library works with column-major matrices, but most of my programs define matrices in row-major order. cuBLAS would be much friendlier if it could take the data directly either way and work internally in whichever layout is most convenient. Thank you so much.

The cuBLAS API allows you to specify a transpose of your matrices. In many cases, this is handled inside the computational kernels without additional copies, so it is efficient. See the documentation for cublasOperation_t: http://docs.nvidia.com/cuda...
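For row-major data there is also a well-known trick that avoids any transpose at all: a row-major M×N matrix has exactly the same memory layout as a column-major N×M matrix, so C = A·B over row-major matrices is the same computation as Cᵀ = Bᵀ·Aᵀ over their column-major views. A minimal sketch (untested, names are illustrative) of a row-major SGEMM wrapper built on this identity:

```cpp
#include <cublas_v2.h>

// Compute C = A * B for row-major float matrices on the device:
// A is m x k, B is k x n, C is m x n.
// Each row-major matrix is handed to cuBLAS as the column-major
// view of its transpose, so we ask for C^T = B^T * A^T by swapping
// the operands and passing dimensions (n, m, k) with no transpose ops.
cublasStatus_t rowMajorSgemm(cublasHandle_t handle,
                             int m, int n, int k,
                             const float *dA, const float *dB, float *dC)
{
    const float alpha = 1.0f, beta = 0.0f;
    // Leading dimensions are the row lengths of the row-major matrices.
    return cublasSgemm(handle,
                       CUBLAS_OP_N, CUBLAS_OP_N,
                       n, m, k,
                       &alpha,
                       dB, n,   // B viewed as a column-major n x k matrix
                       dA, k,   // A viewed as a column-major k x m matrix
                       &beta,
                       dC, n);  // C viewed as a column-major n x m matrix
}
```

Because the layouts coincide, no data is copied or transposed; the result lands in dC already in row-major order.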

When can we expect a (RC) release? It's been over a month! Can't wait to try the new features.

Do cooperative groups allow synchronization across multiple SMs?

I believe __syncthreads only synchronizes threads within a single thread block (and hence a single SM). I was wondering whether this limitation is lifted with cooperative groups.

Yes, the example in the post shows how you will be able to call this_grid() to get a group referring to all threads running on the GPU (on all SMs). This can then be synchronized as shown. This functionality requires Pascal or later GPUs. In CUDA 9 you will be limited to only synchronizing ALL threads, not a subset of thread blocks. Hopefully we can generalize that and make it more flexible in a future release.
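As a concrete illustration, here is a minimal sketch (assuming CUDA 9 and a Pascal-or-later GPU) of a two-phase kernel where the second phase reads values written by *other* thread blocks, which is only safe because grid.sync() orders the phases across the whole grid:

```cpp
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

__global__ void shiftKernel(int *in, int *out, int n)
{
    cg::grid_group grid = cg::this_grid();

    // Phase 1: each thread fills its slots via a grid-stride loop.
    for (int i = grid.thread_rank(); i < n; i += grid.size())
        in[i] = i * 2;

    grid.sync();  // wait for ALL thread blocks, on all SMs

    // Phase 2: cross-block reads are now safe.
    for (int i = grid.thread_rank(); i < n; i += grid.size())
        out[i] = in[(i + 1) % n];
}
```

Note that such a kernel must be launched with cudaLaunchCooperativeKernel rather than the <<<...>>> syntax, and the launch fails if the grid cannot be co-resident on the device at once.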

Thanks, that looks like what I was looking for. Previously we had to resort to tricks like the lock-free global spinlocks ( http://eprints.cs.vt.edu/ar... ) - hopefully this new technique will be efficient

The link to your talk on GTC On-Demand is not working; could you fix it, please?

Great article! Thanks Mark!

You can find it here: http://on-demand-gtc.gputec...

Direct link: http://on-demand.gputechcon...

Fixed. Thanks.

Re-release the old Kepler GTX 780 in 6 GB and 8 GB GDDR5 variants at $100: make a market-aggressive, good-quality product at a low price (sold below production cost, like Sony's console strategy). It would be competitive with AMD's better products in the lower price range. Consider working with AMD on CUDA; in my personal opinion it is a better fit than OpenCL. The GTX 1030's price is too high (the GT 730 4GB has a better quality/price ratio). Isn't it more profitable to mass-produce the much easier-to-manufacture older technology than to keep redesigning the new product line? P.S. This is just my personal opinion; I am not an expert, so it would be more prudent to ask an economics specialist.

It already exists; it's called autocoding, or automatic coding, I don't remember... I think it's "DeepCoder" from Microsoft...

Question 2 is back. Thanks SO MUCH to AMD, which is kicking NVIDIA's politics in the ass. So now the new Titan driver "magically" provides "some" (unspecified) features of the Quadro drivers. Viva competition!

http://www.nvidia.com/downl...
https://www.reddit.com/r/nv...

What's next:
- provide drivers that support 100% virtualization
- add full 10-bit support
- enable switching from the "gaming" driver to the "pro" driver without rebooting
- create a REAL distinction between Titan and Quadro


I tested CUDA 9 against CUDA 8 today and got the expected speedup from better FFT performance. But I also got a speedup in code that serially executes groups of CUDA kernels, and I don't quite understand why it is faster, since the individual groups of kernels (so-called modules) are only marginally faster on CUDA 9. The only exception is the already-mentioned group of kernels that uses FFTs, which is around 19% faster. Any thoughts?

What sort of speedup are you getting? Looking forward to getting our ocean FFTs upgraded.

Approximately 20%.

May I refer you to "A very comprehensive and precise spec"
http://www.commitstrip.com/...

What about Cooperative Groups?
They are quite useful in some applications.

How is it possible to synchronize ALL threads when the total grid size is larger than the maximum number of threads that can actually be resident on the GPU? In that case, CUDA would need to save the contexts (local variables) of all resident threads, run the code up to g.sync() for the other parts of the grid, and then restore the saved contexts of the first grid parts to run the code after g.sync(). So how and where would the contexts of resident threads be saved?
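For reference, my understanding is that grid synchronization does not swap thread contexts at all: the cooperative launch API instead requires the whole grid to be co-resident, and cudaLaunchCooperativeKernel fails if the grid cannot fit on the device at once. The usual pattern is to size the grid with the occupancy API and cover larger problems with a grid-stride loop inside the kernel. A sketch (untested; myKernel, d_data, and n are hypothetical placeholders):

```cpp
// Size the grid to exactly fill the device, so every block is
// resident simultaneously and grid.sync() never needs to swap
// contexts. Larger problems are handled by grid-stride loops
// inside myKernel (a placeholder for your cooperative kernel).
int device, numSms, blocksPerSm;
const int blockSize = 256;
cudaGetDevice(&device);
cudaDeviceGetAttribute(&numSms, cudaDevAttrMultiProcessorCount, device);
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSm,
                                              myKernel, blockSize, 0);

dim3 grid(numSms * blocksPerSm), block(blockSize);
void *args[] = { &d_data, &n };  // d_data, n: hypothetical kernel args
cudaLaunchCooperativeKernel((void *)myKernel, grid, block, args);
```

In other words, the grid-stride loop, not context saving, is what lets a co-resident grid process more elements than it has threads.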