As part of the Registered Developer Program.
Oh my God, that picture is amazing O_o
Btw, what does CUDA 7.5 do? 7 brought C++11 — what else is there to want? Unless reads are finally no longer done in 128-byte chunks and can be done across the whole card, is there anything really worthwhile happening?
You can have a look at the release notes on the download page.
As for my personal experience: I haven’t tested all of my programs, but I am seeing more aggressive kernel register usage across the board. I have two DES crypt(3) kernels; in terms of occupancy, one is bound by shared memory and the other by registers. The shared-memory-bound one went from
1> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
1> ptxas info : Used 56 registers, 15360 bytes smem, 356 bytes cmem[0], 48 bytes cmem[2]
to
1> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
1> ptxas info : Used 79 registers, 15360 bytes smem, 356 bytes cmem[0], 44 bytes cmem[2]
without spilling, and saw a 5% performance boost (max registers without an occupancy drop is 80 ~ 256/3); while the register-bound one went from
1> 0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
1> ptxas info : Used 164 registers, 364 bytes cmem[0], 48 bytes cmem[2]
to
1> 48 bytes stack frame, 72 bytes spill stores, 72 bytes spill loads
1> ptxas info : Used 168 registers, 364 bytes cmem[0], 44 bytes cmem[2]
and lost 2% of performance due to slight register spilling (max registers without an occupancy drop is 168 ~ 256*2/3). I was using __launch_bounds__ to limit the register count. If I remove the launch bounds, I see a large performance drop of 30% due to lower occupancy, despite there being no register spilling.
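For anyone who wants to try the same register-capping trick: __launch_bounds__ lets you promise ptxas a launch shape so it limits per-thread register use to keep that many blocks resident. A minimal sketch — the kernel body and the 256-thread / 3-block figures here are illustrative stand-ins, not my actual crypt(3) code:

```cuda
// Hypothetical kernel: promising ptxas 256 threads/block and at least
// 3 resident blocks per SM makes it cap register usage accordingly.
__global__ void
__launch_bounds__(256, 3)
des_round_kernel(const unsigned int *in, unsigned int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] ^ 0x0F0F0F0Fu;  // placeholder for the real round function
}
```

Compile with nvcc -Xptxas -v to get the register/spill report lines like the ones quoted above.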
Another thing I don’t like: I can turn on TCC mode on my Titan X now (a new CUDA 7.5 feature), but I can’t do the same for my 980 Ti. I know that TCC mode on the 980 Ti is not officially supported, but is there an architectural reason, or is it just an arbitrary decision? The two are very similar cards.
You know, for a famous economist, you sure seem to be really good at CUDA too O_o
Ahh, good to see the LOP3 RFE made it to this version.
Ok, I read the Parallel Forall blog, and all it says is that on Windows one can now use Remote Desktop (which was not possible before, other than via VPN).
But does that mean that the GTX Titan X has full TCC capability similar to the Tesla line? What about RDMA? Will you be able to toggle WDDM/TCC like the Quadro line?
If one has two GTX Titan X GPUs in the same machine, will device-to-device memory copies (P2P) now run at a faster rate than the same transaction performed using CUDA 6.5/7?
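For what it’s worth, you can at least query whether the runtime will let the two cards talk directly, using the standard peer-access API — a sketch (whether 7.5 improves the achieved bandwidth is exactly the open question):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int canAccess01 = 0, canAccess10 = 0;
    cudaDeviceCanAccessPeer(&canAccess01, 0, 1);  // can device 0 access device 1?
    cudaDeviceCanAccessPeer(&canAccess10, 1, 0);
    printf("P2P 0->1: %d, 1->0: %d\n", canAccess01, canAccess10);

    if (canAccess01) {
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);  // second argument is flags, must be 0
        // cudaMemcpyPeer(dst, 1, src, 0, nbytes) would now go over the P2P path;
        // a timed copy (or bandwidthTest) will show the actual transfer rate.
    }
    return 0;
}
```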
To me, the addition of 16-bit float support in cuBLAS is interesting.
I just ran some basic benchmarks on the new cublasSgemmEx call, and NVIDIA still has some catching up to do. It looks like (for fp16) only 128x64 tile sizes are supported, which means low utilization at minibatch sizes below 128. My open-source (Apache 2) kernels are getting 2x+ the utilization at small minibatch sizes (important for very large RNNs). I also have both minibatch transpose formats implemented, and more optimizations on the way. Oh, and this is in an easy-to-use, NumPy-like Python interface (the simplest C benchmark script I could write is over 100 lines of code, but no doubt scikits.cuda will remedy that soon).
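For reference, the new call does an SGEMM-style multiply with fp16 storage and fp32 accumulation. A host-side sketch — the data-type enum names are as I recall them from the 7.5 headers (CUBLAS_DATA_HALF; later toolkits renamed these), so check cublas_v2.h:

```cuda
// Sketch: C = alpha*A*B + beta*C with A, B, C stored as fp16,
// computed in fp32, via the new cublasSgemmEx entry point.
#include <cublas_v2.h>
#include <cuda_fp16.h>

void half_gemm(cublasHandle_t handle,
               const __half *A, const __half *B, __half *C,
               int m, int n, int k)
{
    const float alpha = 1.0f, beta = 0.0f;
    // Column-major, no transposes: A is m x k, B is k x n, C is m x n.
    cublasSgemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                  m, n, k,
                  &alpha,
                  A, CUBLAS_DATA_HALF, m,
                  B, CUBLAS_DATA_HALF, k,
                  &beta,
                  C, CUBLAS_DATA_HALF, m);
}
```

With a 128x64 tile, a minibatch (n) below 128 leaves part of every tile idle, which is where the utilization loss I measured comes from.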
I’m also able to integrate activation functions (and even pre-activation outputs) directly into the kernels, which can double the performance of RNNs/LSTMs again.
I’d cuobjdump the lib and do some more in-depth analysis, but it appears that’s currently broken (cuobjdump -xelf); cuobjdump has been broken more generally since 7.0.
For cuDNN, I’m also kind of surprised they’re only claiming a 1.5x speedup on VGG (the most important CNN right now). I’ve been posting 2x+ numbers there for several months now.
You can toggle WDDM/TCC for Titan X in nvidia-smi via the -dm (driver model) flag, e.g. nvidia-smi -g 0 -dm 1 to select TCC (0 is WDDM); it requires admin rights and a reboot, and the card can’t be driving a display.
Interesting: nvidia-smi didn’t show power draw in watts or compute utilization percentage for GeForce cards in previous drivers, and it appears it now does. Neat!