Titan X (with latest drivers) slower than Titan Black with older drivers

Just got the new Titan X and did a cursory benchmark on a Matlab mex function.

It is not heavy on number crunching, so it is not representative of that case, but I was surprised to see it running at half the speed of the Titan Black with the previous driver.

So I ran the same benchmark with the Titan Black and the new driver, and it now runs at half the speed it did before.

Obviously this points to a driver issue rather than a hardware issue.

Could the latest (350.12) driver be that bad?

It occurred to me that in order to use all 24 SMMs on the Titan X and thus run faster, the code may need to be compiled for compute capability 5.2.

Is that the case?

Of course compiling for 3.5 does not explain why a Titan Black runs slower with the latest drivers than it did with the older drivers…

Yes, it needs to be compiled for 5.2.
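
If it helps, here is roughly what adding an sm_52 target to the nvcc step of the mex build looks like (the .cu file name is a placeholder; keep the sm_35 target for the Titan Black):

nvcc -c mykernels.cu -O3 -gencode arch=compute_35,code=sm_35 -gencode arch=compute_52,code=sm_52 -gencode arch=compute_52,code=compute_52

The last -gencode embeds PTX so the driver can still JIT the code for architectures that are not compiled in explicitly.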

In general I have found the GTX Titan X to be about 30% faster than the GTX 980 and the GTX 780ti.

Surprisingly, the Titan X seems to be the best at 32-bit integer operations, and is over 40% faster than the GTX Titan Black for the same problems.

Using CUDA 6.5 on Windows 7 x64.

So maybe still using CUDA 5.5 accounts for the slowness of the latest drivers…

I updated to the latest CUDA 6.5.19 and my CUDA function still takes twice the best time I got with the Titan Black and the older drivers. ???

It seems to me you are performing an insufficiently controlled experiment where at least two variables are changed at the same time, making it impossible to tease apart the contributions of either change. I would suggest switching to controlled experiments, where only one variable is changed at any time.

If you keep the GPU the same, change the driver version, and see a performance difference, that could be indicative of an actionable driver change, and you could consider filing a bug with NVIDIA.

If you keep the driver version the same and change the GPU from Titan Black to Titan X (or vice versa), any performance differences observed are likely a result of the different hardware architectures and specifications of the two GPUs, and not likely indicative of any bug.

Running a benchmark from Matlab also sounds a bit precarious, as it might introduce some unknown variables.

I think the best way to be certain that your kernel is indeed running slower is to run the Visual Profiler and get a really exact measurement. Also consider calling the kernel multiple times to account for any “warm-up” effects.

To run matlab and visual profiler you can pass something like:

matlab.exe -nojvm -nodesktop -wait -sd <start_directory> -r <your_script>

Remember to:
Add an “exit;” at the end of your m-file script.
Add cudaDeviceReset() at the end of your cuda code call.
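
For reference, a minimal self-contained sketch of a mex/CUDA file set up this way (the kernel is a hypothetical stand-in for your real kernels; the m-file script would just call the mex function and end with exit;):

#include "mex.h"
#include <cuda_runtime.h>

// hypothetical stand-in for the real kernel(s) being profiled
__global__ void dummyKernel(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    const int n = 1024;
    float *d_x = NULL;
    cudaMalloc((void **)&d_x, n * sizeof(float));

    dummyKernel<<<(n + 255) / 256, 256>>>(d_x, n);
    cudaDeviceSynchronize();     // wait for the kernel to finish

    cudaFree(d_x);
    cudaDeviceReset();           // flushes profiling data so the Visual Profiler can collect it
}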

Other than that, I basically agree with Norbert's conclusion.

I have both the Titan Black and the Titan X installed in my system.

I set the active device before running my code and both devices are slower than with the older, pre-Titan X drivers (v344 & v337).

I have since run more intensive number crunching code, and found the v350 drivers to take 3x longer than my previous times with the older drivers. The v353 driver improves to 2.8x longer.

Typically, performance regressions caused by changing drivers are not anywhere close to a factor of three. Such big differences are typically indicative of inadvertently performing a debug build instead of a release build. So I would suggest double checking the compilation settings to make sure you are comparing release builds for all benchmarked cases. In particular, the -G switch should not occur in release builds. Is the build performed offline or are you relying on JIT compilation?
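
For example, the difference between a release-style and a debug-style device build is roughly the following (the .cu file name is a placeholder):

nvcc -O3 -lineinfo -gencode arch=compute_52,code=sm_52 -c mykernels.cu     (release; -lineinfo keeps source correlation for the profiler)
nvcc -G -g -gencode arch=compute_52,code=sm_52 -c mykernels.cu             (debug; -G disables most device optimizations and can cost several x in kernel time)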

You would definitely want to run the app with the profiler to establish which (if any) GPU kernels are responsible for the massive slowdown, or to determine that the slowdown is in fact due to host-side code.

Standard benchmarking caveats about measuring performance in a “warmed-up” steady state, to the exclusion of one-time start-up costs, etc., apply. In a previous post, Jimmy Pettersson also gave some additional advice specific to timing in a Matlab environment.
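
For what it's worth, a minimal standalone sketch of warmed-up kernel timing with CUDA events (the kernel is a placeholder, not your code):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummyKernel(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 20;
    float *d_x = NULL;
    cudaMalloc((void **)&d_x, n * sizeof(float));

    // warm-up launch absorbs one-time costs (context creation, module load, etc.)
    dummyKernel<<<(n + 255) / 256, 256>>>(d_x, n);
    cudaDeviceSynchronize();

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < 100; ++i)      // average over many launches
        dummyKernel<<<(n + 255) / 256, 256>>>(d_x, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("average kernel time: %.3f ms\n", ms / 100.0f);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_x);
    return 0;
}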

The mex functions are being compiled with Visual Studio in release mode (no debug) and run with no debug monitor, over multiple runs.

We have run the same mex function, with the same environment on a different machine with a K80 installed (using one GPU) and it runs ~6x faster than the Titan X. :(

Previously we were seeing about the same speed on the K80 machine as the Titan Black, maybe 10-20% faster.

I ran the Nsight profiler on this mex function on the Titan X, and the two kernels are reported to take ~26us.

The K80 takes ~1.2ms for this mex function, while the Titan X takes ~7.2ms.

IME, the Titan X should run significantly faster than the half K80. The Titan Black was ~33% faster than the K20, but now it’s slower with the later v353 drivers.

BTW, all the data is float, no doubles, to get the fastest performance from a GTX board.

If I understand your data above correctly, the actual GPU kernels take up a minuscule portion of the total run time of the MEX function. So the bottleneck appears to be outside the GPU kernels, which is something you would want to look into, since it means the GPU is not being used efficiently. E.g., is each MEX call going through a cudaMalloc / cudaFree cycle, rather than re-using existing allocations?
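
To illustrate the kind of reuse I mean, here is a sketch of a mex file that keeps its device scratch buffer alive between calls (names and sizes are hypothetical):

#include "mex.h"
#include <cuda_runtime.h>

static float *d_work = NULL;        // persists between calls while the mex stays loaded
static size_t d_work_bytes = 0;

static void cleanup(void)           // registered with mexAtExit; runs when the mex is cleared
{
    if (d_work) { cudaFree(d_work); d_work = NULL; d_work_bytes = 0; }
}

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    size_t needed = 1024 * sizeof(float);    // placeholder for the size this call actually needs

    if (needed > d_work_bytes) {             // (re)allocate only when the buffer is too small
        if (d_work) cudaFree(d_work);
        cudaMalloc((void **)&d_work, needed);
        d_work_bytes = needed;
        mexAtExit(cleanup);
    }

    // ... launch the kernels using d_work, set plhs[] ...
}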

Since you are on Windows, the first thing that comes to mind is that the Titan X is running with the WDDM driver while the K80 runs with the much more efficient TCC driver. So the differences in timing might come down to the inherent (in-)efficiency of the two driver models, although the differences you report above exceed what I have seen exacted as the typical “WDDM tax”.
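
You can check which driver model each board is running with nvidia-smi; whether TCC can actually be selected on a given GeForce/Titan board depends on the board and driver:

nvidia-smi -q                 (look for the “Driver Model” section: Current/Pending, WDDM vs. TCC, per GPU)
nvidia-smi -i 0 -dm TCC       (request TCC for device 0; only succeeds where the board/driver allow it, and requires a reboot)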

An alternative hypothesis is that the slowdown might have something to do with the MEX mechanism and how it interacts with CUDA. I am not familiar with MEX at all. Have you inquired about your case in a relevant MATLAB forum to see whether other people have made similar observations?

Point is, I didn’t have similar problems previously with WDDM drivers and mex functions.

The Titan Black was 33% faster than the K20.

I just reinstalled the v344 drivers and tested the same mex function with the Titan Black.

The time for the mex function in a 1000x loop is 4.3ms / iter.

I updated the driver to 353.06 for the Titan X and the time is 5.9-6.5ms / iter.

Back to the Titan Black with 353.06 and the time is 5.8-6.4ms / iter.

That is a minimum 35% penalty just for updating the driver, and no gain for the Titan X.

I expected some gain for the Titan X, not a 35% penalty.

Maybe it’s the PCI interface and not the cores, but both Titans are definitely slower with the 353 drivers than the Black was with 344.

It looks like we will not be able to resolve the issue by means of a discussion in this forum. I have no idea what your code looks like, which makes remote diagnosis a guessing game.

It is possible that much of the overhead in the MEX function is due to driver functionality and that this driver functionality has different performance characteristics between driver versions. It is further possible that the functions involved are not typically considered performance-critical as they occur off the critical path in common use cases (e.g. creating and tearing down contexts), and therefore the slowdown may have gone unnoticed.

If you have self-contained repro code in hand that reproduces the issue, you could always file a bug report with NVIDIA regarding a performance regression when upgrading from driver version 344 to driver version 353. The bug reporting form is linked from the registered developer website.

If this were my code, I would also be highly concerned about the massive discrepancy between MEX function run time (4-5 milliseconds) and CUDA kernel run time (2 x 30 microseconds) and investigate the source of the massive overhead.

The mex function overhead is likely due to the large number of inputs to the kernels (~15), most of which are gpuArrays (some are scalars); each input has to be processed and its device pointer extracted before the CUDA kernels can be called.

That is overhead required by mex calls and there is little if anything that can be done to reduce it. I have seen similar or greater performance penalties for other mex and Matlab gpu functions using the 353 drivers.
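
For what it's worth, with the mxGPUArray API that per-input processing is along these lines (a hypothetical sketch, not my actual code; single-precision input assumed), and with ~15 inputs it is repeated for every one of them on every call:

#include "mex.h"
#include "gpu/mxGPUArray.h"     // mxGPUArray API from the Parallel Computing Toolbox

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    mxInitGPU();                 // must be called before any other mxGPU* function

    // wrap the incoming gpuArray and pull out its raw device pointer
    const mxGPUArray *inA = mxGPUCreateFromMxArray(prhs[0]);
    const float *d_A = (const float *)mxGPUGetDataReadOnly(inA);

    // ... same for the remaining gpuArray inputs, then launch the kernels on d_A, ...

    mxGPUDestroyGPUArray(inA);   // releases the wrapper, not the gpuArray's device data
}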

A fairly lengthy Matlab program consisting of all single gpuArray data and a mix of mex CUDA and Matlab (gpuArray processing) functions, takes ~30s with the Titan Black & v344 drivers and ~40s with a K20.

With the Titan X & v353, it takes ~82s. This is with an on average much higher ratio of floating point kernel processing to mex interface than the mex function discussed above. I think we can safely conclude that the 353 performance penalty is not entirely due to mex interface processing.

I use a Titan X with MATLAB daily (CUDA 6.5, GTX 980 for video, GTX Titan-X for compute, driver 352.86) and have no performance issues. The Titan-X far outperforms my older GTX 780ti, and there is very little overhead from the mex function and the WDDM driver.

Did you say that you pass already-allocated gpuArray pointers from MATLAB to the CUDA mex function? I always pass host pointers to the mex, copy to device memory in the typical way, and copy back to host MATLAB memory at the end. Maybe your particular method is causing some memory problems.

You can see exactly the approach I use for a mex file here:

https://github.com/OlegKonings/BCI_EEG_blk_diag_admm_multi_lambda/blob/master/GroupMextest/GroupMextest/GLmex.cpp

So try not involving MATLAB at all with device memory (do not use GPUArray or anything else in that toolbox), and just let the CUDA mex deal with device memory.
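
A minimal sketch of that pattern (the kernel, sizes, and error checking are placeholders/omitted):

#include "mex.h"
#include <cuda_runtime.h>

// hypothetical kernel standing in for the real work
__global__ void scaleKernel(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    const int n = (int)mxGetNumberOfElements(prhs[0]);
    const float *h_in = (const float *)mxGetData(prhs[0]);   // plain host data (single) from MATLAB

    float *d_x = NULL;
    cudaMalloc((void **)&d_x, n * sizeof(float));
    cudaMemcpy(d_x, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    scaleKernel<<<(n + 255) / 256, 256>>>(d_x, n);

    plhs[0] = mxCreateNumericMatrix(n, 1, mxSINGLE_CLASS, mxREAL);
    cudaMemcpy(mxGetData(plhs[0]), d_x, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_x);
}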

In my experience such problems tend to be caused by MATLAB rather than CUDA.

Repeatedly copying host data to the GPU and back involves a lot more overhead. For the above-referenced mex function, that overhead is much greater than even the slower processing time we get with the new drivers: the mex function with copying takes ~30ms vs. ~6ms with the data already on the GPU. With the 344 drivers, that drops to ~4ms.

We are keeping the large array data completely in gpuArrays, to avoid that overhead.

Again, the difference is from the update of the driver from v344 to v350+, both using Matlab in exactly the same way.

I am seeing incremental improvements in successive driver releases, but the 355.82 driver still takes ~7.9 ms for a mex function that takes ~4.5 ms with the 344.11 driver on the Titan Black.

Meanwhile our (1/2) K80 with Tesla drivers gives vastly better performance than the Titan X with this application, when the Titan series used to be better than Tesla for float processing.

Why is there no response from Nvidia addressing this issue? Don’t want GTX to compete with Tesla?

The issue can’t be analyzed or addressed unless enough information is provided so that an engineer can reproduce your observations. You’re more likely to get assistance if you make it as easy as possible and very straightforward to reproduce your observations.

My suggestion would be to file a bug report. That bug report should include:

  1. A complete software repro case. That would include all necessary files to recreate your mex function, along with all compile and build steps. You should also include all data setup and test functions you are executing in your matlab script. There shouldn’t be anything that needs to be added or changed.

  2. Your CUDA version, OS/OS version, matlab version, driver version(s), test results including timing in each case as well as how you are doing the timing, and the specific GPUs you tested with. Including the output from nvidia-smi -a on your test machine (if necessary, in each case) will be useful also.

In short, everything needed for a relatively inexperienced person to recreate your observation. Looking over this thread, I don’t see anything approaching that level of clarity. And although you’re welcome to post (more or less) anything you wish in this forum, for best results I would suggest filing a bug report as a priority over providing the details here. You’re welcome to do both, of course, but failing to provide a properly documented bug report disadvantages you, from an attention standpoint.

Note that the code(s) provided for item 1 need not be your exact codes. It’s preferable if you can reduce the codes down to a complete example that just contains the necessary elements to reproduce the observation of interest (timing discrepancy, in this case).

Unfortunately, I have real work to do and don’t have time to spell it out in such detail.

I have boiled it down to a relatively simple mex function which I cannot easily share with you.

The ONLY difference between the 2 times is the driver, which I have repeatedly verified.

Windows 7 64 bit Professional, Matlab 14b, (currently) running CUDA v6.5.19, but also with prior versions.

I time by tic/toc with a 1000x loop after putting the data on the gpu. As I said, currently driver v355.82 compared to v344.11, but also every prior driver for Titan X.

I would expect at a minimum for Nvidia to investigate this discrepancy w/o my having to spend a lot of time sending you repro code. Unless you just don’t care that your latest drivers don’t take advantage of the Titan X (nor Black) hardware.