Filing a bug report is the necessary starting point for investigations into software issues; the next step is independent reproduction of the reported issue by the vendor or open source project providing the software. While filing a bug report requires effort (sometimes even considerable effort!), there is in general no alternative if one’s interest is to have the issue addressed and fixed.
Of course, filing a bug report does not provide any guarantee that an issue will be addressed in the time frame desired and envisioned by the filer. My experience in 30 years of filing bugs for software is that some issues go unfixed for years, both with commercial vendors and open source projects.
Note that these developer forums are a platform for the CUDA community to cooperate in users-helping-users fashion; they are not designed as an official bug reporting channel. NVIDIA provides a bug reporting form, linked directly from the CUDA registered developer site, for this purpose.
As long as your new hardware is slower than your old hardware due to driver issues, we have no reason to purchase it until we can determine that it is actually faster.
So far we see speed degradation, not improvement.
I had high hopes for the Titan X, as every previous iteration was faster than what came before.
I am merely pointing out the lay of the land as far as bug reporting goes, because I have been on both sides of that particular equation (filing bugs as a customer, fixing bugs as a software developer). What I described is a process used by much of the industry, including open source projects, and I would think it is helpful to have realistic expectations about it.
Basing your purchasing decisions on performance measurements specific to your use case is a very good approach, better than relying on other people’s benchmarking efforts or theoretical peak performance numbers.
I have a similar problem after updating the driver from 344.75 to 355.82.
What I found is that cudaMalloc() performance has degraded significantly in the latest driver version compared to previous driver versions. Below are my cudaMalloc() timing results on the same GTX TITAN Black with different driver versions. Apparently cudaMalloc() takes far longer to finish with the latest driver than it does with previous ones.
TestDriverPerformanceIssue.exe
Use device #0: GeForce GTX TITAN Black
Driver version: 344.75 (build: r343_00)
Allocating 1024 MB on GPU takes:
Host clock() based time : 24.000000 ms
Host C++11 std::chrono based time: 24.000000 ms
Host high precision timer based time: 23.945007 ms (timing code from njuffa in a different post)
TestDriverPerformanceIssue.exe
Use device #0: GeForce GTX TITAN Black
Driver version: 350.12 (build: r349_00)
Allocating 1024 MB on GPU takes:
Host clock() based time : 16.000000 ms
Host C++11 std::chrono based time: 16.000000 ms
Host high precision timer based time: 15.653536 ms (timing code from njuffa in a different post)
TestDriverPerformanceIssue.exe
Use device #0: GeForce GTX TITAN Black
Driver version: 355.82 (build: r355_00)
Allocating 1024 MB on GPU takes:
Host clock() based time : 101.000000 ms
Host C++11 std::chrono based time: 101.000000 ms
Host high precision timer based time: 101.253765 ms (timing code from njuffa in a different post)
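For reference, a minimal sketch of the kind of harness that produces measurements like the above (a reconstruction, not the actual test program; the display driver version shown above is not queryable through the CUDA runtime, so this sketch only prints the device name):

#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("Use device #0: %s\n", prop.name);

    // Time a single 1024 MB allocation. With no prior CUDA runtime call,
    // this first cudaMalloc() also pays the context-creation cost.
    size_t bytes = 1024ULL * 1024 * 1024;
    void *d = 0;
    auto t0 = std::chrono::high_resolution_clock::now();
    cudaError_t err = cudaMalloc(&d, bytes);
    auto t1 = std::chrono::high_resolution_clock::now();
    double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    printf("Allocating 1024 MB on GPU takes: %f ms (%s)\n",
           ms, cudaGetErrorString(err));
    cudaFree(d);
    return 0;
}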
@robosmith: does your timing code include cudaMalloc(…) function calls?
I am wondering if anyone gets the same problem with the latest driver or has any idea what the cause is. I am waiting to test on a new card (GTX TITAN X). If I can replicate the problem on the new card and cannot resolve the issue here, I think I will go ahead and file a bug report.
Yes, our code includes some cudaMalloc calls, but that is not the whole issue.
Since our code is a mex function, we mostly just wrap gpuArrays, which are created outside of the timing loop.
Drivers have been slower since at least v350 (the first drivers supporting the Titan X), but you have measured faster performance for cudaMalloc with v350.
There seems to be a generic PCI bus access slowdown: the profiler shows the actual processing for my mex function to be very fast, but the total time, including instruction-issuing overhead, is significantly slower than with older drivers.
I can replicate this long cudaMalloc() using a GTX Titan X with CUDA 7.5 (CUDA 6.5 makes no difference) on Windows 7 x64.
It takes about 100 ms to allocate any amount of device memory, from 1 byte to 1024*1024 bytes. The size makes no difference; there seems to be a new fixed overhead that was not present with the older NVIDIA drivers.
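A minimal sketch of such a size sweep (my reconstruction, not the original test code); note that within a single process only the first CUDA call pays the one-time initialization cost, so a sweep like this helps separate fixed per-call overhead from initialization:

#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

int main()
{
    // Sweep allocation sizes from 1 byte to 1024*1024 bytes.
    // Only the first iteration should include context creation.
    for (size_t bytes = 1; bytes <= 1024 * 1024; bytes *= 1024) {
        void *d = 0;
        auto t0 = std::chrono::high_resolution_clock::now();
        cudaMalloc(&d, bytes);
        auto t1 = std::chrono::high_resolution_clock::now();
        printf("cudaMalloc(%zu bytes): %f ms\n", bytes,
               std::chrono::duration<double, std::milli>(t1 - t0).count());
        cudaFree(d);
    }
    return 0;
}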
Also, I tried using nvidia-smi to change the driver model from WDDM to TCC for the Titan X, which I heard now supports TCC.
I had admin privileges and used this command to switch:
C:\Program Files\NVIDIA Corporation\NVSMI>nvidia-smi -g 1 -dm 1
Unable to set driver model for GPU 0000:02:00.0: Insufficient Permissions
Terminating early due to previous errors.
According to the nvidia-smi documentation, that should be the correct command line, unless I made some mistake or the GTX Titan X does not actually support TCC mode. According to nvidia-smi, device 0 is the GTX 980 and device 1 is the GTX Titan X (which is the opposite of the way CUDA enumerates the GPUs, based on compute capability).
The GTX Titan X is not connected to the display.
Anyone been able to get the GTX Titan X into TCC mode? If so how?
I am wondering whether this long cudaMalloc issue persists when using the TCC driver.
Based on vnngoc156’s data (identical hardware, differing driver versions), I would suggest filing a bug report with NVIDIA. The form is linked from the CUDA registered developer website. While some fluctuation in cudaMalloc() performance between driver versions is probably expected, as feature sets change all the time, a five-fold increase seems suspicious and can hurt application-level performance.
You are correct. I thought just having admin rights on a user account would be enough, but I got it to work by running the command prompt as administrator.
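For anyone else hitting this: the same command succeeds from a command prompt launched via “Run as administrator” (plain admin rights on the logged-in account are not enough), and the driver-model change only takes effect after a reboot:

C:\Program Files\NVIDIA Corporation\NVSMI>nvidia-smi -g 1 -dm 1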
I was able to switch the Titan X to TCC mode, which made no difference in this cudaMalloc test. 100 ms for a 1-byte cudaMalloc seems excessive.
Don’t know if this data point is worth anything, but I have tested with a K20 on the 353.90 driver, using a modified version of txbob’s test application to run on Windows (Server 2012). Find the code below. The result is also 7 us for the second allocation.
So, this is not a Windows-for-all-graphics-cards issue.
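In outline (a minimal sketch of the measurement rather than the full application), the test times two successive small allocations; the first includes the CUDA runtime/context initialization, the second reflects the steady-state cost of cudaMalloc() itself:

#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

// Returns the wall-clock time of one cudaMalloc() call in microseconds.
static double timedMallocUs(void **p, size_t bytes)
{
    auto t0 = std::chrono::high_resolution_clock::now();
    cudaMalloc(p, bytes);
    auto t1 = std::chrono::high_resolution_clock::now();
    return std::chrono::duration<double, std::micro>(t1 - t0).count();
}

int main()
{
    void *a = 0, *b = 0;
    // First call pays lazy initialization; second shows cudaMalloc() itself.
    printf("first  cudaMalloc: %.1f us\n", timedMallocUs(&a, 1));
    printf("second cudaMalloc: %.1f us\n", timedMallocUs(&b, 1));
    cudaFree(a);
    cudaFree(b);
    return 0;
}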
That’s right; we are getting 10x better performance on some mex functions with half a K80 than with a Titan X. The best Titan card used to be faster than the best Tesla card for single-precision math.
Unfortunately, Nvidia has not addressed the driver issues for CUDA on Titan since the X came out.
It’s the second allocation we care about; it’s 4 us in your case.
The first allocation (or, more generally, the first significant use of the CUDA runtime API) is always going to be a long one: the CUDA runtime initializes lazily, so the initialization time tends to show up in the first CUDA runtime API call in your application.
The fact that the first such usage takes extra time is not, in and of itself, a bug; it is expected behavior, and I don’t believe there has been much change in (conceptual) behavior in recent CUDA versions (although the nuances of lazy initialization may certainly have changed somewhat, and the exact timing of each test case will probably differ).
Once the initialization “cost” is paid, then subsequent runtime API calls should run at “approximately full speed”.
Right, my 7 us was for the second allocation. Run a cudaSetDevice(0) (or whatever the correct device id is) before the first allocation, and they should both be ~5 us then.
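That is, something along these lines (a fragment, reusing the timedMallocUs() helper from the listing above):

cudaSetDevice(0);   // forces context creation up front, outside the timing
void *a = 0, *b = 0;
printf("first  cudaMalloc: %.1f us\n", timedMallocUs(&a, 1));   // now ~5 us
printf("second cudaMalloc: %.1f us\n", timedMallocUs(&b, 1));   // ~5 us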
So then, 5us on the Titan X. Where was this original 100ms figure coming from? Or were you only looking at the first allocation in your original measurement?