Filing a bug report is the necessary starting point for investigations into software issues; the next step is independent reproduction of the reported issue by the vendor or open source project providing the software. While filing a bug report requires effort (sometimes even considerable effort!), there is in general no alternative if one’s interest is to have the issue addressed and fixed.
Of course, filing a bug report does not provide any guarantee that an issue will be addressed in the time frame desired and envisioned by the filer. My experience in 30 years of filing bugs for software is that some issues go unfixed for years, both with commercial vendors and open source projects.
Note that these developer forums are a platform for the CUDA community to cooperate in users-helping-users fashion; they are not designed as an official bug reporting channel. NVIDIA provides a bug reporting form, linked directly from the CUDA registered developer site, for this purpose.
As long as your new hardware is slower than your old hardware due to driver issues, we have no reason to purchase it until we can determine that it is actually faster.
So far we see speed degradation, not improvement.
I had high hopes for the Titan X, as every previous iteration was faster than what came before.
I am merely pointing out the lay of the land as far as bug reporting goes, because I have been on both sides of that particular equation (filing bugs as a customer, fixing bugs as a software developer). What I described is a process used by much of the industry, including open source projects, and I would think it is helpful to have realistic expectations about it.
Basing your purchasing decisions on performance measurements specific to your use case is a very good approach, better than relying on other people’s benchmarking efforts or theoretical peak performance numbers.
I have a similar problem after updating the driver from 344.75 to 355.82.
What I found is that cudaMalloc() performance has degraded significantly in the latest driver version compared to previous driver versions. Below are my cudaMalloc() timing results on the same GTX TITAN Black with different driver versions. Apparently cudaMalloc() takes far longer to finish with the latest driver than it does with previous ones.
TestDriverPerformanceIssue.exe
Use device #0: GeForce GTX TITAN Black
Driver version: 344.75 (build: r343_00)
Allocating 1024 MB on GPU takes:
Host clock() based time : 24.000000 ms
Host C++11 std::chrono based time: 24.000000 ms
Host high precision timer based time: 23.945007 ms (timing code from njuffa in a different post)
TestDriverPerformanceIssue.exe
Use device #0: GeForce GTX TITAN Black
Driver version: 350.12 (build: r349_00)
Allocating 1024 MB on GPU takes:
Host clock() based time : 16.000000 ms
Host C++11 std::chrono based time: 16.000000 ms
Host high precision timer based time: 15.653536 ms (timing code from njuffa in a different post)
TestDriverPerformanceIssue.exe
Use device #0: GeForce GTX TITAN Black
Driver version: 355.82 (build: r355_00)
Allocating 1024 MB on GPU takes:
Host clock() based time : 101.000000 ms
Host C++11 std::chrono based time: 101.000000 ms
Host high precision timer based time: 101.253765 ms (timing code from njuffa in a different post)
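For reference, a minimal sketch of the kind of harness that produces measurements like the above (a reconstruction, not the actual test program; the display driver version shown above is not queryable through the CUDA runtime, so this sketch only prints the device name):

#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("Use device #0: %s\n", prop.name);

    // Time a single 1024 MB allocation. With no prior CUDA runtime call,
    // this first cudaMalloc() also pays the context-creation cost.
    size_t bytes = 1024ULL * 1024 * 1024;
    void *d = 0;
    auto t0 = std::chrono::high_resolution_clock::now();
    cudaError_t err = cudaMalloc(&d, bytes);
    auto t1 = std::chrono::high_resolution_clock::now();
    double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    printf("Allocating 1024 MB on GPU takes: %f ms (%s)\n",
           ms, cudaGetErrorString(err));
    cudaFree(d);
    return 0;
}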
@robosmith: does your timing code include cudaMalloc(…) function calls?
I am wondering if anyone gets the same problem with the latest driver or has any idea what the cause is. I am waiting to test on a new card (GTX TITAN X). If I can replicate the problem on the new card and cannot resolve the issue here, I think I will go ahead and file a bug report.
Yes, our code includes some cudaMalloc calls, but that is not the whole issue.
Since our code is a mex function, we mostly just wrap gpuArrays, which are created outside of the timing loop.
Drivers have been slower since at least v350 (the first drivers supporting the Titan X), but you have measured faster performance for cudaMalloc with v350.
There seems to be a generic PCI bus access slowdown: the profiler shows the actual processing for my mex function to be very fast, but the total time, including instruction-issuing overhead, is significantly slower than with older drivers.
I can replicate this long cudaMalloc() using a GTX Titan X with CUDA 7.5 (CUDA 6.5 makes no difference) on Windows 7 x64.
It takes about 100 ms to allocate any amount of device memory, from 1 byte to 1024*1024 bytes. The size makes no difference; there seems to be a new fixed overhead that was not present with the older NVIDIA drivers.
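A minimal sketch of such a size sweep (my reconstruction, not the original test code); note that within a single process only the first CUDA call pays the one-time initialization cost, so a sweep like this helps separate fixed per-call overhead from initialization:

#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

int main()
{
    // Sweep allocation sizes from 1 byte to 1024*1024 bytes.
    // Only the first iteration should include context creation.
    for (size_t bytes = 1; bytes <= 1024 * 1024; bytes *= 1024) {
        void *d = 0;
        auto t0 = std::chrono::high_resolution_clock::now();
        cudaMalloc(&d, bytes);
        auto t1 = std::chrono::high_resolution_clock::now();
        printf("cudaMalloc(%zu bytes): %f ms\n", bytes,
               std::chrono::duration<double, std::milli>(t1 - t0).count());
        cudaFree(d);
    }
    return 0;
}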
Also, I tried using nvidia-smi to change the driver model from WDDM to TCC for the Titan X, which I heard now supports TCC.
I had admin privileges and used this command to switch:
C:\Program Files\NVIDIA Corporation\NVSMI>nvidia-smi -g 1 -dm 1
Unable to set driver model for GPU 0000:02:00.0: Insufficient Permissions
Terminating early due to previous errors.
According to the nvidia-smi documentation, that should be the correct command line, unless I made some mistake or the GTX Titan X does not actually support TCC mode. According to nvidia-smi, device 0 is the GTX 980 and device 1 is the GTX Titan X (which is the opposite of the way CUDA enumerates the GPUs, based on compute capability).
The GTX Titan X is not connected to the display.
Anyone been able to get the GTX Titan X into TCC mode? If so how?
I am wondering whether this long cudaMalloc issue persists when using the TCC driver.
Based on vnngoc156’s data (identical hardware, differing driver versions), I would suggest filing a bug report with NVIDIA. The form is linked from the CUDA registered developer website. While some fluctuation in cudaMalloc() performance between driver versions is probably expected, as feature sets change all the time, a five-fold increase seems suspicious and can hurt application-level performance.
You are correct. I thought just having admin rights on a user account would be enough, but I got it to work by running the command prompt as administrator.
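For anyone else hitting this: the same command succeeds from a command prompt launched via “Run as administrator” (plain admin rights on the logged-in account are not enough), and the driver-model change only takes effect after a reboot:

C:\Program Files\NVIDIA Corporation\NVSMI>nvidia-smi -g 1 -dm 1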
I was able to switch the Titan X to TCC mode, which made no difference in this cudaMalloc test. 100 ms for a 1-byte cudaMalloc seems excessive.
Don’t know if this data point is worth anything, but I have tested with a K20 on the 353.90 driver, using a modified version of txbob’s test application to run on Windows (Server 2012). Find the code below. The result is also 7 us for the second allocation.
So, this is not a Windows-for-all-graphics-cards issue.
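In outline (a minimal sketch of the measurement rather than the full application), the test times two successive small allocations; the first includes the CUDA runtime/context initialization, the second reflects the steady-state cost of cudaMalloc() itself:

#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

// Returns the wall-clock time of one cudaMalloc() call in microseconds.
static double timedMallocUs(void **p, size_t bytes)
{
    auto t0 = std::chrono::high_resolution_clock::now();
    cudaMalloc(p, bytes);
    auto t1 = std::chrono::high_resolution_clock::now();
    return std::chrono::duration<double, std::micro>(t1 - t0).count();
}

int main()
{
    void *a = 0, *b = 0;
    // First call pays lazy initialization; second shows cudaMalloc() itself.
    printf("first  cudaMalloc: %.1f us\n", timedMallocUs(&a, 1));
    printf("second cudaMalloc: %.1f us\n", timedMallocUs(&b, 1));
    cudaFree(a);
    cudaFree(b);
    return 0;
}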
That’s right; we are getting 10x better performance on some mex functions with half a K80 than with a Titan X. The best Titan card used to be faster than the best Tesla card for single-precision math.
Unfortunately, Nvidia has not addressed the driver issues for CUDA on Titan since the X came out.
It’s the second allocation we care about; it’s 4 us in your case.
The first allocation (or, more generally, the first significant use of the CUDA runtime API) is always going to be a long one: the CUDA runtime initializes lazily, so the initialization time tends to show up in the first CUDA runtime API call in your application.
The fact that the first such usage takes extra time is not, in and of itself, a bug; it is expected behavior, and I don’t believe there has been much change in (conceptual) behavior in recent CUDA versions (although the nuances of lazy initialization may certainly have changed somewhat, and the exact timing of each test case will probably differ).
Once the initialization “cost” is paid, then subsequent runtime API calls should run at “approximately full speed”.
Right, my 7 us was for the second allocation. Run a cudaSetDevice(0) (or whatever the correct device id is) before the first allocation, and they should both be ~5 us then.
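That is, something along these lines (a fragment, reusing the timedMallocUs() helper from the listing above):

cudaSetDevice(0);   // forces context creation up front, outside the timing
void *a = 0, *b = 0;
printf("first  cudaMalloc: %.1f us\n", timedMallocUs(&a, 1));   // now ~5 us
printf("second cudaMalloc: %.1f us\n", timedMallocUs(&b, 1));   // ~5 us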
So then, 5us on the Titan X. Where was this original 100ms figure coming from? Or were you only looking at the first allocation in your original measurement?