Titan X (with latest drivers) slower than Titan Black with older drivers

njuffa · September 24, 2015, 11:42pm

The canonical way to trigger CUDA context initialization used to be a call to cudaFree(0), and I don’t think that has changed. Performance measurements of CUDA APIs should not include context initialization time. I would have thought that this is common knowledge eight years into CUDA’s public existence, but maybe not.
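
To illustrate why the first call has to be kept out of the measurement, here is a minimal host-only C++ sketch of the effect. It does not touch the CUDA runtime: `FakeRuntime` and `elapsed_us` are hypothetical names invented for this example, and the 50 ms setup cost is an arbitrary stand-in for context initialization, not a measured figure.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <thread>

// Hypothetical stand-in for a lazily initialized runtime (not a CUDA API):
// the first call pays a one-time setup cost, later calls do not.
struct FakeRuntime {
    bool initialized = false;

    void call() {
        if (!initialized) {
            // Simulated one-time initialization cost (50 ms is an arbitrary choice).
            std::this_thread::sleep_for(std::chrono::milliseconds(50));
            initialized = true;
        }
        assert(initialized);  // every call leaves the runtime initialized
    }
};

// Returns the wall-clock time of a single invocation of `f`, in microseconds.
template <typename F>
std::int64_t elapsed_us(F&& f) {
    const auto start = std::chrono::steady_clock::now();
    f();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}
```

Timing `rt.call()` twice on a fresh `FakeRuntime rt` with `elapsed_us` gives a first result dominated by the setup cost and a second one that reflects the call itself. In a real CUDA program the untimed warm-up would be the cudaFree(0) call mentioned above, issued before the first timed API call.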

NVIDIA may want to consider adding a sticky post to these forums pointing this out.

CudaaduC · September 25, 2015, 2:00am

robosmith:

Ailleur:
Don’t know if this data point is worth anything, but I have tested with a K20 on 353.90 drivers with a modified version of txbob’s test application to run on Windows (server 2012). Find the code below. The result is also 7us for the second allocation.
So, this is not a Windows-for-all-graphics-cards issue.
#include <cstdio>    // fprintf
#include <cstdlib>   // exit
#include <iostream>

#include <windows.h>        // QueryPerformanceCounter, QueryPerformanceFrequency
#include <cuda_runtime.h>   // cudaMalloc, cudaFree, cudaGetLastError

#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)

int main()
{
    LARGE_INTEGER StartingTime, EndingTime, ElapsedMicrosecond1, ElapsedMicrosecond2;
    LARGE_INTEGER Frequency;
    char *a, *b;

    QueryPerformanceFrequency(&Frequency);

    // First allocation: this timing includes CUDA context initialization.
    QueryPerformanceCounter(&StartingTime);
    cudaMalloc(&a, 1);
    QueryPerformanceCounter(&EndingTime);
    cudaCheckErrors("first cudaMalloc failed");  // checked outside the timed region

    // Convert counter ticks to microseconds.
    ElapsedMicrosecond1.QuadPart = EndingTime.QuadPart - StartingTime.QuadPart;
    ElapsedMicrosecond1.QuadPart *= 1000000;
    ElapsedMicrosecond1.QuadPart /= Frequency.QuadPart;

    // Second allocation: the context already exists, so this is the cost of cudaMalloc itself.
    QueryPerformanceCounter(&StartingTime);
    cudaMalloc(&b, 1);
    QueryPerformanceCounter(&EndingTime);
    cudaCheckErrors("second cudaMalloc failed");

    ElapsedMicrosecond2.QuadPart = EndingTime.QuadPart - StartingTime.QuadPart;
    ElapsedMicrosecond2.QuadPart *= 1000000;
    ElapsedMicrosecond2.QuadPart /= Frequency.QuadPart;

    std::cout << "t1: " << ElapsedMicrosecond1.QuadPart << "us t2: " << ElapsedMicrosecond2.QuadPart << "us" << std::endl;

    cudaFree(a);
    cudaFree(b);
    return 0;
}
That’s right, we are getting 10x better performance on some mex functions with 1/2 K80 than with Titan X. The best Titan card used to be faster than the best Tesla card for single-precision math.

Unfortunately, Nvidia has not addressed the driver issues for CUDA on Titan since the X came out.

Have you tried using the TCC driver with the Titan X? If so, did it help with your bottleneck?

robosmith · September 25, 2015, 5:23pm

I didn’t even know about TCC on Titan X until a couple of days ago.

I am currently configured to use v344 drivers with my old Titan Black since it is faster, but I will try TCC with the Titan X soon.

robosmith · October 13, 2015, 5:40pm

Just installed the latest and greatest v358.50 driver.

Of three mex functions, two are within 10% of their v344 timings on the Titan Black (one is 10% slower).

The third, which is the most I/O-intensive (largest number of inputs), is still about 30% slower (7.1 ms vs. 5.5 ms) on the Titan Black with the v358 driver. This timing is taken AFTER the data is transferred to the GPU. The ONLY difference is the driver.

Now I will re-install the Titan X and see if that offers any improvement.

robosmith · October 13, 2015, 6:34pm

With Titan X v358, one function is twice as fast as Titan Black v344.
One is 20% faster.

And the problem child is still 10% slower (vs. 30% slower on Titan Black v358).

So while the driver is getting better, it is still slower for functions that have a large number of inputs to the kernel.

robosmith · October 13, 2015, 7:05pm

And our entire algorithm, running in Matlab with a mix of native gpuArray functions and mex CUDA functions, takes 2.5x longer on the Titan X with v358 than it did on the Titan Black with v344.

So it seems GeForce driver development still has a long way to go for Titan X.
