cudaGetDeviceProperties executes very slowly on GTX 980

Hello all,

I’ve just received a new GTX 980, and while working with it I noticed that cudaGetDeviceProperties is painfully slow. Right now, the function needs roughly 0.25 ms to execute, while on the old system it needs only 0.025 ms. Can anyone explain why it is ten times slower?

Here is some information about the systems and the code.

Specs of the current system
Dell Precision T7600 Workstation
CPU: Intel Xeon E5-2630 @ 2.30 GHz
GPU: NVIDIA GeForce GTX 980
RAM: 4x8 GB 1333 MHz ECC RDIMM in 4 channel mode
HDD: 255 GB SSD
OS: Ubuntu 14.04
Driver Version: 343.22
CUDA Version: 6.5 with support for GTX 9xx GPUs

Specs of the old system
CPU: Intel Core i7-2600S @ 2.8 GHz
GPU: NVIDIA GeForce GTX 680
RAM: 4x4 GB 1333 MHz in 2 channel mode
HDD: 255 GB SSD
OS: Ubuntu 12.04
Driver Version: 331.62
CUDA Version: 6.0
(The CPU and memory of the old system are overclocked, though I don’t know any details since I just work with these workstations)

Host code

#include <stdio.h>
#include <stdlib.h>

// host main function
int main(void) {

	// define number of runs
	int runs = 50;

	// create events
	cudaEvent_t start, stop;
	cudaEventCreate(&start);
	cudaEventCreate(&stop);
	float time;

	// get device props
	cudaEventRecord(start, 0);

	for (int run = 0; run < runs; run++)
	{
		cudaDeviceProp prop;
		int device;
		cudaGetDevice(&device);
		cudaGetDeviceProperties(&prop, device);
	}
	cudaEventRecord(stop, 0);
	cudaEventSynchronize(stop);
	cudaEventElapsedTime(&time, start, stop);
	printf("Time to get device properties: %5.5f ms.\n", time/runs);

	return 0;
}

Compiled with: nvcc -v -O2 GetDeviceProp.cu -o getProps

Can the difference in execution time be explained simply by the overclocked components of the old system, or is there something else to consider?

Thanks for your time. Any help is much appreciated.

Best regards,

Trammy

A few questions for clarification:

(1) What is the operating system on the two respective systems?
(2) Are both systems using the same CUDA toolkit and CUDA driver versions?
(3) How is the (sub-millisecond) execution time of cudaGetDeviceProperties() relevant to application performance, given that this is a function usually called once at the start of an application? The execution time for cudaGetDeviceProperties() is probably well below the CUDA context initialization time.

Thanks for your reply. To answer your questions:

(1) The current system uses Ubuntu 14.04, the old system Ubuntu 12.04

(2) Both systems are using the CUDA Toolkit 6.5. On the current system the version with the support for the GTX 9xx GPUs is installed.
The old system runs with driver version 331.62 and the current one with driver version 343.22.

(3) Well, I was using the function getNumBlocksAndThreads from the CUDA reduction sample in my code, and it was called a couple of times. It uses prop.maxGridSize and prop.maxThreadsPerBlock, and the device properties are fetched on every call. I never gave it much thought, and I now realize that it is not very smart to call cudaGetDeviceProperties repeatedly in the first place. My program needs to finish in <30 ms, so the cumulative time of this function did matter. I’ve written a workaround and everything is fine, so my question is more out of curiosity.
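The workaround itself is not shown in the thread. As a minimal sketch of one possible approach, the properties can be queried once and cached, so that repeated callers only pay for the first query. The helper name getCachedDeviceProp is hypothetical, and the sketch assumes the current device does not change while the program runs.

```cuda
#include <stddef.h>
#include <cuda_runtime.h>

// Hypothetical helper (not part of the CUDA samples): queries the properties
// of the current device on the first call and returns the cached copy on
// every later call. Assumes the current device never changes afterwards.
// Returns NULL if the query fails; a failed query is retried on the next call.
const cudaDeviceProp *getCachedDeviceProp(void)
{
    static cudaDeviceProp prop;
    static bool valid = false;

    if (!valid) {
        int device = 0;
        if (cudaGetDevice(&device) != cudaSuccess ||
            cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
            return NULL;  // caller must handle the failure
        }
        valid = true;
    }
    return &prop;
}
```

A function such as getNumBlocksAndThreads could then read maxGridSize and maxThreadsPerBlock from the returned pointer instead of calling cudaGetDeviceProperties each time. When only a single value is needed, cudaDeviceGetAttribute is another option worth measuring.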

Cheers,

Trammy

Your assumption seems to be that the type of GPU leads to a different execution time. But given that different OS versions and different driver versions are involved, one cannot conclude that. One would have to do controlled experiments, in which only a single variable is changed at a time. I would think that even CPU performance is one of those variables, as the driver code executes on the CPU.

You may also want to rethink the measurement methodology. Often the first execution of a piece of code incurs additional “penalties”, anything from cache misses while the code is loaded into the CPU’s instruction cache to one-time initialization overhead for device hardware. A better methodology is usually to report the best time out of N runs, where N >= 2. This is the methodology used by the well-known STREAM benchmark, for example, which uses N=10 by default.
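The best-of-N methodology could be sketched roughly as follows. This is only an illustration: the helper name bestQueryTimeMs is hypothetical, and it uses a host-side std::chrono timer on the assumption that cudaGetDeviceProperties runs synchronously on the CPU. Older toolkits may need -std=c++11 for the chrono header.

```cuda
#include <chrono>
#include <cuda_runtime.h>

// Hypothetical helper: calls cudaGetDeviceProperties() n times and stores the
// fastest wall-clock time (in ms) in *bestMs. It stops at the first failing
// call and returns its status, so a failed call is never reported as a fast
// one. If n < 1, no call is timed and *bestMs is set to -1.
cudaError_t bestQueryTimeMs(int n, double *bestMs)
{
    typedef std::chrono::steady_clock Clock;

    int device = 0;
    cudaError_t status = cudaGetDevice(&device);
    if (status != cudaSuccess) return status;

    double best = -1.0;
    for (int i = 0; i < n; i++) {
        cudaDeviceProp prop;
        Clock::time_point t0 = Clock::now();
        status = cudaGetDeviceProperties(&prop, device);
        Clock::time_point t1 = Clock::now();
        if (status != cudaSuccess) return status;

        double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
        if (best < 0.0 || ms < best) best = ms;  // keep the fastest run
    }
    *bestMs = best;
    return cudaSuccess;
}
```

Calling bestQueryTimeMs(10, &ms) on both machines would give numbers that exclude first-call overhead, which the averaged loop in the original program folds into its result.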

CUDA 6.5 is not compatible with driver 331.62.

Thanks again for your replies.

You are right, I checked and CUDA 6.0 is installed on the old system. My mistake.

I guess it won’t be possible for me to find out what exactly triggered the high execution times since I can’t change the hardware of the workstations. But thanks for your input. Luckily it was not hard to rewrite the code.

txbob raises an important point. If the driver version 331.62 is insufficient to run CUDA 6.5, it stands to reason that the execution time of cudaGetDeviceProperties() was lower on the older system because the API call failed and returned right away. A best practice is to check the return status of all API calls.
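A minimal sketch of such status checking, applied to the two calls from the original program, might look like this. The helper names cudaOk and queryProps are hypothetical; only cudaGetDevice, cudaGetDeviceProperties, and cudaGetErrorString are CUDA runtime functions.

```cuda
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical helper: reports a failed runtime call on stderr and tells the
// caller whether it is safe to continue.
bool cudaOk(cudaError_t err, const char *what)
{
    if (err == cudaSuccess) return true;
    fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
    return false;
}

// The query from the original program with every call checked. Returns false
// as soon as one call fails, so *prop is only valid when the result is true.
bool queryProps(cudaDeviceProp *prop)
{
    int device = 0;
    return cudaOk(cudaGetDevice(&device), "cudaGetDevice") &&
           cudaOk(cudaGetDeviceProperties(prop, device), "cudaGetDeviceProperties");
}
```

With a check like this in the timing loop, a call that fails immediately would show up as an error message rather than as an unexpectedly short execution time.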