cuInit or cudaSetDevice is horribly slow on Kepler K20c, fast on Fermi S2050

moleres · December 14, 2013, 12:38am

Hi,

The time for cuInit() or cudaSetDevice() to return on our Tesla K20c Kepler GPUs is about 10 to 20 (!) times slower than on our Tesla S2050 Fermi GPUs. Yes, we have persistence mode enabled, have tried the nvidia-smi loop thing, and the latest driver. Does anyone have an idea as to why it’s so slow? Is this a driver issue, an OS issue, or could there be some hardware issue at play here? It’s definitely a degradation from our previous platform. The bandwidth test in the sdk shows much better performance in all areas with the Keplers, but initialization is horribly [url][/url]slow.

I took the sdk’s deviceQueryDrv and modified it simply to time the cuInit() call as shown below, then wrote a simple script to spawn 1 to N of these apps (also below). Here’s the output:

On Fermi:
Starting…
cuInit time = 0.202069

On Kepler:
Starting…
cuInit time = 1.753236

We have many multi-threaded applications (maybe as many as 20) that may start up simultaneously (roughly), we can tolerate some delay, but not that much. Again, this works fine on our Fermi platform.
If you run N instance, say 10, the Fermi stays relatively fast, whereas the Kepler simply bogs down – all of the instances get slow, not progressively slow like the Fermis.

(10 apps started at roughly the same time):
On Fermi:
Starting…
cuInit time = 0.200683
Starting…
Starting…
Starting…
Starting…
Starting…
Starting…
Starting…
Starting…
Starting…
cuInit time = 0.200975
cuInit time = 0.557699
cuInit time = 0.941136
cuInit time = 1.026017
cuInit time = 1.166268
cuInit time = 1.186676
cuInit time = 1.287552
cuInit time = 1.319250
cuInit time = 1.470359

On Kepler:
Starting…
Starting…
Starting…
Starting…
Starting…
Starting…
Starting…
Starting…
Starting…
Starting…
cuInit time = 18.169408
cuInit time = 19.190488
cuInit time = 19.949257
cuInit time = 20.858861
cuInit time = 21.737321
cuInit time = 22.568534
cuInit time = 23.402647
cuInit time = 24.295237
cuInit time = 24.289321
cuInit time = 24.282347

Egad! Look how slow the Kepler is! Adding “strace –tt” to the front of this app shows that most of the time is spent calling ioctl, to the tune of 9000+ times for a single call to cuInit() (yes, over 92,000 times if you spawn 10 instances).

I see this line over 9000 times in the strace output:
ioctl(3, 0xc0a0464a, 0x7fff76c93870) = 0[b][/b]

What gives?

Some HW information:
Fermi hardware:

X86_64 running RHEL 5.5
32 cores, 64GB memory
CUDA 4.0 (I think)
Nvidia-smi –q snippet:
Driver Version : 270.41.19
Attached GPUs : 4
GPU 0:A:0
Product Name : Tesla T20 Processor
Display Mode : Disabled
Persistence Mode : Enabled

Kepler hardware:

X86_64 running RHEL 6.3
32 cores, 256GB memory
CUDA 5.0
Nvidia-smi –q snippet:
Driver Version : 331.20
Attached GPUs : 4
GPU 0000:89:00.0
Product Name : Tesla K20c
Display Mode : Disabled
Display Active : Disabled
Persistence Mode : Enabled
Accounting Mode : Disabled

I’ve attached the sample program and run script - can someone try it on their Kepler/Fermi? I can’t generate an nvidia-bug-report since my sysadmin isn’t here and I don’t have root permission.

Thanks,
Rick

Here’s the simple app (use the deviceQueryDrv makefile in your CUDA 5.0 sdk):
// includes, system
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>

#include <cuda.h>
#include <helper_cuda_drvapi.h>
#include <drvapi_error_string.h>

////////////////////////////////////////////////////////////////////////////////
// Program main
////////////////////////////////////////////////////////////////////////////////
int
main(int argc, char **argv)
{
CUdevice dev;
struct timeval timeStart, timeEnd;

fprintf(stdout,"Starting...\n");
fflush(stdout);

(void)gettimeofday(&timeStart, NULL);
CUresult error_id = cuInit(0);
(void)gettimeofday(&timeEnd, NULL);
fprintf(stdout,"cuInit time = %lf\n", ((timeEnd.tv_sec+(timeEnd.tv_usec*0.000001)) - (timeStart.tv_sec +(timeStart.tv_usec*0.000001))));
fflush(stdout);

if (error_id != CUDA_SUCCESS)
{
    printf("cuInit(0) returned %d\n-> %s\n", error_id, getCudaDrvErrorString(error_id));
    exit(EXIT_FAILURE);
}

exit(EXIT_SUCCESS);

}

Here’s the simple shell script to spawn 1 to N:
#!/bin/bash

Script to spawn 1 to N of the above apps

num=$1

if [ “$num” != “” ] ; then
while [ $num -gt 0 ] ; do
./deviceInitTimeDrv &
num=expr $num - 1
done
else
./deviceInitTimeDrv
fi

mfatica · December 14, 2013, 1:20am

The issue is not related to Fermi or Kepler but to the amount of system memory.
We are working to reduce the start time, if would be helpful if you could file a bug to help us prioritize the work.

moleres · December 16, 2013, 11:22pm

Thanks for the reply. Yes, I’ll file a bug. So this would have been a problem on my old platform if I had 256GB of system memory there as well?

If you’re aware of any workarounds besides pulling DIMMS on the system please let me know. Without a way around this the only other solution I can think of (granted I haven’t thought too hard about this hoping it was a simple OS or driver/GPU tunable) would be to redesign our software to have some centralized GPU manager that takes that initialization hit, and funnel all image streams as they come and go into it - which unfortunately is not within the scope of our time or budget.

sBc-Random · December 16, 2013, 11:31pm

I noticed the same thing in my code,

This seems to speed up the process (note each thread only uses 1 gpu, not sure how you’d generalize this):

n = omp_get_thread_num();
	int dev;
    CUDA(cudaSetDevice(n));
	CUDA(cudaDeviceReset());
	CUDA(cudaSetDevice(n));
	CUDA(cudaGetDevice(&dev));
	CUDA(cudaDeviceSynchronize());

This isn’t much help to you obviously, but it’s interesting that this fixes the problem, don’t you think?

NYCPat · June 17, 2014, 2:50pm

Hello - I’m also seeing this problem, which makes our code to query the free memory on a multi-kepler machine exceptionally slow. Does anyone know the progress of this, or if there’s somewhere else to look to track this bug? Thanks.

Topic		Replies	Views
Noob Alert: Tesla K20 slower than GTX 580? CUDA Programming and Performance	24	9149	November 3, 2013
why 2.9 seconds to start tesla K20 CUDA Programming and Performance	12	1227	March 4, 2018
really slow cudaGetDeviceCount() several seconds to complete a cudaGetDeviceCount() call CUDA Programming and Performance	3	1214	May 18, 2011
cudamalloc slow on Kepler K10 CUDA Programming and Performance	9	1104	October 28, 2014
Tesla C2050 slower than GeForce 8800? CUDA Programming and Performance	14	20905	April 20, 2011
Peaks and slow performance with cudaDeviceSynchronize CUDA Programming and Performance cuda	6	2776	November 17, 2021
slow performance cusparse spmv CUDA Programming and Performance	14	2880	December 9, 2013
cudaSetDevice() time, so weird! cudaSetDevice() take a long time. CUDA Programming and Performance	10	4590	August 2, 2010
Performance drop when changing Toolkit/SDK/driver from 4.2 ---> 5.0 CUDA Programming and Performance	14	2803	November 28, 2013
Why kernel calculate speed got slower after waiting for a while? CUDA Programming and Performance cuda	9	1763	July 19, 2022

cuInit or cudaSetDevice is horribly slow on Kepler K20c, fast on Fermi S2050

Script to spawn 1 to N of the above apps

Related topics