cuInit or cudaSetDevice is horribly slow on Kepler K20c, fast on Fermi S2050

Hi,

The time for cuInit() or cudaSetDevice() to return on our Tesla K20c Kepler GPUs is about 10 to 20 (!) times slower than on our Tesla S2050 Fermi GPUs. Yes, we have persistence mode enabled, have tried the nvidia-smi loop thing, and the latest driver. Does anyone have an idea as to why it’s so slow? Is this a driver issue, an OS issue, or could there be some hardware issue at play here? It’s definitely a degradation from our previous platform. The bandwidth test in the sdk shows much better performance in all areas with the Keplers, but initialization is horribly [url][/url]slow.

I took the sdk’s deviceQueryDrv and modified it simply to time the cuInit() call as shown below, then wrote a simple script to spawn 1 to N of these apps (also below). Here’s the output:

On Fermi:
Starting…
cuInit time = 0.202069

On Kepler:
Starting…
cuInit time = 1.753236

We have many multi-threaded applications (maybe as many as 20) that may start up simultaneously (roughly), we can tolerate some delay, but not that much. Again, this works fine on our Fermi platform.
If you run N instance, say 10, the Fermi stays relatively fast, whereas the Kepler simply bogs down – all of the instances get slow, not progressively slow like the Fermis.

(10 apps started at roughly the same time):
On Fermi:
Starting…
cuInit time = 0.200683
Starting…
Starting…
Starting…
Starting…
Starting…
Starting…
Starting…
Starting…
Starting…
cuInit time = 0.200975
cuInit time = 0.557699
cuInit time = 0.941136
cuInit time = 1.026017
cuInit time = 1.166268
cuInit time = 1.186676
cuInit time = 1.287552
cuInit time = 1.319250
cuInit time = 1.470359

On Kepler:
Starting…
Starting…
Starting…
Starting…
Starting…
Starting…
Starting…
Starting…
Starting…
Starting…
cuInit time = 18.169408
cuInit time = 19.190488
cuInit time = 19.949257
cuInit time = 20.858861
cuInit time = 21.737321
cuInit time = 22.568534
cuInit time = 23.402647
cuInit time = 24.295237
cuInit time = 24.289321
cuInit time = 24.282347

Egad! Look how slow the Kepler is! Adding “strace –tt” to the front of this app shows that most of the time is spent calling ioctl, to the tune of 9000+ times for a single call to cuInit() (yes, over 92,000 times if you spawn 10 instances).

I see this line over 9000 times in the strace output:
ioctl(3, 0xc0a0464a, 0x7fff76c93870) = 0[b][/b]

What gives?

Some HW information:
Fermi hardware:

  • X86_64 running RHEL 5.5
  • 32 cores, 64GB memory
  • CUDA 4.0 (I think)
  • Nvidia-smi –q snippet:
    Driver Version : 270.41.19
    Attached GPUs : 4
    GPU 0:A:0
    Product Name : Tesla T20 Processor
    Display Mode : Disabled
    Persistence Mode : Enabled

Kepler hardware:

  • X86_64 running RHEL 6.3
  • 32 cores, 256GB memory
  • CUDA 5.0
  • Nvidia-smi –q snippet:
    Driver Version : 331.20
    Attached GPUs : 4
    GPU 0000:89:00.0
    Product Name : Tesla K20c
    Display Mode : Disabled
    Display Active : Disabled
    Persistence Mode : Enabled
    Accounting Mode : Disabled

I’ve attached the sample program and run script - can someone try it on their Kepler/Fermi? I can’t generate an nvidia-bug-report since my sysadmin isn’t here and I don’t have root permission.

Thanks,
Rick

Here’s the simple app (use the deviceQueryDrv makefile in your CUDA 5.0 sdk):
// includes, system
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>

#include <cuda.h>
#include <helper_cuda_drvapi.h>
#include <drvapi_error_string.h>

////////////////////////////////////////////////////////////////////////////////
// Program main
////////////////////////////////////////////////////////////////////////////////
int
main(int argc, char **argv)
{
CUdevice dev;
struct timeval timeStart, timeEnd;

fprintf(stdout,"Starting...\n");
fflush(stdout);

(void)gettimeofday(&timeStart, NULL);
CUresult error_id = cuInit(0);
(void)gettimeofday(&timeEnd, NULL);
fprintf(stdout,"cuInit time = %lf\n", ((timeEnd.tv_sec+(timeEnd.tv_usec*0.000001)) - (timeStart.tv_sec +(timeStart.tv_usec*0.000001))));
fflush(stdout);

if (error_id != CUDA_SUCCESS)
{
    printf("cuInit(0) returned %d\n-> %s\n", error_id, getCudaDrvErrorString(error_id));
    exit(EXIT_FAILURE);
}

exit(EXIT_SUCCESS);

}

Here’s the simple shell script to spawn 1 to N:
#!/bin/bash

Script to spawn 1 to N of the above apps

num=$1

if [ “$num” != “” ] ; then
while [ $num -gt 0 ] ; do
./deviceInitTimeDrv &
num=expr $num - 1
done
else
./deviceInitTimeDrv
fi

The issue is not related to Fermi or Kepler but to the amount of system memory.
We are working to reduce the start time, if would be helpful if you could file a bug to help us prioritize the work.

Thanks for the reply. Yes, I’ll file a bug. So this would have been a problem on my old platform if I had 256GB of system memory there as well?

If you’re aware of any workarounds besides pulling DIMMS on the system please let me know. Without a way around this the only other solution I can think of (granted I haven’t thought too hard about this hoping it was a simple OS or driver/GPU tunable) would be to redesign our software to have some centralized GPU manager that takes that initialization hit, and funnel all image streams as they come and go into it - which unfortunately is not within the scope of our time or budget.

I noticed the same thing in my code,

This seems to speed up the process (note each thread only uses 1 gpu, not sure how you’d generalize this):

n = omp_get_thread_num();
	int dev;
    CUDA(cudaSetDevice(n));
	CUDA(cudaDeviceReset());
	CUDA(cudaSetDevice(n));
	CUDA(cudaGetDevice(&dev));
	CUDA(cudaDeviceSynchronize());

This isn’t much help to you obviously, but it’s interesting that this fixes the problem, don’t you think?

Hello - I’m also seeing this problem, which makes our code to query the free memory on a multi-kepler machine exceptionally slow. Does anyone know the progress of this, or if there’s somewhere else to look to track this bug? Thanks.