Hi,
The time for cuInit() or cudaSetDevice() to return on our Tesla K20c Kepler GPUs is about 10 to 20 (!) times slower than on our Tesla S2050 Fermi GPUs. Yes, we have persistence mode enabled, have tried the nvidia-smi loop thing, and the latest driver. Does anyone have an idea as to why it’s so slow? Is this a driver issue, an OS issue, or could there be some hardware issue at play here? It’s definitely a degradation from our previous platform. The bandwidth test in the sdk shows much better performance in all areas with the Keplers, but initialization is horribly [url][/url]slow.
I took the sdk’s deviceQueryDrv and modified it simply to time the cuInit() call as shown below, then wrote a simple script to spawn 1 to N of these apps (also below). Here’s the output:
On Fermi:
Starting…
cuInit time = 0.202069
On Kepler:
Starting…
cuInit time = 1.753236
We have many multi-threaded applications (maybe as many as 20) that may start up simultaneously (roughly), we can tolerate some delay, but not that much. Again, this works fine on our Fermi platform.
If you run N instance, say 10, the Fermi stays relatively fast, whereas the Kepler simply bogs down – all of the instances get slow, not progressively slow like the Fermis.
(10 apps started at roughly the same time):
On Fermi:
Starting…
cuInit time = 0.200683
Starting…
Starting…
Starting…
Starting…
Starting…
Starting…
Starting…
Starting…
Starting…
cuInit time = 0.200975
cuInit time = 0.557699
cuInit time = 0.941136
cuInit time = 1.026017
cuInit time = 1.166268
cuInit time = 1.186676
cuInit time = 1.287552
cuInit time = 1.319250
cuInit time = 1.470359
On Kepler:
Starting…
Starting…
Starting…
Starting…
Starting…
Starting…
Starting…
Starting…
Starting…
Starting…
cuInit time = 18.169408
cuInit time = 19.190488
cuInit time = 19.949257
cuInit time = 20.858861
cuInit time = 21.737321
cuInit time = 22.568534
cuInit time = 23.402647
cuInit time = 24.295237
cuInit time = 24.289321
cuInit time = 24.282347
Egad! Look how slow the Kepler is! Adding “strace –tt” to the front of this app shows that most of the time is spent calling ioctl, to the tune of 9000+ times for a single call to cuInit() (yes, over 92,000 times if you spawn 10 instances).
I see this line over 9000 times in the strace output:
ioctl(3, 0xc0a0464a, 0x7fff76c93870) = 0[b][/b]
What gives?
Some HW information:
Fermi hardware:
- X86_64 running RHEL 5.5
- 32 cores, 64GB memory
- CUDA 4.0 (I think)
- Nvidia-smi –q snippet:
Driver Version : 270.41.19
Attached GPUs : 4
GPU 0:A:0
Product Name : Tesla T20 Processor
Display Mode : Disabled
Persistence Mode : Enabled
Kepler hardware:
- X86_64 running RHEL 6.3
- 32 cores, 256GB memory
- CUDA 5.0
- Nvidia-smi –q snippet:
Driver Version : 331.20
Attached GPUs : 4
GPU 0000:89:00.0
Product Name : Tesla K20c
Display Mode : Disabled
Display Active : Disabled
Persistence Mode : Enabled
Accounting Mode : Disabled
I’ve attached the sample program and run script - can someone try it on their Kepler/Fermi? I can’t generate an nvidia-bug-report since my sysadmin isn’t here and I don’t have root permission.
Thanks,
Rick
Here’s the simple app (use the deviceQueryDrv makefile in your CUDA 5.0 sdk):
// includes, system
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <cuda.h>
#include <helper_cuda_drvapi.h>
#include <drvapi_error_string.h>
////////////////////////////////////////////////////////////////////////////////
// Program main
////////////////////////////////////////////////////////////////////////////////
int
main(int argc, char **argv)
{
CUdevice dev;
struct timeval timeStart, timeEnd;
fprintf(stdout,"Starting...\n");
fflush(stdout);
(void)gettimeofday(&timeStart, NULL);
CUresult error_id = cuInit(0);
(void)gettimeofday(&timeEnd, NULL);
fprintf(stdout,"cuInit time = %lf\n", ((timeEnd.tv_sec+(timeEnd.tv_usec*0.000001)) - (timeStart.tv_sec +(timeStart.tv_usec*0.000001))));
fflush(stdout);
if (error_id != CUDA_SUCCESS)
{
printf("cuInit(0) returned %d\n-> %s\n", error_id, getCudaDrvErrorString(error_id));
exit(EXIT_FAILURE);
}
exit(EXIT_SUCCESS);
}
Here’s the simple shell script to spawn 1 to N:
#!/bin/bash
Script to spawn 1 to N of the above apps
num=$1
if [ “$num” != “” ] ; then
while [ $num -gt 0 ] ; do
./deviceInitTimeDrv &
num=expr $num - 1
done
else
./deviceInitTimeDrv
fi