Initialization time on GTX 460

indigodrive · July 29, 2011, 2:16pm

I am finding that the initialization time on my card is pretty long, usually 2-3 seconds. This happens either on the first cudaMalloc or on cublasInit(). Is this normal? I gather from other forum posts that the initialization time should be at least an order of magnitude shorter, and in some cases 2 orders of magnitude (some people are reporting 40ms??). Is there anything that I can do to improve this time?

My card is a GeForce GTX 460, and I’m compiling with -arch sm_21. My driver is the latest one, 270.41.19, and I’ve tried both CUDA toolkit 4.0.17 and 3.2.16 with no difference. I’m on CentOS 5.5 x86_64.

indigodrive · July 29, 2011, 3:17pm

It appears that the initialization time is caused by the X server running. When I am in runlevel 3, the init time is 1-2 ms!

EDIT: I take it back. After running several times, the card init time goes back up to 2-3 seconds. I never started the X server, so there must be some other problem.

dfranusic · August 6, 2011, 10:06am

If it only happens the first time and you are in runlevel 3, then it is probably the missing nodes in /dev (nvidiaX and nvidiactl). You can run this in /etc/local.d.

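#!/bin/bash

modprobe nvidia

if [ "$?" -eq 0 ]; then
    # Count the number of NVIDIA controllers found.
    # NOTE: the three assignments below are missing from the post as scraped;
    # they are assumed here, following the usual form of NVIDIA's sample startup script.
    NVDEVS=$(lspci | grep -i NVIDIA)
    N3D=$(echo "$NVDEVS" | grep -c "3D controller")
    NVGA=$(echo "$NVDEVS" | grep -c "VGA compatible controller")

    N=$(expr $N3D + $NVGA - 1)
    for i in $(seq 0 $N); do
        mknod -m 666 /dev/nvidia$i c 195 $i
    done

    mknod -m 666 /dev/nvidiactl c 195 255

    # Set persistence mode.
    nvidia-smi -pm 1
else
    exit 1
fi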
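
Before running a script like this, it may help to confirm whether the nodes are actually missing. The following is only a sketch: check_nvidia_nodes is a made-up helper name, and it tests for existence only (with -e), not for the correct character-device type or major/minor numbers.

```shell
# Sketch: list the NVIDIA device nodes that are missing from a directory.
# check_nvidia_nodes is a hypothetical helper, not part of the driver package.
# Usage: check_nvidia_nodes <gpu_count> [dev_dir]
check_nvidia_nodes() {
    local count="$1"
    local dev_dir="${2:-/dev}"
    local missing=0
    local i

    # The control node is shared by all GPUs.
    if [ ! -e "$dev_dir/nvidiactl" ]; then
        echo "missing: $dev_dir/nvidiactl"
        missing=1
    fi

    # One node per GPU: nvidia0, nvidia1, ...
    for i in $(seq 0 $((count - 1))); do
        if [ ! -e "$dev_dir/nvidia$i" ]; then
            echo "missing: $dev_dir/nvidia$i"
            missing=1
        fi
    done

    # Return 0 when every expected node exists, 1 otherwise.
    return "$missing"
}
```

On a single-GPU machine, "check_nvidia_nodes 1" prints nothing and returns 0 when /dev/nvidiactl and /dev/nvidia0 both exist.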

indigodrive · November 2, 2011, 2:17pm

Thanks for the reply. Unfortunately that does not fix my problem. The kernel module is loaded, the device nodes exist. I tried running your script just to be sure, and it doesn’t make any difference.

Do you have any other ideas? I have upgraded to the latest driver (285.05.09) and it still has this problem.

mfatica · November 2, 2011, 2:18pm

Set persistent mode with nvidia-smi:
nvidia-smi -pm 1

indigodrive · November 2, 2011, 2:27pm

That was the last line of dfranusic’s script. I verified with “nvidia-smi -q” that persistence mode is enabled. It still takes ~3 seconds for my test code with a single cudaMalloc. Any other ideas?
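
For anyone repeating this check, the persistence field can be pulled out of the "nvidia-smi -q" log with a small filter. This is a sketch only: persistence_mode is a made-up helper name, and it assumes the log contains a line of the form "Persistence Mode : Enabled", as in the output posted later in this thread.

```shell
# Sketch: extract the "Persistence Mode" value from `nvidia-smi -q` output.
# persistence_mode is a hypothetical helper; it reads the log on stdin.
persistence_mode() {
    awk -F':' '/Persistence Mode/ {
        gsub(/^[ \t]+|[ \t]+$/, "", $2)   # trim whitespace around the value
        print $2
        exit                              # only the first GPU entry is reported
    }'
}

# Typical use on a machine with the driver installed:
#   nvidia-smi -q | persistence_mode
```

Note that it reports only the first match, so on a multi-GPU system the other cards would need to be checked separately.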

mfatica · November 2, 2011, 2:57pm

Go back to 270.41.19, it works for me.

indigodrive · November 2, 2011, 3:13pm

I have downgraded to 270.41.19, still no luck. Can I give you any more info?

mfatica · November 2, 2011, 3:30pm

Are you using the script?
Could you post the output of nvidia-smi -q?

What is the output of “time deviceQuery -noprompt”?

indigodrive · November 2, 2011, 3:48pm

Before I answer all your questions, I should tell you that I booted with “acpi=off” and this problem seems to have gone away. So it looks like there is some conflict between acpi and the nvidia module. Now, to your questions:

Yes.

==============NVSMI LOG==============

Timestamp                       : Wed Nov  2 11:44:01 2011
Driver Version                  : 270.41.19
Attached GPUs                   : 1

GPU 0:8:0
    Product Name                : GeForce GTX 460
    Display Mode                : N/A
    Persistence Mode            : Enabled
    Driver Model
        Current                 : N/A
        Pending                 : N/A
    Serial Number               : N/A
    GPU UUID                    : N/A
    Inforom Version
        OEM Object              : N/A
        ECC Object              : N/A
        Power Management Object : N/A
    PCI
        Bus                     : 8
        Device                  : 0
        Domain                  : 0
        Device Id               : E2210DE
        Bus Id                  : 0:8:0
    Fan Speed                   : 48 %
    Memory Usage
        Total                   : 767 Mb
        Used                    : 3 Mb
        Free                    : 763 Mb
    Compute Mode                : Default
    Utilization
        Gpu                     : N/A
        Memory                  : N/A
    Ecc Mode
        Current                 : N/A
        Pending                 : N/A
    ECC Errors
        Volatile
            Single Bit
                Device Memory   : N/A
                Register File   : N/A
                L1 Cache        : N/A
                L2 Cache        : N/A
                Total           : N/A
            Double Bit
                Device Memory   : N/A
                Register File   : N/A
                L1 Cache        : N/A
                L2 Cache        : N/A
                Total           : N/A
        Aggregate
            Single Bit
                Device Memory   : N/A
                Register File   : N/A
                L1 Cache        : N/A
                L2 Cache        : N/A
                Total           : N/A
            Double Bit
                Device Memory   : N/A
                Register File   : N/A
                L1 Cache        : N/A
                L2 Cache        : N/A
                Total           : N/A
    Temperature
        Gpu                     : 38 C
    Power Readings
        Power State             : N/A
        Power Management        : N/A
        Power Draw              : N/A
        Power Limit             : N/A
    Clocks
        Graphics                : N/A
        SM                      : N/A
        Memory                  : N/A

[deviceQuery] starting...
./deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

Found 1 CUDA Capable device(s)

Device 0: "GeForce GTX 460"
  CUDA Driver Version / Runtime Version          4.0 / 4.0
  CUDA Capability Major/Minor version number:    2.1
  Total amount of global memory:                 767 MBytes (804454400 bytes)
  ( 7) Multiprocessors x (48) CUDA Cores/MP:     336 CUDA Cores
  GPU Clock Speed:                               1.45 GHz
  Memory Clock rate:                             1800.00 Mhz
  Memory Bus Width:                              192-bit
  L2 Cache Size:                                 393216 bytes
  Max Texture Dimension Size (x,y,z)             1D=(65536), 2D=(65536,65535), 3D=(2048,2048,2048)
  Max Layered Texture Size (dim) x layers        1D=(16384) x 2048, 2D=(16384,16384) x 2048
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per block:           1024
  Maximum sizes of each dimension of a block:    1024 x 1024 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 65535
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and execution:                 Yes with 1 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Concurrent kernel execution:                   Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support enabled:                No
  Device is using TCC driver mode:               No
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Bus ID / PCI location ID:           8 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 4.0, CUDA Runtime Version = 4.0, NumDevs = 1, Device = GeForce GTX 460

[deviceQuery] test results...
PASSED

real    0m0.018s
user    0m0.001s
sys     0m0.015s

mfatica · November 2, 2011, 3:56pm

Persistent mode is working (real 0m0.018s).
There may be something wrong with your code.

indigodrive · November 2, 2011, 4:02pm

Here’s the code:

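#include <cuda_runtime.h>  /* declares cudaMalloc; cuda.h is the driver API header */
#include <stdio.h>
#include <sys/time.h>

/* Time five consecutive one-int cudaMalloc calls.
   The first call also pays for CUDA runtime initialization. */
int main(int argc, char** argv)
{
    struct timeval start, finish;
    double time1;
    int i;
    int *dev_ptr;

    for (i = 0; i < 5; i++) {
        gettimeofday(&start, NULL);
        cudaMalloc((void**)&dev_ptr, sizeof(int));
        gettimeofday(&finish, NULL);

        /* Elapsed wall-clock time in seconds. */
        time1 = finish.tv_sec - start.tv_sec + 1e-6 * (finish.tv_usec - start.tv_usec);
        printf("  Malloc %d took  %e seconds\n", i, time1);
    }

    return 0;
}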

Here’s the output:

Malloc 0 took  2.491571e+00 seconds
Malloc 1 took  7.000000e-06 seconds
Malloc 2 took  3.000000e-06 seconds
Malloc 3 took  3.000000e-06 seconds
Malloc 4 took  3.000000e-06 seconds

This is with ACPI. When I boot with “acpi=off” the first call takes 0.2 seconds.

mfatica · November 2, 2011, 6:24pm

ACPI may put the card in a power saving mode, let me ask around to see if this is the case.

The first Malloc has a time comparable to the time I reported for deviceQuery with the driver in persistent mode.

Malloc 0 took 1.149710e-01 seconds
Malloc 1 took 7.000000e-06 seconds
Malloc 2 took 4.000000e-06 seconds
Malloc 3 took 3.000000e-06 seconds
Malloc 4 took 3.000000e-06 seconds

indigodrive · November 2, 2011, 7:13pm

I see the same timing when I boot with “acpi=off” or “acpi=ht”. But with ACPI on, the first call always takes around 3 seconds. I wonder whether updating the BIOS on my motherboard has some chance of fixing this?
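
When comparing boots like this, it is easy to lose track of which options the running kernel was actually started with. A minimal sketch of a check follows, assuming a Linux system where /proc/cmdline holds the boot parameters; acpi_boot_opts is a made-up helper name.

```shell
# Sketch: print the acpi=... options found on a kernel command line.
# acpi_boot_opts is a hypothetical helper. With no argument it reads
# /proc/cmdline, which holds the running kernel's boot parameters on Linux.
acpi_boot_opts() {
    local cmdline="${1:-$(cat /proc/cmdline)}"
    local opt

    # Walk the whitespace-separated parameters and keep the ACPI ones.
    for opt in $cmdline; do
        case "$opt" in
            acpi=*) echo "$opt" ;;
        esac
    done
}
```

Called with no argument, it prints "acpi=off" or "acpi=ht" if either was passed at boot, and nothing when ACPI was left at its default.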

mfatica · November 2, 2011, 7:20pm

Updating the SBIOS may solve your problem.

On a local machine:

With ACPI ON + persistent mode ON

Malloc 0 took 7.774200e-02 seconds
Malloc 1 took 3.380000e-04 seconds
Malloc 2 took 3.340000e-04 seconds
Malloc 3 took 3.360000e-04 seconds
Malloc 4 took 3.720000e-04 seconds

With ACPI ON + persistent mode OFF

Malloc 0 took 6.010940e-01 seconds
Malloc 1 took 3.810000e-04 seconds
Malloc 2 took 3.700000e-04 seconds
Malloc 3 took 3.370000e-04 seconds
Malloc 4 took 3.370000e-04 seconds
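
To compare runs like these side by side, the test program's output can be reduced to two numbers: the first call, which includes initialization, and the mean of the later calls. This is only a sketch; summarize_malloc_times is a made-up helper name, and it assumes lines of the form "Malloc N took X seconds" as printed by the test program in this thread.

```shell
# Sketch: summarize "Malloc N took X seconds" lines read from stdin.
# summarize_malloc_times is a hypothetical helper name.
summarize_malloc_times() {
    awk '$1 == "Malloc" && $3 == "took" {
        if (n == 0) first = $4 + 0      # first call includes initialization
        else        rest += $4          # later calls are plain allocations
        n++
    }
    END {
        if (n == 0) exit 1              # no matching lines were seen
        printf "first call: %.6f s\n", first
        if (n > 1) printf "later calls (mean): %.6f s\n", rest / (n - 1)
    }'
}

# Typical use, where ./malloc_test stands for the compiled test program:
#   ./malloc_test | summarize_malloc_times
```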

indigodrive · November 3, 2011, 3:17am

I tried updating the BIOS for the motherboard tonight, no dice. It still takes ~3 seconds for the first cudaMalloc with ACPI on. Any other ideas?

indigodrive · November 9, 2011, 2:19pm

I finally found a BIOS setting that works. In my BIOS there is an option “ACPI APIC support”, I set that to disabled. Now the card initializes the same as when ACPI was completely turned off by the kernel.

indigodrive · November 9, 2011, 2:44pm

Unfortunately, multi-core performance is lousy with APIC turned off. So perhaps there is a bug in the NVIDIA driver? Should it matter what kernel I’m running? I’m using CentOS 6, which has 2.6.32-71.29.1.el6.x86_64 as the kernel.
