Initialization time on GTX 460

I am finding that the initialization time on my card is pretty long, usually 2-3 seconds. This happens either on the first cudaMalloc or on cublasInit(). Is this normal? I gather from other forum posts that the initialization time should be at least an order of magnitude shorter, and in some cases two orders of magnitude (some people report ~40 ms). Is there anything I can do to improve this time?

My card is a Geforce GTX 460, and I’m compiling with -arch sm_21. My driver is the latest one, 270.41.19, and I’ve tried both cuda toolkit 4.0.17 and 3.2.16 with no difference. I’m on CentOS 5.5 x86_64.

It appears that the initialization time is caused by the X server running. When I am in runlevel 3, the init time is 1-2 ms!

EDIT: I take it back. After running several times, the card init time goes back up to 2-3 seconds. I never started the X server, so there must be some other problem.

If it only happens the first time and you are in runlevel 3, then it is probably the missing device nodes in /dev (nvidiaX and nvidiactl). You can run this script from /etc/local.d:

#!/bin/bash

modprobe nvidia

if [ "$?" -eq 0 ]; then
  # Count the number of NVIDIA controllers found.
  N3D=$(lspci | grep -i NVIDIA | grep "3D controller" | wc -l)
  NVGA=$(lspci | grep -i NVIDIA | grep "VGA compatible controller" | wc -l)

  # Highest device index is (total controllers - 1).
  N=$(expr $N3D + $NVGA - 1)
  for i in $(seq 0 $N); do
    mknod -m 666 /dev/nvidia$i c 195 $i
  done

  mknod -m 666 /dev/nvidiactl c 195 255

  # Set persistence mode.
  nvidia-smi -pm 1
else
  exit 1
fi

Thanks for the reply. Unfortunately that does not fix my problem. The kernel module is loaded, the device nodes exist. I tried running your script just to be sure, and it doesn’t make any difference.

Do you have any other ideas? I have upgraded to the latest driver (285.05.09) and it still has this problem.

Set persistence mode with nvidia-smi:
nvidia-smi -pm 1

That was the last line of dfranusic’s script. I verified with “nvidia-smi -q” that persistence mode is enabled. It still takes ~3 seconds for my test code with a single cudaMalloc. Any other ideas?

Go back to 270.41.19; it works for me.

I have downgraded to 270.41.19, still no luck. Can I give you any more info?

Are you using the script?
Could you post the output of nvidia-smi -q?

What is the output of “time deviceQuery -noprompt”?

Before I answer all your questions, I should tell you that I booted with “acpi=off” and this problem seems to have gone away. So it looks like there is some conflict between acpi and the nvidia module. Now, to your questions:

Yes.

==============NVSMI LOG==============

Timestamp                       : Wed Nov  2 11:44:01 2011
Driver Version                  : 270.41.19

Attached GPUs                   : 1

GPU 0:8:0
    Product Name                : GeForce GTX 460
    Display Mode                : N/A
    Persistence Mode            : Enabled
    Driver Model
        Current                 : N/A
        Pending                 : N/A
    Serial Number               : N/A
    GPU UUID                    : N/A
    Inforom Version
        OEM Object              : N/A
        ECC Object              : N/A
        Power Management Object : N/A
    PCI
        Bus                     : 8
        Device                  : 0
        Domain                  : 0
        Device Id               : E2210DE
        Bus Id                  : 0:8:0
    Fan Speed                   : 48 %
    Memory Usage
        Total                   : 767 Mb
        Used                    : 3 Mb
        Free                    : 763 Mb
    Compute Mode                : Default
    Utilization
        Gpu                     : N/A
        Memory                  : N/A
    Ecc Mode
        Current                 : N/A
        Pending                 : N/A
    ECC Errors
        Volatile
            Single Bit
                Device Memory   : N/A
                Register File   : N/A
                L1 Cache        : N/A
                L2 Cache        : N/A
                Total           : N/A
            Double Bit
                Device Memory   : N/A
                Register File   : N/A
                L1 Cache        : N/A
                L2 Cache        : N/A
                Total           : N/A
        Aggregate
            Single Bit
                Device Memory   : N/A
                Register File   : N/A
                L1 Cache        : N/A
                L2 Cache        : N/A
                Total           : N/A
            Double Bit
                Device Memory   : N/A
                Register File   : N/A
                L1 Cache        : N/A
                L2 Cache        : N/A
                Total           : N/A
    Temperature
        Gpu                     : 38 C
    Power Readings
        Power State             : N/A
        Power Management        : N/A
        Power Draw              : N/A
        Power Limit             : N/A
    Clocks
        Graphics                : N/A
        SM                      : N/A
        Memory                  : N/A
[deviceQuery] starting...

./deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

Found 1 CUDA Capable device(s)

Device 0: "GeForce GTX 460"
  CUDA Driver Version / Runtime Version          4.0 / 4.0
  CUDA Capability Major/Minor version number:    2.1
  Total amount of global memory:                 767 MBytes (804454400 bytes)
  ( 7) Multiprocessors x (48) CUDA Cores/MP:     336 CUDA Cores
  GPU Clock Speed:                               1.45 GHz
  Memory Clock rate:                             1800.00 Mhz
  Memory Bus Width:                              192-bit
  L2 Cache Size:                                 393216 bytes
  Max Texture Dimension Size (x,y,z)             1D=(65536), 2D=(65536,65535), 3D=(2048,2048,2048)
  Max Layered Texture Size (dim) x layers        1D=(16384) x 2048, 2D=(16384,16384) x 2048
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 32768
  Warp size:                                     32
  Maximum number of threads per block:           1024
  Maximum sizes of each dimension of a block:    1024 x 1024 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 65535
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and execution:                 Yes with 1 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Concurrent kernel execution:                   Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support enabled:                No
  Device is using TCC driver mode:               No
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Bus ID / PCI location ID:           8 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 4.0, CUDA Runtime Version = 4.0, NumDevs = 1, Device = GeForce GTX 460

[deviceQuery] test results...
PASSED

real    0m0.018s
user    0m0.001s
sys     0m0.015s

Persistence mode is working (real 0m0.018s), so there may be something wrong with your code.

Here’s the code:

#include <cuda_runtime.h>
#include <stdio.h>
#include <sys/time.h>

int main(int argc, char** argv)
{
    struct timeval start, finish;
    double time1;
    int i;
    int *dev_ptr;

    for (i = 0; i < 5; i++) {
        gettimeofday(&start, NULL);
        cudaMalloc((void**)&dev_ptr, sizeof(int));
        gettimeofday(&finish, NULL);
        time1 = finish.tv_sec - start.tv_sec + 1e-6*(finish.tv_usec - start.tv_usec);
        printf("  Malloc %d took  %e seconds\n", i, time1);
    }
    return 0;
}

Here’s the output:

  Malloc 0 took  2.491571e+00 seconds
  Malloc 1 took  7.000000e-06 seconds
  Malloc 2 took  3.000000e-06 seconds
  Malloc 3 took  3.000000e-06 seconds
  Malloc 4 took  3.000000e-06 seconds

This is with ACPI. When I boot with “acpi=off” the first call takes 0.2 seconds.

ACPI may be putting the card in a power-saving mode; let me ask around to see if this is the case.

The first Malloc has a time comparable to the time I reported for deviceQuery with the driver in persistence mode.

Malloc 0 took 1.149710e-01 seconds
Malloc 1 took 7.000000e-06 seconds
Malloc 2 took 4.000000e-06 seconds
Malloc 3 took 3.000000e-06 seconds
Malloc 4 took 3.000000e-06 seconds

That is also the case for me when I boot with “acpi=off” or “acpi=ht”. But with ACPI on, the first call always takes around 3 seconds. I wonder whether updating the BIOS on my motherboard has any chance of fixing this?

Updating the SBIOS may solve your problem.

On a local machine:

With ACPI ON + persistent mode ON

Malloc 0 took 7.774200e-02 seconds
Malloc 1 took 3.380000e-04 seconds
Malloc 2 took 3.340000e-04 seconds
Malloc 3 took 3.360000e-04 seconds
Malloc 4 took 3.720000e-04 seconds

With ACPI ON + persistent mode OFF

Malloc 0 took 6.010940e-01 seconds
Malloc 1 took 3.810000e-04 seconds
Malloc 2 took 3.700000e-04 seconds
Malloc 3 took 3.370000e-04 seconds
Malloc 4 took 3.370000e-04 seconds

I tried updating the BIOS for the motherboard tonight; no dice. It still takes ~3 seconds for the first cudaMalloc with ACPI on. Any other ideas?

I finally found a BIOS setting that works. In my BIOS there is an option “ACPI APIC support”; I set it to Disabled. Now the card initializes the same as when ACPI was completely turned off by the kernel.

Unfortunately, multi-core performance is lousy with APIC turned off. So perhaps there is a bug in the nvidia driver? Should it matter what kernel I’m running? I’m using CentOS 6, which has 2.6.32-71.29.1.el6.x86_64 as the kernel.