I am finding that the initialization time on my card is pretty long, usually 2-3 seconds. This happens either on the first cudaMalloc or on cublasInit(). Is this normal? I gather from other forum posts that the initialization time should be at least an order of magnitude shorter, and in some cases two orders of magnitude (some people are reporting 40 ms??). Is there anything that I can do to improve this time?
My card is a GeForce GTX 460, and I’m compiling with -arch sm_21. My driver is the latest one, 270.41.19, and I’ve tried both CUDA toolkit 4.0.17 and 3.2.16 with no difference. I’m on CentOS 5.5 x86_64.
It appears that the initialization time is caused by the X server running. When I am in runlevel 3, the init time is 1-2 ms!
EDIT: I take it back. After running several times, the card init time goes back up to 2-3 seconds. I never started the X server, so there must be some other problem.
If it only happens the first time and you are in runlevel 3, then it is probably the missing device nodes in /dev (nvidiaX and nvidiactl). You can run this in /etc/local.d.
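For anyone who wants to confirm the missing-node theory before running a script, here is a minimal C sketch that only checks whether the nodes are there. It assumes a single-GPU machine, so it looks at /dev/nvidiactl and /dev/nvidia0; the function names are made up for illustration, and nothing is created or modified.

```c
#include <stdio.h>
#include <sys/stat.h>

/* Returns 1 if path exists and is a character device, 0 otherwise. */
int is_char_device(const char *path)
{
    struct stat st;
    if (stat(path, &st) != 0)
        return 0;                       /* missing or inaccessible */
    return S_ISCHR(st.st_mode) ? 1 : 0;
}

/* Prints the status of the control node and the first GPU node.
   Returns the number of nodes that are missing. */
int report_nvidia_nodes(void)
{
    /* Assumption: one GPU, so only nvidiactl and nvidia0 are checked. */
    const char *nodes[] = { "/dev/nvidiactl", "/dev/nvidia0" };
    int missing = 0;
    size_t i;

    for (i = 0; i < sizeof nodes / sizeof nodes[0]; i++) {
        int ok = is_char_device(nodes[i]);
        printf("%-16s %s\n", nodes[i], ok ? "present" : "MISSING");
        if (!ok)
            missing++;
    }
    return missing;
}
```

Calling report_nvidia_nodes() from main() prints one line per node and returns how many are missing. A non-zero result on a machine where X is not running would point at the missing-node explanation.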
Thanks for the reply. Unfortunately that does not fix my problem. The kernel module is loaded and the device nodes exist. I tried running your script just to be sure, and it doesn’t make any difference.
Do you have any other ideas? I have upgraded to the latest driver (285.05.09) and it still has this problem.
That was the last line of dfranusic’s script. I verified with “nvidia-smi -q” that persistence mode is enabled. It still takes ~3 seconds for my test code with a single cudaMalloc. Any other ideas?
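If the timing test is run many times from a script, a small helper can confirm the setting instead of reading the nvidia-smi output by eye. This is a rough C sketch, assuming the report contains a line of the form "Persistence Mode : Enabled" (the layout may differ between driver versions) and that the text has already been captured into a string. The name persistence_mode_from_log is hypothetical.

```c
#include <string.h>

/* Looks for the "Persistence Mode" field in captured `nvidia-smi -q` text.
   Returns 1 if its value starts with "Enabled", 0 if the field is present
   with any other value, and -1 if the field is not found. */
int persistence_mode_from_log(const char *log)
{
    const char *key = "Persistence Mode";
    const char *p = strstr(log, key);

    if (p == NULL)
        return -1;
    p += strlen(key);
    /* Skip the padding and the ':' separator before the value. */
    while (*p == ' ' || *p == '\t' || *p == ':')
        p++;
    return strncmp(p, "Enabled", 7) == 0 ? 1 : 0;
}
```

A return value of -1 is kept separate from 0 so that a report in an unexpected format is not mistaken for persistence mode being off.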
Before I answer all your questions, I should tell you that I booted with “acpi=off” and this problem seems to have gone away. So it looks like there is some conflict between ACPI and the nvidia module. Now, to your questions:
Yes.
==============NVSMI LOG==============
Timestamp : Wed Nov 2 11:44:01 2011
Driver Version : 270.41.19
Attached GPUs : 1
GPU 0:8:0
    Product Name : GeForce GTX 460
    Display Mode : N/A
    Persistence Mode : Enabled
    Driver Model
        Current : N/A
        Pending : N/A
    Serial Number : N/A
    GPU UUID : N/A
    Inforom Version
        OEM Object : N/A
        ECC Object : N/A
        Power Management Object : N/A
    PCI
        Bus : 8
        Device : 0
        Domain : 0
        Device Id : E2210DE
        Bus Id : 0:8:0
    Fan Speed : 48 %
    Memory Usage
        Total : 767 Mb
        Used : 3 Mb
        Free : 763 Mb
    Compute Mode : Default
    Utilization
        Gpu : N/A
        Memory : N/A
    Ecc Mode
        Current : N/A
        Pending : N/A
    ECC Errors
        Volatile
            Single Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Total : N/A
            Double Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Total : N/A
        Aggregate
            Single Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Total : N/A
            Double Bit
                Device Memory : N/A
                Register File : N/A
                L1 Cache : N/A
                L2 Cache : N/A
                Total : N/A
    Temperature
        Gpu : 38 C
    Power Readings
        Power State : N/A
        Power Management : N/A
        Power Draw : N/A
        Power Limit : N/A
    Clocks
        Graphics : N/A
        SM : N/A
        Memory : N/A
[deviceQuery] starting...
./deviceQuery Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Found 1 CUDA Capable device(s)
Device 0: "GeForce GTX 460"
CUDA Driver Version / Runtime Version 4.0 / 4.0
CUDA Capability Major/Minor version number: 2.1
Total amount of global memory: 767 MBytes (804454400 bytes)
( 7) Multiprocessors x (48) CUDA Cores/MP: 336 CUDA Cores
GPU Clock Speed: 1.45 GHz
Memory Clock rate: 1800.00 Mhz
Memory Bus Width: 192-bit
L2 Cache Size: 393216 bytes
Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65535), 3D=(2048,2048,2048)
Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and execution: Yes with 1 copy engine(s)
Run time limit on kernels: No
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Concurrent kernel execution: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support enabled: No
Device is using TCC driver mode: No
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 8 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 4.0, CUDA Runtime Version = 4.0, NumDevs = 1, Device = GeForce GTX 460
[deviceQuery] test results...
PASSED
real 0m0.018s
user 0m0.001s
sys 0m0.015s
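#include <cuda_runtime.h>   /* runtime API: cudaMalloc, cudaGetErrorString */
#include <stdio.h>
#include <sys/time.h>

int main(int argc, char** argv)
{
    struct timeval start, finish;
    double time1;
    int i;
    int *dev_ptr;
    cudaError_t err;

    for (i = 0; i < 5; i++) {
        /* The first cudaMalloc also pays for CUDA context creation. */
        gettimeofday(&start, NULL);
        err = cudaMalloc((void**)&dev_ptr, sizeof(int));
        gettimeofday(&finish, NULL);
        if (err != cudaSuccess) {
            fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
            return 1;
        }
        /* Elapsed wall-clock time in seconds. */
        time1 = finish.tv_sec - start.tv_sec + 1e-6 * (finish.tv_usec - start.tv_usec);
        printf(" Malloc %d took %e seconds\n", i, time1);
    }
    return 0;
}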
Here’s the output:
Malloc 0 took 2.491571e+00 seconds
Malloc 1 took 7.000000e-06 seconds
Malloc 2 took 3.000000e-06 seconds
Malloc 3 took 3.000000e-06 seconds
Malloc 4 took 3.000000e-06 seconds
This is with ACPI. When I boot with “acpi=off” the first call takes 0.2 seconds.
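The test above uses gettimeofday, which reads the wall clock and can jump if the system time is adjusted during a run. A possible alternative is sketched below, using clock_gettime with CLOCK_MONOTONIC. The helper names elapsed_seconds and time_call are made up for illustration, and the function passed in would be whatever triggers initialization, for example a wrapper around the first cudaMalloc.

```c
#define _POSIX_C_SOURCE 200809L   /* expose clock_gettime under strict -std modes */
#include <time.h>

/* Difference b - a in seconds. */
double elapsed_seconds(const struct timespec *a, const struct timespec *b)
{
    return (double)(b->tv_sec - a->tv_sec)
         + 1e-9 * (double)(b->tv_nsec - a->tv_nsec);
}

/* Runs fn once and returns how long the call took, in seconds. */
double time_call(void (*fn)(void))
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    fn();
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return elapsed_seconds(&t0, &t1);
}
```

On older glibc versions (before 2.17), clock_gettime lives in librt, so the program has to be linked with -lrt.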
ACPI may put the card in a power-saving mode; let me ask around to see if this is the case.
The first malloc takes a time comparable to what I reported for deviceQuery with the driver in persistence mode.
Malloc 0 took 1.149710e-01 seconds
Malloc 1 took 7.000000e-06 seconds
Malloc 2 took 4.000000e-06 seconds
Malloc 3 took 3.000000e-06 seconds
Malloc 4 took 3.000000e-06 seconds
This is also the case for me when I boot with “acpi=off” or “acpi=ht”. With ACPI on, though, it always takes around 3 seconds. I wonder whether updating the BIOS on my motherboard has some chance of fixing this?
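When comparing timings across reboots, it helps to record which ACPI option the running kernel was actually booted with. Below is a small C sketch that looks for an exact token such as acpi=off or acpi=ht on the kernel command line, read from /proc/cmdline. The function names cmdline_has_param and booted_with are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Returns 1 if the space-separated command line contains the exact
   token `param` (for example "acpi=off"), 0 otherwise. */
int cmdline_has_param(const char *cmdline, const char *param)
{
    size_t n = strlen(param);
    const char *p = cmdline;

    if (n == 0)
        return 0;
    while ((p = strstr(p, param)) != NULL) {
        /* A match only counts when it is a whole token. */
        int starts = (p == cmdline) || p[-1] == ' ';
        int ends = p[n] == '\0' || p[n] == ' ' || p[n] == '\n';
        if (starts && ends)
            return 1;
        p++;
    }
    return 0;
}

/* Reads /proc/cmdline and checks it for `param`.
   Returns -1 if the file cannot be read. */
int booted_with(const char *param)
{
    char buf[4096];
    FILE *f = fopen("/proc/cmdline", "r");

    if (f == NULL)
        return -1;
    if (fgets(buf, sizeof buf, f) == NULL) {
        fclose(f);
        return -1;
    }
    fclose(f);
    return cmdline_has_param(buf, param);
}
```

Printing booted_with("acpi=off") and booted_with("acpi=ht") next to each timing run makes it easier to tell afterwards which boot configuration a given number belongs to.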
Malloc 0 took 7.774200e-02 seconds
Malloc 1 took 3.380000e-04 seconds
Malloc 2 took 3.340000e-04 seconds
Malloc 3 took 3.360000e-04 seconds
Malloc 4 took 3.720000e-04 seconds
With ACPI ON + persistence mode OFF:
Malloc 0 took 6.010940e-01 seconds
Malloc 1 took 3.810000e-04 seconds
Malloc 2 took 3.700000e-04 seconds
Malloc 3 took 3.370000e-04 seconds
Malloc 4 took 3.370000e-04 seconds
I finally found a BIOS setting that works. My BIOS has an option called “ACPI APIC support”, and I set it to disabled. Now the card initializes as quickly as it does when ACPI is completely turned off by the kernel.
Unfortunately, multi-core performance is lousy with APIC turned off. So perhaps there is a bug in the NVIDIA driver? Should it matter what kernel I’m running? I’m using CentOS 6, which has 2.6.32-71.29.1.el6.x86_64 as the kernel.