nvidia-smi on Amazon EC2 Cannot stop ECC for second GPU

gserban · February 2, 2011, 5:59pm

Hi,

I recently began using Amazon EC2 as a testbed for multi-GPU computations.
Since I currently do not neet the ECC feature, I stop it to gain more performance. However, I was very surprized (and annoyed) by the fact that I am only able to do this for the first GPU. Both GPUs are identical, M2050, as reported by deviceQuery:

./deviceQuery Starting…

CUDA Device Query (Runtime API) version (CUDART static linking)

There are 2 devices supporting CUDA

Device 0: “Tesla M2050”
CUDA Driver Version: 3.20
CUDA Runtime Version: 3.20
CUDA Capability Major/Minor version number: 2.0
Total amount of global memory: 2817982464 bytes
Multiprocessors x Cores/MP = Cores: 14 (MP) x 32 (Cores/MP) = 448 (Cores)
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Clock rate: 1.15 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: No
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Default (multiple host threads can use this device simultaneously)
Concurrent kernel execution: Yes
Device has ECC support enabled: Yes
Device is using TCC driver mode: No

Device 1: “Tesla M2050”
CUDA Driver Version: 3.20
CUDA Runtime Version: 3.20
CUDA Capability Major/Minor version number: 2.0
Total amount of global memory: 2817982464 bytes
Multiprocessors x Cores/MP = Cores: 14 (MP) x 32 (Cores/MP) = 448 (Cores)
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 32768
Warp size: 32
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Clock rate: 1.15 GHz
Concurrent copy and execution: Yes
Run time limit on kernels: No
Integrated: No
Support host page-locked memory mapping: Yes
Compute mode: Default (multiple host threads can use this device simultaneously)
Concurrent kernel execution: Yes
Device has ECC support enabled: Yes
Device is using TCC driver mode: No

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 3.20, CUDA Runtime Version = 3.20, NumDevs = 2, Device = Tesla M2050, Device = Tesla M2050

PASSED

Press to Quit…

For nvidia-smi, I get the following:

nvidia-smi -r
ECC configuration for GPU 0:
Current: 1
After reboot: 1
ECC is not supported by GPU 1

and, needless to say, when I try to
nvidia-smi -g 0 --ecc-config=1
everything works, but when I try to
nvidia-smi -g 1 --ecc-config=0
i get
ECC is not supported by GPU 1 or the ECC configuration cannot be changed

Has anybody seen this problem before? Is there a solution?

Cheers,
Serban

tmurray · February 2, 2011, 6:33pm

can you file a bug with Amazon?

gserban · February 2, 2011, 7:46pm

Just reported the issue, hope it doesn’t take to long …

Robert_Crovella · February 4, 2011, 5:18pm

I don’t see the driver version you are using listed in your post (maybe I missed it.) If you’re not using the latest driver version could you retry your case with the latest Tesla M2050 driver posted to nvidia.com (right now it appears to me to be 260.1936) Linux x64 (AMD64/EM64T) Display Driver | 260.19.36 | Linux 64-bit | NVIDIA

gserban · February 4, 2011, 9:38pm

Hey, thanks a lot! That did the trick !

The thing is, I had the latest developer driver from here http://developer.nvidia.com/object/cuda_3_0_downloads.html#Linux, that is http://developer.download.nvidia.com/compute/cuda/3_0/drivers/devdriver_3.0_linux_64_195.36.15.run, which is 260.19.26. With the updated driver all works correctly:

nvidia-smi -r

ECC configuration for GPU 0:

Current: 0

After reboot: 0

ECC configuration for GPU 1:

Current: 0

After reboot: 0

That’s somewhat weird, that the latest developer driver is outdated … anyway, now both GPUs are performing well. So problem solved.

Cheers,

Serban

Topic		Replies	Views
Results of running "deviceQuery" on Amazon EC2 GPU Instance Output of running the command de CUDA Programming and Performance	0	12874	February 15, 2011
nvidia-smi reports 3 GPUs but deviceQuery reports only 2 CUDA Setup and Installation	4	2011	June 23, 2018
CUDA test performance issue CUDA Programming and Performance	7	1445	November 24, 2014
deviceQuery and deviceQueryDrv pass other CUDA programs fail CUDA Setup and Installation	3	1856	November 13, 2013
Nvidia-smi.exe can detect two devices but devicequery.exe can do only one on Win7 CUDA Setup and Installation	0	1203	July 26, 2016
Problem with cudaGetDeviceCount returned 802 error Linux cuda	6	2633	December 28, 2024
Bug in basic arithmetic on M2050 CUDA Programming and Performance	9	10609	November 29, 2010
Error when nvidia-smi command is executed CUDA Programming and Performance	3	2322	May 22, 2018
pi cuda CUDA Programming and Performance	1	2049	March 27, 2011
K20 with high utilization, but no compute processes. CUDA Setup and Installation	12	26645	March 19, 2015

nvidia-smi on Amazon EC2 Cannot stop ECC for second GPU

Press to Quit…

Related topics