cudaMalloc error on Tesla P4 card

We have eight P4 cards on our server:

[root@localhost ~]# nvidia-smi
Tue Apr 23 14:15:28 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.104      Driver Version: 410.104      CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P4            Off  | 00000000:18:00.0 Off |                    0 |
| N/A   72C    P0    28W /  75W |   5413MiB /  7611MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla P4            Off  | 00000000:19:00.0 Off |                    0 |
| N/A   71C    P0    26W /  75W |   5413MiB /  7611MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   2  Tesla P4            Off  | 00000000:5F:00.0 Off |                    0 |
| N/A   68C    P0    25W /  75W |   5413MiB /  7611MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   3  Tesla P4            Off  | 00000000:86:00.0 Off |                    2 |
| N/A   39C    P8     7W /  75W |      0MiB /  7611MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   4  Tesla P4            Off  | 00000000:87:00.0 Off |                    0 |
| N/A   72C    P0    27W /  75W |   5413MiB /  7611MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   5  Tesla P4            Off  | 00000000:AF:00.0 Off |                    0 |
| N/A   60C    P0    25W /  75W |   5413MiB /  7611MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   6  Tesla P4            Off  | 00000000:B0:00.0 Off |                    0 |
| N/A   62C    P0    25W /  75W |   5413MiB /  7611MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   7  Tesla P4            Off  | 00000000:D8:00.0 Off |                    0 |
| N/A   70C    P0    25W /  75W |   5413MiB /  7611MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0    136366      C   java                                        5403MiB |
|    1    152242      C   java                                        5403MiB |
|    2    158402      C   java                                        5403MiB |
|    4    140028      C   java                                        5403MiB |
|    5    147174      C   java                                        5403MiB |
|    6    141910      C   java                                        5403MiB |
|    7    144020      C   java                                        5403MiB |
+-----------------------------------------------------------------------------+

But something seems wrong with card No. 3 (it is the only one showing a non-zero Volatile Uncorr. ECC count).

I wrote a small demo; the code is here:

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <sstream>

using namespace std;

int dataSize = 16;

int main(int argc, char*argv[])
{
	if(argc < 3)
	{
		printf("please enter device dataSize\n");
		return 0;
	}
	
	int device = 0;
	std::stringstream convert;
	convert << argv[1];
	convert >> device;
	
	convert.clear();
	convert << argv[2];
	convert >> dataSize;
	
	if(cudaSuccess != cudaSetDevice(device))
	{
		printf("cuda set device error\n");
		return -1;
	}

	// allocate dataSize ints on the selected device
	int * pGpuDistance;
	if(cudaSuccess != cudaMalloc((void **)&pGpuDistance, sizeof(int)*dataSize))
	{
		printf("cuda malloc error\n");
		return -1;
	}

	if(cudaSuccess != cudaFree(pGpuDistance))
	{
		printf("cuda free error\n");
		return -1;
	}
	
	if(cudaSuccess != cudaDeviceReset())
	{
		printf("cuda device reset error\n");
		return -1;
	}


	return 0;
}

When I run this demo, card No. 3 does not seem OK:

[root@localhost ~]# ./a.out 3 100
Segmentation fault
[root@localhost ~]# ./a.out 2 100
You have new mail in /var/spool/mail/root
[root@localhost ~]# ./a.out 4 100

The Linux version is:

[root@localhost ~]# cat /etc/redhat-release 
CentOS Linux release 7.4.1708 (Core)

Can anyone tell me what's wrong?

Thanks!!

If you swap the third P4 card with one of the other cards, what happens?

I used this command and the problem was solved.

nvidia-smi -i 3 -e 0
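
For what it's worth, as far as I know the ECC mode change only takes effect after the GPU has been reset or the machine rebooted. A minimal sketch like the one below (using cudaGetDeviceProperties and its ECCEnabled field) can confirm from inside a program whether each card now reports ECC as enabled or disabled:

#include "cuda_runtime.h"
#include <stdio.h>

// Sketch: list every device and whether the runtime sees ECC as enabled.
int main()
{
	int count = 0;
	if (cudaGetDeviceCount(&count) != cudaSuccess)
	{
		printf("cudaGetDeviceCount failed\n");
		return -1;
	}

	for (int i = 0; i < count; ++i)
	{
		cudaDeviceProp prop;
		if (cudaGetDeviceProperties(&prop, i) == cudaSuccess)
		{
			printf("GPU %d (%s): ECC %s\n", i, prop.name,
			       prop.ECCEnabled ? "enabled" : "disabled");
		}
	}
	return 0;
}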

The ECC value of '2' is the number of uncorrectable ECC errors the card has had.

I think that the ECC count of '2' is what causes the cudaErrorMemoryAllocation error.

Am I right?

ECC is the generic name for error-correcting codes that can detect and possibly correct read errors from RAM.
You can get more info from this page.
https://devtalk.nvidia.com/default/topic/1035722/cuda-programming-and-performance/strange-ecc-mode-reported-by-nvidia-smi-exe/
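
In case it is useful, here is a minimal sketch (assuming the NVML headers and library that ship with the driver, linked with -lnvidia-ml) that reads the same volatile uncorrectable ECC counter nvidia-smi shows, so a job could check a GPU's health before selecting it:

#include <nvml.h>
#include <stdio.h>
#include <stdlib.h>

// Sketch: read the volatile uncorrectable ECC error count for one GPU via NVML.
// Build with something like: g++ ecc_check.cpp -lnvidia-ml
int main(int argc, char *argv[])
{
	unsigned int index = (argc > 1) ? (unsigned int)atoi(argv[1]) : 0;

	if (nvmlInit() != NVML_SUCCESS)
	{
		printf("nvmlInit failed\n");
		return -1;
	}

	nvmlDevice_t device;
	if (nvmlDeviceGetHandleByIndex(index, &device) != NVML_SUCCESS)
	{
		printf("nvmlDeviceGetHandleByIndex(%u) failed\n", index);
		nvmlShutdown();
		return -1;
	}

	unsigned long long count = 0;
	if (nvmlDeviceGetTotalEccErrors(device,
	                                NVML_MEMORY_ERROR_TYPE_UNCORRECTED,
	                                NVML_VOLATILE_ECC,
	                                &count) == NVML_SUCCESS)
	{
		// Same counter as the "Volatile Uncorr. ECC" column in nvidia-smi.
		printf("GPU %u volatile uncorrectable ECC errors: %llu\n", index, count);
	}
	else
	{
		printf("ECC query not supported or failed on GPU %u\n", index);
	}

	nvmlShutdown();
	return 0;
}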

Thanks for this information, very helpful!

But I wonder what I can do if I come across this problem again?