Several failures when running memory tests on CentOS 7 machines with 8 K80s.

These are two new machines and both are experiencing the same issues. They’re running the CUDA 8 toolkit with the 375.26 driver installed, on CentOS 7 with 8 K80s (16 GPUs) each, and we’re running this software:

https://github.com/lichinka/cuda-stress/blob/master/cudamemtest-1.2.3/cuda_memtest.cu

There are a few more, I believe. It’s sort of a medley of compiled memory tests that we’ve been running for years, grown into a suite that just makes sure the cards are good. I’ll add the scripts we’re running to verify memory integrity as I go. If it were just one machine or one card, I’d just say it was a bad card, but it doesn’t seem likely that’s the culprit here.

These are the errors that, at different times, we’ve encountered on these machines:

ERROR: Some GPU threads are not progressing (healthy_threads=15, num_gpus=16) (The other machine frequently reports a different number of healthy threads on test re-runs.)

Error CUDA error: unload of CUDA runtime failed, line 314, file tests.cu

Error CUDA error:  invalid device ordinal

Some of those failures/results are from non-NVIDIA software/scripts, and I appreciate that maybe I can’t get help with those. But does anyone have experience with this sort of machine configuration, and why it might be throwing these errors that our other machine configs (fewer K80s, running CentOS 6) do not experience?

Given the scant information, one can only guess. I am not familiar with cuda_memtest; the software itself may be flaky.

(1) Was this K80-based system put together by an integrator that is an official NVIDIA partner? What make and model is this machine? List of NVIDIA partners: http://www.nvidia.com/object/where-to-buy-tesla.html

(2) The K80 is a passively cooled device with specific requirements for airflow past its heat sink (see specifications). Is that guaranteed in your system? Are the ambient temperature and altitude within the specified operating range? What GPU temperatures does nvidia-smi report?

(3) Is the power supply adequate? Ideally, the PSU should be sized so that the sum of the power ratings of all system components does not exceed 60% of the PSU’s rating. Use efficient, high-quality PSUs only, preferably 80 PLUS Titanium-compliant units, or at least 80 PLUS Platinum. Given that each K80 has a 300 W power rating, total nominal system power is probably around 2700 W, so ideally you’d have a 4500 W power supply. What does the machine actually have?

There may be other issues that interfere with proper operation of such a machine, such as an electromagnetically noisy environment (e.g. installation on a factory floor), mechanical vibrations, or the use of PCIe riser cards.

I appreciate the information is scant. I also appreciate that the error codes are coming from non-NVIDIA software/scripts. We’ve been running this software for a while, though that in itself isn’t worthwhile testimony.

I’ll fill in all the gaps I can. I know it’s far beyond reasonable to ask anyone to peruse open-source software looking for potential causes of memory issues.

Here, however, is the code around the relevant “unload of CUDA runtime failed” error:

unsigned int
move_inv_test(char* ptr, unsigned int tot_num_blocks, unsigned int p1, unsigned int p2)
{
    unsigned int i;
    unsigned int err = 0;
    char* end_ptr = ptr + tot_num_blocks * BLOCKSIZE;

    for (i = 0; i < tot_num_blocks; i += GRIDSIZE) {
        dim3 grid;
        grid.x = GRIDSIZE;
        kernel_move_inv_write<<<grid, 1>>>(ptr + i * BLOCKSIZE, end_ptr, p1); SYNC_CUERR;
        SHOW_PROGRESS("move_inv_write", i, tot_num_blocks);
    }

    for (i = 0; i < tot_num_blocks; i += GRIDSIZE) {
        dim3 grid;
        grid.x = GRIDSIZE;
        kernel_move_inv_readwrite<<<grid, 1>>>(ptr + i * BLOCKSIZE, end_ptr, p1, p2, err_count, err_addr, err_expect, err_current, err_second_read); SYNC_CUERR;
        err += error_checking("move_inv_readwrite", i);
        SHOW_PROGRESS("move_inv_readwrite", i, tot_num_blocks);
    }

    for (i = 0; i < tot_num_blocks; i += GRIDSIZE) {
        dim3 grid;
        grid.x = GRIDSIZE;
        kernel_move_inv_read<<<grid, 1>>>(ptr + i * BLOCKSIZE, end_ptr, p2, err_count, err_addr, err_expect, err_current, err_second_read); SYNC_CUERR;
        err += error_checking("move_inv_read", i);
        SHOW_PROGRESS("move_inv_read", i, tot_num_blocks);
    }

    return err;
}
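For context, SYNC_CUERR here is presumably a synchronize-then-check macro along these lines (a sketch matching the error format above, not the project’s actual definition). The message “unload of CUDA runtime failed” is just whatever string cudaGetErrorString() returns for the error code that the wrapped synchronize call surfaced, so it points at the kernel launch the macro follows, not at an actual runtime unload:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Sketch of a SYNC_CUERR-style macro (assumption: the real macro in
// cuda_memtest may differ in detail). It synchronizes so that any
// asynchronous kernel error is reported at this line, then prints in
// the "CUDA error: <msg>, line N, file F" format seen in the output.
#define SYNC_CUERR do {                                             \
    cudaError_t e = cudaDeviceSynchronize();                        \
    if (e != cudaSuccess) {                                         \
        fprintf(stderr, "Error CUDA error: %s, line %d, file %s\n", \
                cudaGetErrorString(e), __LINE__, __FILE__);         \
        return 1;                                                   \
    }                                                               \
} while (0)
```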

(1) The machine was put together by Microway; I believe they’re partnered with NVIDIA. I don’t know the Microway model offhand, but the motherboard is a Supermicro X10DRG with dual E5-2650 CPUs and 256 GB of memory.

(2) Temperature control seems fine. The GPUs idle around 28-30 °C, and in use seem to climb only as high as ~55 °C, as reported by nvidia-smi.

(3) I will confirm the power supply rating, but I do believe it is sufficient.

I suppose I may just have been hoping to run into someone who’d encountered this exact situation before.

New machines with K80s seems very odd.

I’m surprised that anyone would buy new machines with K80s.

Anyway, you might just want to go back to Microway and tell them that these new machines are not working according to your expectations.

Yes, Microway is an official partner for NVIDIA Tesla-based systems. They are on the current list (see URL above) and have been on it for years, although I couldn’t say exactly for how long (based on my recollection, probably at least five years at this point).

Given that, don’t worry about tracking down the power supply rating, as official integrators should be very familiar with all the requirements of building a robust Tesla-based system. Many questions about flaky Tesla-based systems posted in these forums are from people who attempt to build their own system and run into trouble; building a system with many Tesla GPUs requires non-trivial expertise. Thus my pre-emptive question about power supply.

Since you bought from an official integrator, you should have technical support through the integrator, and therefore my suggestion would be to contact the vendor. They are most familiar with the details of the system they shipped to you, may have seen similar issues before, know what to double check (e.g. system BIOS settings), and can replace defective parts (if any).

That’s very fair and I will pursue that avenue as well.

However, there is one error that I think might be relevant.

ERROR: CUDA error: invalid device ordinal, line 146, file cuda_memtest.cu

That line is

thread_func(void* _arg)
{
    arg_t* arg = (arg_t*)_arg;
    unsigned int device = arg->device;
    gpu_idx = device;

    struct cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, device); CUERR;

    display_device_info(&prop);

    unsigned long totmem = prop.totalGlobalMem;

    PRINTF("major=%d, minor=%d\n", prop.major, prop.minor);

    // need to leave a little headroom or later calls will fail
    unsigned int tot_num_blocks = totmem / BLOCKSIZE - 16;
    if (max_num_blocks != 0) {
        tot_num_blocks = MIN(max_num_blocks + 16, tot_num_blocks);
    }

I can run /usr/local/cuda/extras/demo_suite/deviceQuery and it reports all 16 GPUs. Here, however, it fails. Is there a path I can take within the CUDA suite of tools/tests to determine why this could be happening?

From the documentation page for cudaGetDeviceProperties:

http://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__DEVICE.html#group__CUDART__DEVICE_1g1bf9d625a931d657e08db2b4391170f0

It might be helpful to print the device ordinals to see at which specific ordinal it fails. txbob may have additional ideas, e.g. what to look for in lspci output.
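A minimal standalone probe along those lines (a sketch assuming only the CUDA runtime; compile with nvcc) would loop over the ordinals, report exactly where cudaGetDeviceProperties starts failing, and print each device’s PCI bus ID for cross-checking against lspci:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaError_t err = cudaGetDeviceCount(&n);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("device count: %d\n", n);

    for (int dev = 0; dev < n; ++dev) {
        cudaDeviceProp prop;
        err = cudaGetDeviceProperties(&prop, dev);
        if (err != cudaSuccess) {
            // Shows the first ordinal at which the runtime balks.
            printf("ordinal %2d: FAILED (%s)\n", dev, cudaGetErrorString(err));
            continue;
        }
        char pci[32] = {0};
        cudaDeviceGetPCIBusId(pci, sizeof(pci), dev);
        printf("ordinal %2d: %s at PCI %s\n", dev, prop.name, pci);
    }
    return 0;
}
```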

Was this system delivered with eight K80s by the vendor, or was this originally a six-GPU system (like your second system) and someone added two K80s later? If you suspect a defective K80, you could try cyclically exchanging the GPUs to see whether the failure follows the GPU, but personally I would save myself the trouble of mucking with the hardware and talk to the vendor first. Mucking with the hardware yourself might also void the vendor’s warranty, so check your paperwork.

Actually, to be clearer, the prior CentOS6 machines only had 4 K80s.

But these two newer ones were delivered with 8 K80s each by Microway, yes.

I’ll also look for ways to enumerate the PCI bus through CUDA, but it’s worth pointing out that the error does not seem to come from the same card each time.

I can compare the output of lspci | grep NVIDIA with nvidia-smi, and I get the same reported bus for each card.

It does appear that it might be an issue with the script. It also appears that it can’t see/use more than 15 GPUs. And in fact, in https://github.com/lichinka/cuda-stress/blob/master/cudamemtest-1.2.3/cuda_memtest.cu there is

#define MAX_NUM_GPUS 8

Does anyone know why, once upon a time, that might have been necessary?