cudaMemGetInfo() gets error 304 after 50+ hours of stress test

hardenbergh · October 19, 2020, 5:21pm

I’m running a stress test with two CUDA-based rendering server processes and a few other processes pushing lots of data at them. First one process gets a 304 error, and about 30 minutes later the second process gets the same error: Runtime API error 304: OS call failed or operation not supported on this OS.

The VM is using 2 Tesla M60 GPUs
CUDA 10.2
Driver: 443.66
System: Windows Server 2012 R2

The process memory is very stable. CUDA reports 1 GB of device memory free, and there have been no allocation errors.

The available system memory has fallen from 12.6 GB to under 8 GB for no reason that I can see.

Has anyone else seen anything like this, or does anyone have hints to follow up on?

Thanks!
