Environment:
- Hardware - GPU A100
- Hardware - CPU Intel(R) Xeon(R) Gold 6342 CPU @ 2.80GHz
- Operating System - Ubuntu 20.04.5 LTS
- Riva Version - 2.10.0
- NVIDIA driver version: 525.60.13
- CUDA version: 12.0
- Docker version: 23.0.1
Steps to reproduce:
bash riva_init.sh
bash riva_start.sh
Then send any ASR streaming request from a client (nvidia-riva-client==2.10.0).
Results
The Riva server can’t process any ASR requests and throws many errors like the following:
cudaError_t 700 : "an illegal memory access was encountered" returned from 'cudaMemset2DAsync( data_.get(), stride_ * sizeof(Real), 0, num_cols_ * sizeof(Real), num_rows_, cudaStreamPerThread)' in fileriva/utils/matrix/cu_matrix.cc line 122'
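To see which call sites report the error most often, the error lines in riva_speech.log can be tallied by error code and source location. The following is a minimal sketch and not part of Riva: the function name summarize_cuda_errors is hypothetical, and the pattern assumes every error line has the same shape as the one quoted above (including the missing space after "file").

```shell
#!/usr/bin/env bash
# Tally cudaError_t lines by error code and reporting source location.
# Reads log text on stdin and prints "<count> <code> <file>:<line>",
# most frequent first. Hypothetical helper, not shipped with Riva.
summarize_cuda_errors() {
  grep 'cudaError_t' \
    | sed -E 's/.*cudaError_t ([0-9]+) .* in file(.+) line ([0-9]+).*/\1 \2:\3/' \
    | awk '{ count[$0]++ } END { for (k in count) print count[k], k }' \
    | sort -rn
}

# Example usage against the attached server log:
#   summarize_cuda_errors < riva_speech.log
```

A tally like this makes it easier to tell whether one kernel or copy is the first to fail, or whether every later CUDA call on the same context is simply reporting the same sticky error.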
I also tried setting gpus_to_use="all", but nothing changed.
I want to note that this problem occurs with Riva versions starting from 2.8.0, while 2.7.0 works without any issues with the same configuration.
I hope these files help with the investigation.
config.sh (12.7 KB)
nvidia-smi.txt (2.3 KB)
riva_init.log (150.2 KB)
riva_speech.log (2.0 MB)
riva_start.log (265 Bytes)
Any help on this would be much appreciated.
Thanks in advance!
Hi @vbilous
Thanks for your interest in Riva.
Apologies that you are facing this issue.
Thanks for sharing the logs. I will check with the Riva team and provide updates.
Thanks
We also tried running Riva on a different VM, but the result is still the same.
Environment:
- Hardware - GPU A40
- Hardware - CPU Intel(R) Xeon(R) Gold 6354 CPU @ 3.00GHz
- Operating System - Ubuntu 20.04.5 LTS
- Riva Version - 2.11.0
- NVIDIA driver version: 525.105.17
- CUDA version: 12.0
- Docker version: 23.0.1
config.sh (13.3 KB)
nvidia-smi.log (1.5 KB)
riva_init.log (171.5 KB)
riva_speech.log (365.4 KB)
riva_start.log (333 Bytes)
Thanks in advance
Hi @vbilous
Sincere apologies for the delay.
I have not finished triaging this or found a solution yet,
but I have one thing for you to test.
The CUDA version on your end is currently 12.0.
Could you downgrade to CUDA 11.8 and check again?
Riva works with CUDA 11.8.
Please try it and let us know while I gather more information.
Thanks
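Before changing anything, it may help to record which driver and CUDA versions the host actually reports. Below is a small sketch that pulls both values out of the nvidia-smi banner (for example from the attached nvidia-smi.txt). The function name smi_versions is hypothetical, and the pattern assumes the usual "Driver Version: ... CUDA Version: ..." banner line. Keep in mind that the CUDA version nvidia-smi prints is the highest version the installed driver supports, which is not necessarily the CUDA runtime bundled inside the Riva container.

```shell
#!/usr/bin/env bash
# Print "driver=<version> cuda=<version>" from nvidia-smi banner text on stdin.
# Hypothetical helper; only the first matching line is reported.
smi_versions() {
  sed -nE 's/.*Driver Version: ([0-9.]+).*CUDA Version: ([0-9.]+).*/driver=\1 cuda=\2/p' \
    | head -n 1
}

# Example usage:
#   nvidia-smi | smi_versions
#   smi_versions < nvidia-smi.txt
```

Running the same check on both VMs gives a compact record to compare before and after a driver or CUDA change.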
Hi @vbilous
Apologies for the delay.
This issue has been fixed and won’t occur in the next release of Riva.
The next release is tentatively scheduled for the first week of July.
Thanks
Hi @vbilous
I gather the 2.12.0 release didn’t solve the issue.
One question: can you confirm whether you are using a vGPU setup or a regular bare-metal setup?
If vGPU, can you share the driver versions installed on the guest and the host?
Thanks
Hi @vbilous
Could you try enabling UVM and let me know whether the issue still persists?
Doc reference: Virtual GPU Software User Guide :: NVIDIA Virtual GPU Software Documentation
Thanks
Hi @rvinobha,
I’m facing the same issue. It only happens when using more than one GPU. Do you have a solution?
Here is my log:
cudaError_t 700 : "an illegal memory access was encountered" returned from 'cudaMemcpyAsync( this->data_.get(), &vec[0], vec.size() * sizeof(Real), cudaMemcpyHostToDevice, cudaStreamPerThread)' in fileriva/utils/matrix/cu_vector.cc line 90'
cudaError_t 700 : "an illegal memory access was encountered" returned from 'cudaMemcpyAsync( this->data_.get(), &vec[0], vec.size() * sizeof(Real), cudaMemcpyHostToDevice, cudaStreamPerThread)' in fileriva/utils/matrix/cu_vector.cc line 90'
cudaError_t 700 : "an illegal memory access was encountered" returned from 'cudaGetLastError()' in fileexternal/cu-feat-extr/src/cudafeat/feature-online-batched-spectral-cuda.cc line 122'
cudaError_t 700 : "an illegal memory access was encountered" returned from 'cudaGetLastError()' in fileexternal/cu-feat-extr/src/cudafeat/feature-online-batched-spectral-cuda.cc line 149'
cudaError_t 700 : "an illegal memory access was encountered" returned from 'cudaGetLastError()' in fileexternal/cu-feat-extr/src/cudafeat/feature-online-batched-spectral-cuda.cc line 199'
cudaError_t 700 : "an illegal memory access was encountered" returned from 'cudaGetLastError()' in fileexternal/cu-feat-extr/src/cudafeat/feature-online-batched-spectral-cuda.cc line 257'
cudaError_t 700 : "an illegal memory access was encountered" returned from 'cudaMemset2DAsync( data_.get(), stride_ * sizeof(Real), 0, num_cols_ * sizeof(Real), num_rows_, cudaStreamPerThread)' in fileriva/utils/matrix/cu_matrix.cc line 122'
I1014 00:53:35.369984 376 stats_builder.h:100] {"specversion":"1.0","type":"riva.asr.recognize.v1","source":"","subject":"","id":"e44e9222-3d6c-4470-b964-b9e08dccb51d","datacontenttype":"application/json","time":"2023-10-14T00:53:34.916729835+00:00","data":{"release_version":"2.13.0","customer_uuid":"","ngc_org":"","ngc_team":"","ngc_org_team":"","container_uuid":"","language_code":"pt-BR","request_count":1,"audio_duration":0.0,"speech_duration":0.0,"status":2,"err_msg":"in ensemble 'conformer-pt-BR-asr-offline', cudaMemcpy (DeviceToHost) failed on 'CLASS_LOGITS, device 0"}}
Thank you.
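Since the errors above reportedly only appear with more than one GPU, one way to narrow this down is to pin Riva to a single device in config.sh and resend the same requests. This is a diagnostic sketch, not a confirmed fix. gpus_to_use is the config.sh variable mentioned earlier in the thread; the value "device=0" follows Docker's --gpus device syntax, and both the index 0 and the assumption that config.sh passes this value to Docker's --gpus option are guesses about this particular setup.

```shell
# config.sh (excerpt): expose only the first GPU to the Riva containers.
# "device=0" follows Docker's --gpus syntax; the index 0 is an assumption.
gpus_to_use="device=0"
```

After changing the value, restart the server with bash riva_start.sh. If the illegal memory access disappears with one device and returns with several, that points toward the multi-GPU path rather than the model or the client.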