Riva ASR quickstart throws cudaError: "an illegal memory access was encountered"

Environment:

  • Hardware - GPU A100
  • Hardware - CPU Intel(R) Xeon(R) Gold 6342 CPU @ 2.80GHz
  • Operating System - Ubuntu 20.04.5 LTS
  • Riva Version - 2.10.0
  • NVIDIA driver version: 525.60.13
  • CUDA version: 12.0
  • Docker version: 23.0.1

Steps to reproduce:

bash riva_init.sh
bash riva_start.sh

Then send any streaming ASR requests from a client (nvidia-riva-client==2.10.0).

Results:

The Riva server cannot process any ASR requests and logs many errors like the following:

cudaError_t 700 : "an illegal memory access was encountered" returned from 'cudaMemset2DAsync( data_.get(), stride_ * sizeof(Real), 0, num_cols_ * sizeof(Real), num_rows_, cudaStreamPerThread)' in fileriva/utils/matrix/cu_matrix.cc line 122'
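For triage, it can help to tally these errors by call site before reading through the full riva_speech.log. The following is a minimal Python sketch, not part of Riva: the regular expression is inferred only from the error line quoted above, and the names `CUDA_ERROR` and `tally_cuda_errors` are hypothetical.

```python
import re
from collections import Counter

# Pattern inferred from the single error line quoted in this thread (assumption);
# the real log may contain variants that this sketch does not cover.
CUDA_ERROR = re.compile(
    r'cudaError_t (?P<code>\d+) : "(?P<message>[^"]*)" returned from '
    r"'(?P<call>\w+)\(.*?in file\s*(?P<file>\S+) line (?P<line>\d+)"
)


def tally_cuda_errors(lines):
    """Count CUDA errors per (error code, CUDA call, source file, line number)."""
    counts = Counter()
    for text in lines:
        match = CUDA_ERROR.search(text)
        if match:
            key = (
                int(match["code"]),
                match["call"],
                match["file"],
                int(match["line"]),
            )
            counts[key] += 1
    return counts


# The error line from this post, used as sample input.
sample = [
    'cudaError_t 700 : "an illegal memory access was encountered" returned from '
    "'cudaMemset2DAsync( data_.get(), stride_ * sizeof(Real), 0, "
    "num_cols_ * sizeof(Real), num_rows_, cudaStreamPerThread)' "
    "in fileriva/utils/matrix/cu_matrix.cc line 122'",
]

for call_site, count in tally_cuda_errors(sample).most_common():
    print(count, call_site)
```

To run it on a real log, pass an open file instead of `sample`, for example `tally_cuda_errors(open("riva_speech.log", errors="replace"))`.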

I also tried setting gpus_to_use="all", but nothing changed.
Note that this problem occurs with Riva versions starting from 2.8.0, while 2.7.0 works without any issues with the same configuration.

I hope these files help with the investigation.

config.sh (12.7 KB)
nvidia-smi.txt (2.3 KB)
riva_init.log (150.2 KB)
riva_speech.log (2.0 MB)
riva_start.log (265 Bytes)

Any help on this would be much appreciated.
Thanks in advance!

Hi @vbilous

Thanks for your interest in Riva.

Apologies that you are facing this issue.
Thanks for sharing the logs. I will check with the Riva team and provide updates.

Thanks

We also tried running Riva on a different VM, but the result is still the same.

Environment:

  • Hardware - GPU A40
  • Hardware - CPU Intel(R) Xeon(R) Gold 6354 CPU @ 3.00GHz
  • Operating System - Ubuntu 20.04.5 LTS
  • Riva Version - 2.11.0
  • NVIDIA driver version: 525.105.17
  • CUDA version: 12.0
  • Docker version: 23.0.1

config.sh (13.3 KB)
nvidia-smi.log (1.5 KB)
riva_init.log (171.5 KB)
riva_speech.log (365.4 KB)
riva_start.log (333 Bytes)

Thanks in advance

Hi @vbilous

Sincere apologies for the delay.

I have not yet completed the triage or found a solution, but I have one thing for you to test.

The CUDA version on your end is currently 12.0.

Can you kindly downgrade to CUDA 11.8 and check again?
Riva works with CUDA 11.8.

Please try this and let us know, while I gather more information.
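When reporting back after a driver or CUDA change, it may help to record what each visible GPU reports. Below is a minimal Python sketch that wraps `nvidia-smi --query-gpu`. The function names are hypothetical, and the sample output is illustrative (shaped like nvidia-smi's CSV format, using the A40 and driver version mentioned in this thread), not captured from a real machine.

```python
import csv
import io
import subprocess

# Fields requested from nvidia-smi; all three are standard --query-gpu properties.
QUERY_FIELDS = ["index", "name", "driver_version"]


def parse_gpu_inventory(csv_text):
    """Turn `nvidia-smi --format=csv,noheader` output into a list of dicts."""
    rows = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    return [
        dict(zip(QUERY_FIELDS, (cell.strip() for cell in row)))
        for row in rows
        if row  # skip blank lines
    ]


def query_gpu_inventory():
    """Run nvidia-smi on the host and return one dict per visible GPU."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=" + ",".join(QUERY_FIELDS),
            "--format=csv,noheader",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_inventory(result.stdout)


# Illustrative output only (assumption): two GPUs is a made-up count.
sample_output = "0, NVIDIA A40, 525.105.17\n1, NVIDIA A40, 525.105.17\n"

for gpu in parse_gpu_inventory(sample_output):
    print(gpu["index"], gpu["name"], gpu["driver_version"])
```

On a machine with the NVIDIA driver installed, `query_gpu_inventory()` returns the same structure for the real hardware.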

Thanks

Hi @vbilous

Apologies for the delay.

This issue has been fixed and will not occur in the next release of Riva.

The next release is tentatively due in the first week of July.

Thanks

Hi @vbilous

I understand the 2.12.0 release did not help solve the issue.
One question: can you confirm whether you are using a vGPU setup or a normal bare-metal setup?
If vGPU, can you share the drivers installed on the guest and the host?

Thanks

Hi @vbilous

Can you try enabling UVM once and let me know if the issue still persists?
Doc reference: Virtual GPU Software User Guide :: NVIDIA Virtual GPU Software Documentation

Thanks

Hi @rvinobha,

I’m facing the same issue. It only happens when using more than one GPU. Do you have a solution?

Here is my log:

cudaError_t 700 : "an illegal memory access was encountered" returned from 'cudaMemcpyAsync( this->data_.get(), &vec[0], vec.size() * sizeof(Real), cudaMemcpyHostToDevice, cudaStreamPerThread)' in fileriva/utils/matrix/cu_vector.cc line 90'
cudaError_t 700 : "an illegal memory access was encountered" returned from 'cudaMemcpyAsync( this->data_.get(), &vec[0], vec.size() * sizeof(Real), cudaMemcpyHostToDevice, cudaStreamPerThread)' in fileriva/utils/matrix/cu_vector.cc line 90'
cudaError_t 700 : "an illegal memory access was encountered" returned from 'cudaGetLastError()' in fileexternal/cu-feat-extr/src/cudafeat/feature-online-batched-spectral-cuda.cc line 122'
cudaError_t 700 : "an illegal memory access was encountered" returned from 'cudaGetLastError()' in fileexternal/cu-feat-extr/src/cudafeat/feature-online-batched-spectral-cuda.cc line 149'
cudaError_t 700 : "an illegal memory access was encountered" returned from 'cudaGetLastError()' in fileexternal/cu-feat-extr/src/cudafeat/feature-online-batched-spectral-cuda.cc line 199'
cudaError_t 700 : "an illegal memory access was encountered" returned from 'cudaGetLastError()' in fileexternal/cu-feat-extr/src/cudafeat/feature-online-batched-spectral-cuda.cc line 257'
cudaError_t 700 : "an illegal memory access was encountered" returned from 'cudaMemset2DAsync( data_.get(), stride_ * sizeof(Real), 0, num_cols_ * sizeof(Real), num_rows_, cudaStreamPerThread)' in fileriva/utils/matrix/cu_matrix.cc line 122

I1014 00:53:35.369984 376 stats_builder.h:100] {"specversion":"1.0","type":"riva.asr.recognize.v1","source":"","subject":"","id":"e44e9222-3d6c-4470-b964-b9e08dccb51d","datacontenttype":"application/json","time":"2023-10-14T00:53:34.916729835+00:00","data":{"release_version":"2.13.0","customer_uuid":"","ngc_org":"","ngc_team":"","ngc_org_team":"","container_uuid":"","language_code":"pt-BR","request_count":1,"audio_duration":0.0,"speech_duration":0.0,"status":2,"err_msg":"in ensemble 'conformer-pt-BR-asr-offline', cudaMemcpy (DeviceToHost) failed on 'CLASS_LOGITS, device 0"}}
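The stats_builder entry carries a JSON payload, so the failing model and error message can be pulled out programmatically. This is a minimal Python sketch, assuming the entry sits on a single line with plain double quotes in the actual riva_speech.log; `parse_stats_entry` is a hypothetical helper, not a Riva API.

```python
import json


def parse_stats_entry(log_line):
    """Return the JSON payload of a stats_builder log line, or None if there is none."""
    start = log_line.find("{")
    if start == -1:
        return None
    return json.loads(log_line[start:])


# The entry from this post, rejoined onto one logical line.
sample = (
    'I1014 00:53:35.369984 376 stats_builder.h:100] '
    '{"specversion":"1.0","type":"riva.asr.recognize.v1","source":"","subject":"",'
    '"id":"e44e9222-3d6c-4470-b964-b9e08dccb51d",'
    '"datacontenttype":"application/json",'
    '"time":"2023-10-14T00:53:34.916729835+00:00",'
    '"data":{"release_version":"2.13.0","customer_uuid":"","ngc_org":"",'
    '"ngc_team":"","ngc_org_team":"","container_uuid":"","language_code":"pt-BR",'
    '"request_count":1,"audio_duration":0.0,"speech_duration":0.0,"status":2,'
    '"err_msg":"in ensemble \'conformer-pt-BR-asr-offline\', '
    'cudaMemcpy (DeviceToHost) failed on \'CLASS_LOGITS, device 0"}}'
)

entry = parse_stats_entry(sample)
data = entry["data"]
print(data["release_version"], data["language_code"], data["status"])
print(data["err_msg"])
```

Filtering a whole log for entries whose `err_msg` is non-empty gives a quick list of failed requests and the device each one names.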

Thank you.