Multiple GPU error

I was doing some testing with multiple GPUs and I get this error: "Error: 2 UNKNOWN: in ensemble 'citrinet-1024-en-US-asr-streaming', audio_signal: failed to perform CUDA copy: an illegal memory access was encountered"

Hi @ryein

Thanks for your interest in Riva.

The error seems to be GPU-related.

Could you please share the following details:

  1. GPUs Used (Model etc.)
  2. complete output of docker logs riva-speech
  3. config.sh used

The issue also looks like it could be related to GPU memory, so could you track memory usage with nvidia-smi while the Riva server is running?
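One way to do that tracking is to poll nvidia-smi from a small script while requests are being served. The sketch below is a minimal example, assuming nvidia-smi is on the PATH; the `--query-gpu` and `--format` flags are standard nvidia-smi options, while the helper names (`parse_memory_csv`, `busiest_gpu`, `watch`) are made up for illustration.

```python
import csv
import io
import subprocess
import time

# Standard nvidia-smi query flags; prints one "index, used, total" row per GPU (MiB).
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory_csv(text):
    """Turn nvidia-smi CSV rows into {gpu_index: (used_mib, total_mib)}."""
    usage = {}
    for row in csv.reader(io.StringIO(text), skipinitialspace=True):
        if len(row) != 3:
            continue  # skip blank or unexpected lines
        index, used, total = (int(field) for field in row)
        usage[index] = (used, total)
    return usage


def busiest_gpu(usage):
    """Return the index of the GPU with the most memory in use."""
    return max(usage, key=lambda index: usage[index][0])


def watch(samples=10, interval_s=1.0):
    """Hypothetical helper: print per-GPU memory a few times while the server runs."""
    for _ in range(samples):
        out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
        usage = parse_memory_csv(out)
        print({gpu: f"{used}/{total} MiB" for gpu, (used, total) in usage.items()})
        time.sleep(interval_s)
```

Calling `watch()` in a second terminal while sending requests to the server shows whether any single GPU approaches its memory limit before the error appears.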

Thanks

Hi @ryein

One more input is needed.

Could you also share the CUDA version you are using?

Thanks

  1. GPU - Tesla V100-SXM2 x 8
  2. Ubuntu Pastebin
  3. Stock, except citrinet_1024 is selected with English as the language, and all 8 GPUs are used: devices=0,1,2,3,4,5,6,7
  4. Host OS has CUDA 12

When I looked at memory usage, GPU 1 had the highest at around 10-11 GB, and the others had about 3 GB used. Memory never seems to be an issue. I think it has to do with some other error, since this setup works fine with only 1 card.

Topology:
        GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  NIC0  CPU Affinity  NUMA Affinity
GPU0    X     NV2   NV1   NV2   NV1   SYS   SYS   SYS   SYS   0-13,28-41    0
GPU1    NV2   X     NV2   NV1   SYS   NV1   SYS   SYS   SYS   0-13,28-41    0
GPU2    NV1   NV2   X     NV1   SYS   SYS   NV2   SYS   SYS   0-13,28-41    0
GPU3    NV2   NV1   NV1   X     SYS   SYS   SYS   NV2   SYS   0-13,28-41    0
GPU4    NV1   SYS   SYS   SYS   X     NV2   NV1   NV2   PHB   14-27,42-55   1
GPU5    SYS   NV1   SYS   SYS   NV2   X     NV2   NV1   PHB   14-27,42-55   1
GPU6    SYS   SYS   NV2   SYS   NV1   NV2   X     NV1   PHB   14-27,42-55   1
GPU7    SYS   SYS   SYS   NV2   NV2   NV1   NV1   X     PHB   14-27,42-55   1
NIC0    SYS   SYS   SYS   SYS   PHB   PHB   PHB   PHB   X
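To read a matrix like this programmatically, the sketch below parses `nvidia-smi topo -m` style output and lists the GPU pairs that have no direct NVLink connection (entries that do not start with `NV`). It assumes the first non-empty line is the header row; the function names are made up for illustration. Nothing in this thread establishes that the topology causes the error, so treat the result as context rather than a diagnosis.

```python
# The matrix posted above, copied as whitespace-separated text.
THREAD_TOPOLOGY = """
     GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 CPU Affinity NUMA Affinity
GPU0 X    NV2  NV1  NV2  NV1  SYS  SYS  SYS  SYS  0-13,28-41   0
GPU1 NV2  X    NV2  NV1  SYS  NV1  SYS  SYS  SYS  0-13,28-41   0
GPU2 NV1  NV2  X    NV1  SYS  SYS  NV2  SYS  SYS  0-13,28-41   0
GPU3 NV2  NV1  NV1  X    SYS  SYS  SYS  NV2  SYS  0-13,28-41   0
GPU4 NV1  SYS  SYS  SYS  X    NV2  NV1  NV2  PHB  14-27,42-55  1
GPU5 SYS  NV1  SYS  SYS  NV2  X    NV2  NV1  PHB  14-27,42-55  1
GPU6 SYS  SYS  NV2  SYS  NV1  NV2  X    NV1  PHB  14-27,42-55  1
GPU7 SYS  SYS  SYS  NV2  NV2  NV1  NV1  X    PHB  14-27,42-55  1
NIC0 SYS  SYS  SYS  SYS  PHB  PHB  PHB  PHB  X
"""


def parse_topology(text):
    """Return (devices, links) where links[(row, col)] is the connection type."""
    rows = [line.split() for line in text.strip().splitlines() if line.strip()]
    # Assumption: the first non-empty line is the header; keep only device columns.
    devices = [tok for tok in rows[0] if tok.startswith(("GPU", "NIC"))]
    links = {}
    for tokens in rows[1:]:
        name = tokens[0]
        if name not in devices:
            continue  # skip legend or other non-matrix lines
        for peer, link in zip(devices, tokens[1:len(devices) + 1]):
            links[(name, peer)] = link
    return devices, links


def gpu_pairs_without_nvlink(devices, links):
    """List each unordered GPU pair whose link type is not NVLink (NV1, NV2, ...)."""
    gpus = [d for d in devices if d.startswith("GPU")]
    return [
        (a, b)
        for i, a in enumerate(gpus)
        for b in gpus[i + 1:]
        if not links[(a, b)].startswith("NV")
    ]


devices, links = parse_topology(THREAD_TOPOLOGY)
for pair in gpu_pairs_without_nvlink(devices, links):
    print(pair, links[pair])
```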

Hi @ryein

Thanks for sharing the details

Can you try running Riva with CUDA 11.8 instead of 12?

Thanks

I was digging into this further to install a different version, and noticed that nvcc reports a different CUDA version than nvidia-smi:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0


NVIDIA-SMI 525.105.17    Driver Version: 525.105.17    CUDA Version: 12.0

This is the host system
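The two numbers measure different things: nvcc prints the release of the installed CUDA toolkit, while the "CUDA Version" in the nvidia-smi header is the highest CUDA version the installed driver supports. The sketch below pulls both out of the text and applies a deliberately simplified check (toolkit release not newer than the driver's limit). The function names are made up for illustration, and the check ignores CUDA's minor-version and forward-compatibility options.

```python
import re


def nvcc_release(nvcc_output):
    """Extract (major, minor) from the 'release X.Y' text in `nvcc --version` output."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    return (int(match.group(1)), int(match.group(2))) if match else None


def driver_cuda_limit(smi_output):
    """Extract (major, minor) from the 'CUDA Version: X.Y' field of the nvidia-smi header."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_output)
    return (int(match.group(1)), int(match.group(2))) if match else None


def toolkit_within_driver_limit(toolkit, driver_limit):
    """Simplified rule: a toolkit no newer than the driver's reported version is supported."""
    return toolkit <= driver_limit


nvcc_text = "Cuda compilation tools, release 11.5, V11.5.119"
smi_text = "NVIDIA-SMI 525.105.17 Driver Version: 525.105.17 CUDA Version: 12.0"

toolkit = nvcc_release(nvcc_text)
limit = driver_cuda_limit(smi_text)
print(toolkit, limit, toolkit_within_driver_limit(toolkit, limit))
```

By this rule the 11.5 toolkit and the 12.0 driver limit shown above are not in conflict, so the mismatch on its own does not point to a broken install.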

I installed 11.8 directly from the NVIDIA website and still got the error.

Hi @ryein

Thanks for checking this out,

I will escalate the issue to the internal team and provide an update.

Thanks

Hi,

I’m having the same issue with two RTX 3090s. It only happens when both GPUs are used.

Have you received any solution?

Here is my log:

cudaError_t 700 : "an illegal memory access was encountered" returned from 'cudaMemcpyAsync( this->data_.get(), &vec[0], vec.size() * sizeof(Real), cudaMemcpyHostToDevice, cudaStreamPerThread)' in fileriva/utils/matrix/cu_vector.cc line 90'
cudaError_t 700 : "an illegal memory access was encountered" returned from 'cudaMemcpyAsync( this->data_.get(), &vec[0], vec.size() * sizeof(Real), cudaMemcpyHostToDevice, cudaStreamPerThread)' in fileriva/utils/matrix/cu_vector.cc line 90'
cudaError_t 700 : "an illegal memory access was encountered" returned from 'cudaGetLastError()' in fileexternal/cu-feat-extr/src/cudafeat/feature-online-batched-spectral-cuda.cc line 122'
cudaError_t 700 : "an illegal memory access was encountered" returned from 'cudaGetLastError()' in fileexternal/cu-feat-extr/src/cudafeat/feature-online-batched-spectral-cuda.cc line 149'
cudaError_t 700 : "an illegal memory access was encountered" returned from 'cudaGetLastError()' in fileexternal/cu-feat-extr/src/cudafeat/feature-online-batched-spectral-cuda.cc line 199'
cudaError_t 700 : "an illegal memory access was encountered" returned from 'cudaGetLastError()' in fileexternal/cu-feat-extr/src/cudafeat/feature-online-batched-spectral-cuda.cc line 257'
cudaError_t 700 : "an illegal memory access was encountered" returned from 'cudaMemset2DAsync( data_.get(), stride_ * sizeof(Real), 0, num_cols_ * sizeof(Real), num_rows_, cudaStreamPerThread)' in fileriva/utils/matrix/cu_matrix.cc line 122
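CUDA error 700 is `cudaErrorIllegalAddress`. Once it occurs, later CUDA calls in the same process keep returning it, so the earliest entry in a log like this is usually the most informative one. Because such errors are reported asynchronously, even that entry may name a call that ran after the faulting operation. The sketch below is a minimal way to pull the first entry and a per-call count out of a log, assuming each entry sits on a single line as in raw `docker logs riva-speech` output; the function name and regular expression are made up for illustration.

```python
import re
from collections import Counter

# Matches entries of the form:
#   cudaError_t <code> : "<message>" returned from '<call>(...'
# Straight and typographic quotes are both accepted.
CUDA_ERROR = re.compile(
    r"cudaError_t (\d+) : [\"“](.+?)[\"”] returned from ['‘](\w+)\("
)


def summarize_cuda_errors(log_text):
    """Return (first_error, counts) where counts maps (code, call) to occurrences."""
    first = None
    counts = Counter()
    for match in CUDA_ERROR.finditer(log_text):
        code, message, call = int(match.group(1)), match.group(2), match.group(3)
        if first is None:
            first = (code, message, call)  # earliest reported failure
        counts[(code, call)] += 1
    return first, counts
```

For example, `first, counts = summarize_cuda_errors(open("riva.log").read())`, where `riva.log` is a hypothetical file holding the saved container log.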

I1014 00:53:35.369984 376 stats_builder.h:100] {"specversion":"1.0","type":"riva.asr.recognize.v1","source":"","subject":"","id":"e44e9222-3d6c-4470-b964-b9e08dccb51d","datacontenttype":"application/json","time":"2023-10-14T00:53:34.916729835+00:00","data":{"release_version":"2.13.0","customer_uuid":"","ngc_org":"","ngc_team":"","ngc_org_team":"","container_uuid":"","language_code":"pt-BR","request_count":1,"audio_duration":0.0,"speech_duration":0.0,"status":2,"err_msg":"in ensemble 'conformer-pt-BR-asr-offline', cudaMemcpy (DeviceToHost) failed on 'CLASS_LOGITS, device 0"}}