I was doing some testing with multiple GPUs and I get this error: "Error: 2 UNKNOWN: in ensemble 'citrinet-1024-en-US-asr-streaming', audio_signal: failed to perform CUDA copy: an illegal memory access was encountered"
Hi @ryein
Thanks for your interest in Riva.
The error seems to be related to the GPU.
Could you please share the following details:
- GPUs Used (Model etc.)
- Complete output of docker logs riva-speech
- config.sh used
The issue also looks related to GPU memory, so could you track memory usage with nvidia-smi while the Riva server is running?
Thanks
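For tracking memory per GPU while the server handles requests, a small polling script can be easier to read than watching nvidia-smi by hand. The following is only a minimal sketch and not part of Riva: the function names (parse_memory_csv, read_gpu_memory, watch) are made up for illustration, and it assumes nvidia-smi is on the PATH of the host.

```python
import subprocess
import time

# Standard nvidia-smi query: one CSV row per GPU, values in MiB, no header or units.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_memory_csv(text):
    """Turn 'index, used, total' rows into {gpu_index: (used_mib, total_mib)}."""
    usage = {}
    for row in text.strip().splitlines():
        index, used, total = (field.strip() for field in row.split(","))
        usage[int(index)] = (int(used), int(total))
    return usage


def read_gpu_memory():
    """Run nvidia-smi once and return the parsed per-GPU memory usage."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_memory_csv(result.stdout)


def watch(interval_s=5.0, samples=12):
    """Print per-GPU memory usage every interval_s seconds, samples times."""
    for _ in range(samples):
        for index, (used, total) in sorted(read_gpu_memory().items()):
            print(f"GPU{index}: {used}/{total} MiB")
        time.sleep(interval_s)
```

Calling watch() in a second terminal while sending requests to the server should show whether any single GPU gets close to its total memory before the error appears.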
Hi @ryein
One more input is required:
Could you share the CUDA version used at your end?
Thanks
- GPUs: 8 x Tesla V100-SXM2
- Logs: Ubuntu Pastebin
- config.sh: stock, except that citrinet_1024 is selected for English and all 8 GPUs are used (devices=0,1,2,3,4,5,6,7)
- Host OS has CUDA 12
When I looked at the memory, GPU 1 had the highest usage at around 10-11 GB and the others had about 3 GB used. Memory never seems to be an issue. I think it just has to do with some multi-GPU error, since this setup works fine with only 1 card.
Topology:
        GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  NIC0  CPU Affinity  NUMA Affinity
GPU0    X     NV2   NV1   NV2   NV1   SYS   SYS   SYS   SYS   0-13,28-41    0
GPU1    NV2   X     NV2   NV1   SYS   NV1   SYS   SYS   SYS   0-13,28-41    0
GPU2    NV1   NV2   X     NV1   SYS   SYS   NV2   SYS   SYS   0-13,28-41    0
GPU3    NV2   NV1   NV1   X     SYS   SYS   SYS   NV2   SYS   0-13,28-41    0
GPU4    NV1   SYS   SYS   SYS   X     NV2   NV1   NV2   PHB   14-27,42-55   1
GPU5    SYS   NV1   SYS   SYS   NV2   X     NV2   NV1   PHB   14-27,42-55   1
GPU6    SYS   SYS   NV2   SYS   NV1   NV2   X     NV1   PHB   14-27,42-55   1
GPU7    SYS   SYS   SYS   NV2   NV2   NV1   NV1   X     PHB   14-27,42-55   1
NIC0    SYS   SYS   SYS   SYS   PHB   PHB   PHB   PHB   X
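In this matrix, NV1/NV2 entries are NVLink connections and SYS entries are paths that cross PCIe and the interconnect between NUMA nodes. To see which GPU pairs have no direct NVLink, the matrix can be scanned with a short script. This is a sketch for reading the table only, not a diagnosis of the error; the function name sys_only_pairs is hypothetical, and the input is assumed to be the text printed by nvidia-smi topo -m.

```python
def sys_only_pairs(topo_text):
    """Return GPU pairs whose only link is SYS in an `nvidia-smi topo -m` matrix."""
    rows = [line.split() for line in topo_text.strip().splitlines() if line.strip()]
    # The header row lists the GPU columns first, followed by NIC and affinity columns.
    gpus = [name for name in rows[0] if name.startswith("GPU")]
    pairs = []
    for row in rows[1:]:
        if row[0] not in gpus:
            continue  # skip NIC rows and any legend text
        i = gpus.index(row[0])
        # Only look at the upper triangle so each pair is reported once.
        for j, link in enumerate(row[1:1 + len(gpus)]):
            if j > i and link == "SYS":
                pairs.append((row[0], gpus[j]))
    return pairs
```

Applied to the matrix above, this lists 12 pairs, for example GPU0 with GPU5, GPU6 and GPU7. Every such pair has one GPU from 0-3 (NUMA node 0) and one from 4-7 (NUMA node 1).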
Hi @ryein
Thanks for sharing the details
Can you try running Riva with CUDA 11.8 instead of 12?
Thanks
I was digging into this to install a different version, and nvcc reports a different version from nvidia-smi:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0
NVIDIA-SMI 525.105.17 Driver Version: 525.105.17 CUDA Version: 12.0
This is the host system.
I installed 11.8 directly from the NVIDIA website and still got the error.
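On the nvcc versus nvidia-smi difference: the two tools report different things. nvidia-smi shows the highest CUDA version the installed driver supports, while nvcc shows the version of the locally installed toolkit, so 12.0 and 11.5 can both be correct on the same host. A minimal sketch for comparing the two from their printed output follows; the function names are hypothetical and the regular expressions only assume the output formats quoted above.

```python
import re


def toolkit_version(nvcc_output):
    """Extract (major, minor) from the 'release X.Y' line of `nvcc --version`."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    return (int(match.group(1)), int(match.group(2))) if match else None


def driver_cuda_version(smi_output):
    """Extract (major, minor) from the 'CUDA Version: X.Y' field of nvidia-smi."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_output)
    return (int(match.group(1)), int(match.group(2))) if match else None


def toolkit_within_driver(toolkit, driver):
    """True when the toolkit version does not exceed what the driver supports."""
    return toolkit <= driver
```

With the values from this thread, the toolkit (11, 5) is within the driver's (12, 0), so the mismatch by itself does not indicate a broken installation.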
Hi @ryein
Thanks for checking this out,
I will escalate the issue to the internal team and provide an update.
Thanks
Hi,
I'm having the same issue with two RTX 3090 GPUs. It only happens when using both GPUs.
Have you found a solution?
Here is my log:
cudaError_t 700 : "an illegal memory access was encountered" returned from 'cudaMemcpyAsync( this->data_.get(), &vec[0], vec.size() * sizeof(Real), cudaMemcpyHostToDevice, cudaStreamPerThread)' in fileriva/utils/matrix/cu_vector.cc line 90'
cudaError_t 700 : "an illegal memory access was encountered" returned from 'cudaMemcpyAsync( this->data_.get(), &vec[0], vec.size() * sizeof(Real), cudaMemcpyHostToDevice, cudaStreamPerThread)' in fileriva/utils/matrix/cu_vector.cc line 90'
cudaError_t 700 : "an illegal memory access was encountered" returned from 'cudaGetLastError()' in fileexternal/cu-feat-extr/src/cudafeat/feature-online-batched-spectral-cuda.cc line 122'
cudaError_t 700 : "an illegal memory access was encountered" returned from 'cudaGetLastError()' in fileexternal/cu-feat-extr/src/cudafeat/feature-online-batched-spectral-cuda.cc line 149'
cudaError_t 700 : "an illegal memory access was encountered" returned from 'cudaGetLastError()' in fileexternal/cu-feat-extr/src/cudafeat/feature-online-batched-spectral-cuda.cc line 199'
cudaError_t 700 : "an illegal memory access was encountered" returned from 'cudaGetLastError()' in fileexternal/cu-feat-extr/src/cudafeat/feature-online-batched-spectral-cuda.cc line 257'
cudaError_t 700 : "an illegal memory access was encountered" returned from 'cudaMemset2DAsync( data_.get(), stride_ * sizeof(Real), 0, num_cols_ * sizeof(Real), num_rows_, cudaStreamPerThread)' in fileriva/utils/matrix/cu_matrix.cc line 122'
I1014 00:53:35.369984 376 stats_builder.h:100] {"specversion":"1.0","type":"riva.asr.recognize.v1","source":"","subject":"","id":"e44e9222-3d6c-4470-b964-b9e08dccb51d","datacontenttype":"application/json","time":"2023-10-14T00:53:34.916729835+00:00","data":{"release_version":"2.13.0","customer_uuid":"","ngc_org":"","ngc_team":"","ngc_org_team":"","container_uuid":"","language_code":"pt-BR","request_count":1,"audio_duration":0.0,"speech_duration":0.0,"status":2,"err_msg":"in ensemble 'conformer-pt-BR-asr-offline', cudaMemcpy (DeviceToHost) failed on 'CLASS_LOGITS, device 0"}}
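When a log like this gets long, it can help to count how often each call site reports the error. Below is a rough sketch that does this for the cudaError_t records; the name summarize_cuda_errors and the regular expression are my own assumptions based only on the record format shown in this log, and the script tolerates lines that a terminal wrapped at 80 columns.

```python
import re
from collections import Counter

# One record looks like: cudaError_t <code> : "<msg>" returned from '<call>' in file<path> line <n>'
# Quotes may be plain or typographic depending on how the log was copied.
SITE_RE = re.compile(
    r"cudaError_t (\d+) .*? in file\s*['‘]?([\w./-]+?)['’]? line (\d+)",
    re.S,
)


def summarize_cuda_errors(log_text):
    """Count (error_code, source_file, line) triples found in the log text."""
    # Terminal wrapping can split a token across lines, so drop newlines first.
    joined = log_text.replace("\n", "")
    return Counter(
        (int(code), path, int(line)) for code, path, line in SITE_RE.findall(joined)
    )


SAMPLE = (
    "cudaError_t 700 : \"an illegal memory access was encountered\" returned from 'cuda\n"
    "GetLastError()' in fileexternal/cu-feat-extr/src/cudafeat/feature-online-batched\n"
    "-spectral-cuda.cc line 122'\n"
    "cudaError_t 700 : \"an illegal memory access was encountered\" returned from 'cuda\n"
    "MemcpyAsync( this->data_.get(), &vec[0], vec.size() * sizeof(Real), cudaMemcpyHo\n"
    "stToDevice, cudaStreamPerThread)' in fileriva/utils/matrix/cu_vector.cc line 90'\n"
)

for (code, path, line), count in summarize_cuda_errors(SAMPLE).items():
    print(f"error {code} at {path}:{line} x{count}")
```

In the log above, every record carries the same code 700, which suggests one failure followed by repeats at later call sites rather than several independent errors.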