I was doing some testing with multiple GPUs and I get this error: "Error: 2 UNKNOWN: in ensemble 'citrinet-1024-en-US-asr-streaming', audio_signal: failed to perform CUDA copy: an illegal memory access was encountered"
Hi @ryein
Thanks for your interest in Riva.
The error appears to be GPU-related.
Could you please share the following details:
- GPUs Used (Model etc.)
- complete output of
docker logs riva-speech
- config.sh used
The issue also looks related to GPU memory, so could you track memory usage with nvidia-smi while the Riva server is running?
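If it helps, the memory tracking can be scripted. The following is only a minimal sketch, assuming `nvidia-smi` is on the PATH of the host; the function names (`parse_gpu_memory`, `read_gpu_memory`, `poll_gpu_memory`) are made up for illustration and are not part of Riva.

```python
import subprocess
import time

# Standard nvidia-smi query flags; with "nounits" the values are plain MiB numbers.
QUERY_CMD = [
    "nvidia-smi",
    "--query-gpu=index,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(csv_text):
    """Turn 'index, used, total' CSV rows into a list of dicts (values in MiB)."""
    rows = []
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue
        index, used, total = (field.strip() for field in line.split(","))
        rows.append({"index": int(index), "used_mib": int(used), "total_mib": int(total)})
    return rows


def read_gpu_memory():
    """Run nvidia-smi once and return the parsed per-GPU memory rows."""
    result = subprocess.run(QUERY_CMD, capture_output=True, text=True, check=True)
    return parse_gpu_memory(result.stdout)


def poll_gpu_memory(samples=10, interval_s=1.0):
    """Print per-GPU memory usage a fixed number of times (hypothetical helper)."""
    for _ in range(samples):
        for gpu in read_gpu_memory():
            print(f"GPU{gpu['index']}: {gpu['used_mib']} / {gpu['total_mib']} MiB")
        time.sleep(interval_s)
```

Calling `poll_gpu_memory()` in a second terminal while the failing request is sent should show whether any GPU gets close to its memory limit.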
Thanks
Hi @ryein
One more input is needed:
Could you also share the CUDA version you are using?
Thanks
- GPU - Tesla V100-SXM2 x 8
- Ubuntu Pastebin
- Stock config.sh, except that citrinet_1024 is selected for English and all 8 GPUs are used (devices=0,1,2,3,4,5,6,7)
- Host OS has CUDA 12
When I was looking at the memory usage, GPU 1 had the highest at around 10-11 GB and the others had about 3 GB used. Memory never seems to be an issue. I think it has to do with some other error, since this setup works fine when run with only 1 card.
Topology:
      GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  NIC0  CPU Affinity  NUMA Affinity
GPU0  X     NV2   NV1   NV2   NV1   SYS   SYS   SYS   SYS   0-13,28-41    0
GPU1  NV2   X     NV2   NV1   SYS   NV1   SYS   SYS   SYS   0-13,28-41    0
GPU2  NV1   NV2   X     NV1   SYS   SYS   NV2   SYS   SYS   0-13,28-41    0
GPU3  NV2   NV1   NV1   X     SYS   SYS   SYS   NV2   SYS   0-13,28-41    0
GPU4  NV1   SYS   SYS   SYS   X     NV2   NV1   NV2   PHB   14-27,42-55   1
GPU5  SYS   NV1   SYS   SYS   NV2   X     NV2   NV1   PHB   14-27,42-55   1
GPU6  SYS   SYS   NV2   SYS   NV1   NV2   X     NV1   PHB   14-27,42-55   1
GPU7  SYS   SYS   SYS   NV2   NV2   NV1   NV1   X     PHB   14-27,42-55   1
NIC0  SYS   SYS   SYS   SYS   PHB   PHB   PHB   PHB   X
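For anyone reading the matrix: NV1/NV2 entries are direct NVLink connections, while SYS means the path goes over PCIe and the interconnect between NUMA nodes. A rough sketch of how the matrix could be read programmatically is below; `parse_topology` and `gpu_pairs_without_nvlink` are hypothetical helper names, and this only lists the pairs, it does not show that they cause the error.

```python
# Topology matrix as posted in this thread (nvidia-smi topo -m style output).
TOPOLOGY = """
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 CPU Affinity NUMA Affinity
GPU0 X NV2 NV1 NV2 NV1 SYS SYS SYS SYS 0-13,28-41 0
GPU1 NV2 X NV2 NV1 SYS NV1 SYS SYS SYS 0-13,28-41 0
GPU2 NV1 NV2 X NV1 SYS SYS NV2 SYS SYS 0-13,28-41 0
GPU3 NV2 NV1 NV1 X SYS SYS SYS NV2 SYS 0-13,28-41 0
GPU4 NV1 SYS SYS SYS X NV2 NV1 NV2 PHB 14-27,42-55 1
GPU5 SYS NV1 SYS SYS NV2 X NV2 NV1 PHB 14-27,42-55 1
GPU6 SYS SYS NV2 SYS NV1 NV2 X NV1 PHB 14-27,42-55 1
GPU7 SYS SYS SYS NV2 NV2 NV1 NV1 X PHB 14-27,42-55 1
NIC0 SYS SYS SYS SYS PHB PHB PHB PHB X
"""


def parse_topology(text):
    """Map (row_device, column_device) -> link type, e.g. ('GPU0', 'GPU1') -> 'NV2'."""
    lines = [ln.split() for ln in text.strip().splitlines() if ln.strip()]
    # Device columns are the header tokens named GPUn or NICn; the rest is affinity info.
    devices = [tok for tok in lines[0] if tok.startswith(("GPU", "NIC"))]
    links = {}
    for tokens in lines[1:]:
        row_name = tokens[0]
        cells = tokens[1:1 + len(devices)]
        for col_name, cell in zip(devices, cells):
            links[(row_name, col_name)] = cell
    return links


def gpu_pairs_without_nvlink(links):
    """Return sorted GPU pairs whose link type does not start with 'NV'."""
    pairs = set()
    for (a, b), link in links.items():
        # a < b keeps each pair once and skips the diagonal (string compare is
        # fine here because the GPU indices are single digits).
        if a.startswith("GPU") and b.startswith("GPU") and a < b and not link.startswith("NV"):
            pairs.add((a, b))
    return sorted(pairs)


if __name__ == "__main__":
    for pair in gpu_pairs_without_nvlink(parse_topology(TOPOLOGY)):
        print(pair)
```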
Hi @ryein
Thanks for sharing the details.
Can you try running Riva with CUDA 11.8 instead of 12?
Thanks
I was trying to dig into this more deeply in order to install a different version, and nvcc reports something different from nvidia-smi:
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2021 NVIDIA Corporation
Built on Thu_Nov_18_09:45:30_PST_2021
Cuda compilation tools, release 11.5, V11.5.119
Build cuda_11.5.r11.5/compiler.30672275_0
NVIDIA-SMI 525.105.17 Driver Version: 525.105.17 CUDA Version: 12.0
This is the host system
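The two numbers measure different things: nvcc reports the version of the CUDA toolkit installed on the host, while the "CUDA Version" in the nvidia-smi header is the highest CUDA version the installed driver supports. A small sketch of that comparison, using the two outputs from this thread as sample input (the function names are hypothetical):

```python
import re

# Sample lines copied from the outputs posted in this thread.
NVCC_SAMPLE = "Cuda compilation tools, release 11.5, V11.5.119"
SMI_SAMPLE = "NVIDIA-SMI 525.105.17 Driver Version: 525.105.17 CUDA Version: 12.0"


def toolkit_version(nvcc_output):
    """Installed toolkit version from `nvcc --version` text, or None if not found."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    return (int(match.group(1)), int(match.group(2))) if match else None


def driver_max_cuda_version(smi_output):
    """Highest CUDA version the driver supports, from the nvidia-smi header, or None."""
    match = re.search(r"CUDA Version:\s*(\d+)\.(\d+)", smi_output)
    return (int(match.group(1)), int(match.group(2))) if match else None


def driver_supports_toolkit(nvcc_output, smi_output):
    """True if the toolkit version is not newer than what the driver supports."""
    toolkit = toolkit_version(nvcc_output)
    driver_max = driver_max_cuda_version(smi_output)
    if toolkit is None or driver_max is None:
        return None  # one of the outputs could not be parsed
    return toolkit <= driver_max


if __name__ == "__main__":
    print(toolkit_version(NVCC_SAMPLE), driver_max_cuda_version(SMI_SAMPLE))
    print(driver_supports_toolkit(NVCC_SAMPLE, SMI_SAMPLE))
```

So a host with toolkit 11.5 and a driver that reports CUDA 12.0 is a consistent combination, not a conflict in itself.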
I installed 11.8 directly from the NVIDIA website and still got the error.
Hi @ryein
Thanks for checking this out.
I will escalate the issue to the internal team and provide an update.
Thanks