We have installed AODT on an A6000 GPU. However, we run into a problem when we run the RAN simulation. The log is as follows:
[INFO] Did not find attributes sim:duration/sim:interval with a positive value so using slot/symbol instead
[DEBUG] Scenario users: 1, batches: 1, slot/symbol mode: 1, slots_per_batch: 10, samples_per_slot: 1, duration (per batch): 0.005, interval: 0.0005, ue_min_speed=1.5, ue_max_speed=2.5, is_seeded=0, seed=0
====================================
TDD pattern: DDDDDDDDDDDDDDDDDDDD
gNB power 43.00 dBm
UE power 26.00 dBm
gNB antennas 4
UE antennas 4
DL HARQ 0
UL HARQ 0
====================================
terminate called after throwing an instance of 'cuphy::cuphy_fn_exception'
what(): Function cuphyConvertTensor() returned CUPHY_STATUS_INTERNAL_ERROR: Internal error
Starting container...
The output of nvidia-smi is as follows:
$ nvidia-smi
Thu Aug 1 14:54:37 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000               Off | 00000000:01:00.0 Off |                  Off |
| 30%   37C    P8              26W / 300W |    275MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2591      G   /usr/lib/xorg/Xorg                            4MiB |
|    0   N/A  N/A    617754      C   ./aodt_sim                                  260MiB |
+---------------------------------------------------------------------------------------+
It looks like the program initializes correctly but then terminates early. Could anyone help me find the cause?
Thank you.
@junxian Hi. May I ask whether you have resolved this issue? I encountered the same problem on an A6000 platform, where the worker lost its connection due to the cuphy::cuphy_fn_exception shown in the docker logs.
@guofachang We have not solved this problem yet. We went back to an RTX 4090 and reinstalled AODT, and the problem went away. We will install the A6000 again and trace the bug.
@junxian your GPU is an RTX A6000. If you look here: CUDA GPUs - Compute Capability | NVIDIA Developer, this GPU has a compute capability of 8.6. However, you are running a container built for compute capability 8.9 (RTX 6000 Ada). We suspect that this mismatch is the problem.
The RTX 4090 has compute capability 8.9, which is why it works.
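To make the mismatch concrete, here is a minimal Python sketch (not part of AODT; the function names are hypothetical). It assumes a driver recent enough that nvidia-smi supports the compute_cap query field, and it applies only the CUDA binary-compatibility rule: native GPU code built for capability X.y runs on devices X.z with z >= y. It deliberately ignores PTX JIT fallback, which may or may not be available in a given container build.

```python
import subprocess
from typing import List, Tuple

Capability = Tuple[int, int]  # (major, minor), e.g. (8, 6) for the RTX A6000


def parse_compute_caps(text: str) -> List[Capability]:
    """Parse lines such as '8.6' (one per GPU) into (major, minor) tuples."""
    caps = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        major, minor = line.split(".")
        caps.append((int(major), int(minor)))
    return caps


def query_compute_caps() -> List[Capability]:
    """Ask nvidia-smi for the compute capability of every visible GPU.

    Assumption: the installed driver supports the compute_cap query field.
    """
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=compute_cap", "--format=csv,noheader"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return parse_compute_caps(out)


def binary_can_run(built_for: Capability, device: Capability) -> bool:
    """Binary-compatibility rule only: same major version, device minor >= build minor."""
    return device[0] == built_for[0] and device[1] >= built_for[1]


# Values taken from this thread: the A6000 is 8.6, the containers are 8.9 and 8.0.
a6000 = (8, 6)
print(binary_can_run((8, 9), a6000))  # False
print(binary_can_run((8, 0), a6000))  # True
```

On the host, calling query_compute_caps() and passing each result to binary_can_run() gives the same check for whatever GPUs are actually installed.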
You can run a container built for compute capability 8.0 (sm80) instead. To do that:
Copy backend_bundle/docker-compose.yml to another file, e.g. backend_bundle/docker-compose-sm80.yml
Bring down the docker containers:
> docker compose down
@kpasad1 Thank you. However, the A6000 GPU has been reallocated to run another project at the moment. When I have the opportunity to install AODT again, I will follow your steps and report the results. Thank you very much.