Cuphy_fn_exception in AODT

Hello,

We have installed AODT on an A6000 GPU. However, we run into a problem when running the RAN simulation. The log output is as follows:

[INFO] Did not find attributes sim:duration/sim:interval with a positive value so using slot/symbol instead
[DEBUG] Scenario users: 1, batches: 1, slot/symbol mode: 1, slots_per_batch: 10, samples_per_slot: 1, duration (per batch): 0.005, interval: 0.0005, ue_min_speed=1.5, ue_max_speed=2.5, is_seeded=0, seed=0

====================================

TDD pattern: DDDDDDDDDDDDDDDDDDDD
gNB power      43.00 dBm
UE power       26.00 dBm
gNB antennas   4
UE antennas    4
DL HARQ        0
UL HARQ        0
====================================
terminate called after throwing an instance of 'cuphy::cuphy_fn_exception'
  what():  Function cuphyConvertTensor() returned CUPHY_STATUS_INTERNAL_ERROR: Internal error
Starting container...

The output of nvidia-smi is shown below:

$ nvidia-smi
Thu Aug  1 14:54:37 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A6000               Off | 00000000:01:00.0 Off |                  Off |
| 30%   37C    P8              26W / 300W |    275MiB / 49140MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2591      G   /usr/lib/xorg/Xorg                            4MiB |
|    0   N/A  N/A    617754      C   ./aodt_sim                                  260MiB |
+---------------------------------------------------------------------------------------+

It looks like the program is initialized correctly, but it terminates early. Could anyone help me find the cause of this?
Thank you.

Best Regards,
JunXian

@junxian

  1. Are you able to run a non-RAN simulation?
  2. It looks like you modified the default TDD scenario to DDDDDDDDDDDDDDDDDDDD. Are you able to run the default TDD scenario?
  1. Yes, we can run the EM solver and display the paths.
  2. We are currently reinstalling AODT and will try again later.

Thank you.

For the second question, I tried running ‘ran_sim’ on the RTX 4090 server and obtained the following results:

[DEBUG] Scenario users: 3, batches: 10, slot/symbol mode: 1, slots_per_batch: 3, samples_per_slot: 1, duration (per batch): 0.0015, interval: 0.0005, ue_min_speed=1.5, ue_max_speed=2.5, is_seeded=0, seed=0

====================================

TDD pattern: DDDDDDDDDDDDDDDDDDDD
gNB power      43.00 dBm
UE power       26.00 dBm
gNB antennas   4
UE antennas    4
DL HARQ        0
UL HARQ        0
====================================
[INFO] Computing all links for cell association for batch 0 (wideband CFRs enabled, fft_size=4096)...
[DEBUG] Found 135 paths for Tx 2
[DEBUG] Found 204 paths for Tx 1
[DEBUG] Timing EMSolver: 12.8861 ms

InitCellAssoc - cell=0 numUes=1 batch=0
InitCellAssoc = cellUeAssoc[0][0]=2
InitCellAssoc - cell=1 numUes=2 batch=0
InitCellAssoc = cellUeAssoc[1][0]=0
InitCellAssoc = cellUeAssoc[1][1]=1
[INFO] Computing sample=1/30, batch=0, slot=0, sample_within_slot=0 (wideband CFRs enabled, fft_size=4096)
[DEBUG] Found 540 paths for Tx 2
[DEBUG] Found 816 paths for Tx 1
[DEBUG] Timing EMSolver: 19.0548 ms
[DEBUG] Found 540 paths for Tx 2
[DEBUG] Found 816 paths for Tx 1
[DEBUG] Timing EMSolver: 19.4291 ms

==============================================  results  ================================================
cell idx    grp idx   rnti     startPrb     nPrb    MCS   layer     RV   sinrPreEq    sinrPostEq     CRC
   0         0          2         0         272       0      2       0    35.80        20.24           0
   1         1          0       140         132       0      2       0    36.58        37.16           0
   1         2          1         0         140       0      1       0    -0.45        18.42           0
=========================================================================================================
Note: sinrPostEq is capped at 40 dB and floored at -10 dB.
[DEBUG] Adding 2712 ray results
[DEBUG] Adding 393216 CFR results to clickhouse db
[INFO] Computing sample=2/30, batch=0, slot=1, sample_within_slot=0 (wideband CFRs enabled, fft_size=4096)
[DEBUG] Found 540 paths for Tx 2
[DEBUG] Found 816 paths for Tx 1
[DEBUG] Timing EMSolver: 20.2044 ms
[DEBUG] Found 540 paths for Tx 2
[DEBUG] Found 816 paths for Tx 1
[DEBUG] Timing EMSolver: 19.4743 ms

It seems the error is not caused by the configuration of ‘ran_sim’. Are there any specific configurations required for the A6000 GPU?

I have also listed the ‘nvidia-smi’ results below:

$ nvidia-smi
Fri Aug  2 21:32:07 2024
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090        Off | 00000000:01:00.0 Off |                  Off |
|  0%   36C    P8              11W / 450W |  13033MiB / 24564MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2486      G   /usr/lib/xorg/Xorg                            4MiB |
|    0   N/A  N/A     72826      C   ./aodt_sim                                12978MiB |
+---------------------------------------------------------------------------------------+

@junxian Please enable debug logs via the --log debug flag, e.g.:

OMNI_USER=omniverse OMNI_PASS=aerial_123456 ./build/aodt_sim --nucleus omniverse:// --log debug

@junxian Hi. May I ask if you have resolved this issue? I encountered the same problem on an A6000 platform, where the worker lost its connection due to a cuphy::cuphy_fn_exception shown in the docker logs.

@kpasad1 Thank you. We will rebuild the environment and follow your steps.

@guofachang In fact, we have not solved this problem yet. We went back to the RTX 4090, reinstalled AODT, and the problem went away. We will install AODT on the A6000 again and trace the bug further.

@junxian Your GPU is an RTX A6000. If you look here: CUDA GPUs - Compute Capability | NVIDIA Developer, this GPU has a compute capability of 8.6. However, you are running a container built for compute capability 8.9 (RTX 6000). We suspect that this is the problem.
The RTX 4090 has compute capability 8.9, which is why it works there.
You can instead run a container built for compute capability 8.0 (SM80). To do that:
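To make the compatibility rule concrete, here is a small sketch that picks a container tag from a GPU's compute capability. Only the SM80 tag is confirmed in this thread; the SM89 tag, the helper name, and the fallback rule (a GPU can run binaries built for the same major version and a lower-or-equal minor version) are my assumptions, based on CUDA's general binary-compatibility rules rather than anything AODT documents.

```python
# Sketch: choose an aodt-sim container tag from the GPU's compute capability.
# Assumptions: the SM89 tag exists (only SM80 is confirmed in this thread),
# and same-major / lower-or-equal-minor binaries are compatible.

# Compute capabilities cited in the thread (see NVIDIA's CUDA GPUs list).
COMPUTE_CAP = {
    "NVIDIA RTX A6000": (8, 6),
    "NVIDIA GeForce RTX 4090": (8, 9),
}

def pick_container_tag(gpu_name, available_tags=("SM80", "SM89")):
    """Return the highest container tag whose SM version the GPU can run.

    E.g. an sm_86 GPU can execute sm_80 binaries but not sm_89 ones.
    """
    major, minor = COMPUTE_CAP[gpu_name]
    candidates = []
    for tag in available_tags:
        t_major, t_minor = int(tag[2]), int(tag[3])  # "SM80" -> 8, 0
        if t_major == major and t_minor <= minor:
            candidates.append((t_minor, tag))
    if not candidates:
        raise ValueError(f"no compatible container tag for {gpu_name}")
    return max(candidates)[1]
```

Under these assumptions, the A6000 (8.6) falls back to SM80, while the 4090 (8.9) can use SM89, which matches the behaviour reported above.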

  1. Copy backend_bundule/docker-compose.yml to another file, e.g. backend_bundule/docker-compose-sm80.yml

  2. Bring down the docker containers:
    > docker compose down

  3. Edit docker-compose-sm80.yml with a text editor and make the following change:
    image: nvcr.io/esee5uzbruax/aodt-sim:1.0.0_runtime_$GEN_CODE → image: nvcr.io/esee5uzbruax/aodt-sim:1.0.0_runtime_SM80

  4. Restart the container:
    > docker compose -f docker-compose-sm80.yml up
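For reference, the relevant part of the edited docker-compose-sm80.yml might look like the fragment below. The service name and surrounding fields are illustrative (keep whatever the original file uses); only the image line changes.

```yaml
services:
  aodt_sim:   # illustrative service name; keep the one from the original file
    # was: image: nvcr.io/esee5uzbruax/aodt-sim:1.0.0_runtime_$GEN_CODE
    image: nvcr.io/esee5uzbruax/aodt-sim:1.0.0_runtime_SM80
```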

Let us know how it goes.

@kpasad1 Thank you. However, the A6000 GPU has been reallocated to run another project at the moment. When I have the opportunity to install AODT again, I will follow your steps and report the results. Thank you very much.

This topic was automatically closed 14 days after the last reply. New replies are no longer allowed.