0 Compatible Profiles for Llama 3.1 70B

ronanmcgovern · July 25, 2024, 11:39am

I’m running nvcr.io/nim/meta/llama-3.1-70b-instruct:latest on 1xH100 SXM (same issue with 2xH100 SXM).

I’m getting the error:

2024-07-25T11:16:04.260537238Z INFO 07-25 11:16:04.260 ngc_profile.py:224] Detected 0 compatible profile(s).
2024-07-25T11:16:04.260603041Z ERROR 07-25 11:16:04.260 utils.py:21] Could not find a profile that is currently runnable with the detected hardware. Please check the system information below and make sure you have enough free GPUs.
2024-07-25T11:16:04.260630682Z SYSTEM INFO
2024-07-25T11:16:04.260636242Z - Free GPUs:
2024-07-25T11:16:04.260640641Z   -  [2330:10de] (0) NVIDIA H100 80GB HBM3 (H100 80GB) [current utilization: 0%]

The right approach is probably to list the profiles, but I can’t do that because I don’t have a way to pass docker arguments. I tried guessing some profiles and setting them with the NIM_MODEL_PROFILES env variable (see here), but that didn’t work.

Can someone recommend what profile to pass? Or what other GPU to use if H100 SXMs aren’t supported (which would be odd).

neal.vaidya · July 25, 2024, 3:25pm

Hi @ronanmcgovern, at the moment we don’t support deploying llama-3.1-70b-instruct on a single (or 2) H100s with NIM – the minimum is 4.

You can see what’s supported on this page: Support Matrix - NVIDIA Docs
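
For anyone hitting the same "0 compatible profiles" message, here is a minimal sketch of a pre-flight check that compares the number of visible GPUs with the minimum quoted above (4 for llama-3.1-70b-instruct on H100). The helper names count_gpus and has_enough_gpus are made up for illustration, and the value 4 is taken from this reply rather than read from the support matrix, so adjust it for other models or GPUs.

```shell
# Minimum GPU count for llama-3.1-70b-instruct on H100, as stated in the reply
# above. This is an assumption hard-coded for illustration; check the support
# matrix for other models or GPU types.
MIN_GPUS=4

# Count non-empty lines on stdin (nvidia-smi prints one line per GPU below).
# "|| true" keeps the exit status clean when the count is 0.
count_gpus() {
  grep -c . || true
}

# Succeeds when the given GPU count meets the minimum.
has_enough_gpus() {
  [ "$1" -ge "$MIN_GPUS" ] 2>/dev/null
}

if command -v nvidia-smi >/dev/null 2>&1; then
  gpus="$(nvidia-smi --query-gpu=name --format=csv,noheader | count_gpus)"
  if has_enough_gpus "$gpus"; then
    echo "Found $gpus GPU(s); meets the minimum of $MIN_GPUS"
  else
    echo "Found $gpus GPU(s); fewer than the minimum of $MIN_GPUS"
  fi
else
  echo "nvidia-smi not found; cannot count GPUs"
fi
```

This only counts devices; it does not check GPU type or free memory, both of which also affect which profiles are reported as compatible.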

ronanmcgovern · July 25, 2024, 4:26pm

Ah shucks, yeah I looked at that page but didn’t appreciate that those are minimum numbers (makes sense).

Thanks Neal, just got a 4xH100 SXM running (and now it sees valid profiles) but hit this bug:

2024-07-25T16:21:08.921579444Z Error: Failed to initialize the TMA descriptor 1
2024-07-25T16:21:08.921582829Z TMA Desc Addr:   0x7f142535c2c0
2024-07-25T16:21:08.921586105Z format         9
2024-07-25T16:21:08.921589533Z dim            3
2024-07-25T16:21:08.921592714Z gmem_address   0
2024-07-25T16:21:08.921596218Z globalDim      (7168,1,1,1,1)
2024-07-25T16:21:08.921599419Z globalStrides  (2,0,0,0,0)
2024-07-25T16:21:08.921602459Z boxDim         (32,64,1,1,1)
2024-07-25T16:21:08.921606208Z elementStrides (1,1,1,1,1)
2024-07-25T16:21:08.921609518Z interleave     0
2024-07-25T16:21:08.921612924Z swizzle        2
2024-07-25T16:21:08.921615893Z l2Promotion    2
2024-07-25T16:21:08.921619688Z oobFill        0

Also, do you know whether there will be a 405B NIM out soon? I don’t see it just yet.

Tagging my friend @sagar.desai, also at Nvidia.

neal.vaidya · July 25, 2024, 4:41pm

Hey @ronanmcgovern that’s a new one to me, any chance you can upload the full logs, and any other environment info you can share?

For 405B, it should be out in a few days; we caught a last-minute bug that needed fixing.

ronanmcgovern · July 31, 2024, 8:07am

Hi Neal, the issue solved itself and the 8B and 70B templates ran the next day (70B needs at least 4X H100 to run). Thanks for the response. BTW, you might find this video I made of interest - https://youtu.be/I0ccoL80h9Y

aktsvigun · October 23, 2024, 10:46am

Hi @neal.vaidya ,

I get the same error after launching NIM according to the instructions.
Specifically, I launch it with:

docker run -it --rm --name=$CONTAINER_NAME \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16GB \
  -e NGC_API_KEY=$NGC_API_KEY \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -u $(id -u) \
  -p 8000:8000 \
  $IMG_NAME

And I get a long list of errors like these:

TMA Desc Addr:   0x7fd0667502c0
format         9
dim            3
gmem_address   0
globalDim      (4096,1,1,1,1)
globalStrides  (2,0,0,0,0)
boxDim         (32,64,1,1,1)
elementStrides (1,1,1,1,1)
interleave     0
swizzle        2
l2Promotion    2
oobFill        0
Error: Failed to initialize the TMA descriptor 1
TMA Desc Addr:   0x7fd0667502c0
format         9
dim            3
gmem_address   0
globalDim      (14336,1,1,1,1)
globalStrides  (2,0,0,0,0)
boxDim         (32,64,1,1,1)
elementStrides (1,1,1,1,1)
interleave     0
swizzle        2
l2Promotion    2
oobFill        0
Error: Failed to initialize the TMA descriptor 1
TMA Desc Addr:   0x7fd0667502c0
format         9
dim            3
gmem_address   0
globalDim      (14336,1,1,1,1)
globalStrides  (2,0,0,0,0)
boxDim         (32,64,1,1,1)
elementStrides (1,1,1,1,1)
interleave     0
swizzle        2
l2Promotion    2
oobFill        0
Error: Failed to initialize the TMA descriptor 1
TMA Desc Addr:   0x7fd0667502c0
format         9
dim            3
gmem_address   0
globalDim      (4096,1,1,1,1)
globalStrides  (2,0,0,0,0)
boxDim         (32,64,1,1,1)
elementStrides (1,1,1,1,1)
interleave     0
swizzle        2
l2Promotion    2
oobFill        0
Error: Failed to initialize the TMA descriptor 1
TMA Desc Addr:   0x7fd0667502c0
format         9
dim            3
gmem_address   0
globalDim      (14336,1,1,1,1)
globalStrides  (2,0,0,0,0)
boxDim         (32,64,1,1,1)
elementStrides (1,1,1,1,1)
interleave     0
swizzle        2
l2Promotion    2
oobFill        0
Error: Failed to initialize the TMA descriptor 1

My image name is nvcr.io/nim/meta/llama-3.1-8b-base:1.1.2 and I’m launching it on 1xH100.
Could you please help here?

neal.vaidya · October 28, 2024, 2:10am

Hi @aktsvigun – if you’re able to update your driver to version 550+, that should resolve this issue.
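
For readers who want to confirm which driver they are on before updating, the following is a rough sketch. It reads the driver version with nvidia-smi and compares its major number with 550, the threshold mentioned in this reply. The helper names driver_major and driver_meets_minimum are hypothetical, and the check assumes the version string has the usual dotted form such as 535.104.05.

```shell
# Threshold taken from the reply above ("version 550+"); an assumption for this
# sketch, not a value read from NIM itself.
MIN_DRIVER_MAJOR=550

# Print the major component of a dotted version string, e.g. 535.104.05 -> 535.
driver_major() {
  printf '%s\n' "${1%%.*}"
}

# Succeeds when the major version is at least MIN_DRIVER_MAJOR.
# An empty or non-numeric version is treated as not meeting the minimum.
driver_meets_minimum() {
  [ "$(driver_major "$1")" -ge "$MIN_DRIVER_MAJOR" ] 2>/dev/null
}

if command -v nvidia-smi >/dev/null 2>&1; then
  # nvidia-smi prints one line per GPU; the driver is shared, so take the first.
  version="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)"
  if driver_meets_minimum "$version"; then
    echo "Driver $version is at or above $MIN_DRIVER_MAJOR"
  else
    echo "Driver $version is below $MIN_DRIVER_MAJOR; consider updating"
  fi
else
  echo "nvidia-smi not found; cannot check the driver version"
fi
```

Run it on the host rather than inside the container, since the driver is installed on the host.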
