0 Compatible Profiles for Llama 3.1 70B

I’m running nvcr.io/nim/meta/llama-3.1-70b-instruct:latest on 1xH100 SXM (same issue with 2xH100 SXM).

I’m getting the error:

2024-07-25T11:16:04.260537238Z INFO 07-25 11:16:04.260 ngc_profile.py:224] Detected 0 compatible profile(s).
2024-07-25T11:16:04.260603041Z ERROR 07-25 11:16:04.260 utils.py:21] Could not find a profile that is currently runnable with the detected hardware. Please check the system information below and make sure you have enough free GPUs.
2024-07-25T11:16:04.260630682Z SYSTEM INFO
2024-07-25T11:16:04.260636242Z - Free GPUs:
2024-07-25T11:16:04.260640641Z   -  [2330:10de] (0) NVIDIA H100 80GB HBM3 (H100 80GB) [current utilization: 0%]

The right approach is probably to list the profiles, but I can’t do that because I don’t have a way to pass docker arguments. I tried guessing some profiles and setting them with the NIM_MODEL_PROFILES env variable (see here), but that didn’t work.

Can someone recommend what profile to pass? Or what other GPU to use if H100 SXMs aren’t supported (which would be odd).
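For readers who can pass docker arguments, here is a minimal sketch of listing the profiles a NIM image considers compatible with the local hardware. It assumes the image ships the list-model-profiles utility described in the NIM documentation and that NGC_API_KEY is exported in the calling shell; profiles_cmd is a hypothetical helper name, not part of NIM.

```shell
# Hypothetical helper: prints the docker command that lists profiles for image $1.
# Assumes the image provides the list-model-profiles utility (check the NIM docs
# for your image version).
profiles_cmd() {
  printf 'docker run --rm --runtime=nvidia --gpus all -e NGC_API_KEY %s list-model-profiles\n' "$1"
}

# Run the command only when docker is installed and IMG_NAME is set.
# "-e NGC_API_KEY" without a value forwards the variable from the host
# environment, so it must be exported there.
if command -v docker >/dev/null 2>&1 && [ -n "${IMG_NAME:-}" ]; then
  sh -c "$(profiles_cmd "$IMG_NAME")"
fi
```

If no profile is reported as compatible, guessing a profile ID is unlikely to help, since the container has already decided the detected GPUs cannot run any of them.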

Hi @ronanmcgovern, at the moment we don’t support deploying llama-3.1-70b-instruct on one or two H100s with NIM; the minimum is 4.

You can see what’s supported on this page: Support Matrix - NVIDIA Docs
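As a rough pre-flight check, the GPU count can be compared against that minimum before pulling the image. This is only a sketch: has_enough_gpus is a hypothetical helper, the threshold of 4 comes from this thread and applies to H100s, and the script counts GPUs without checking their model. The support matrix remains the authority for other hardware.

```shell
# Hypothetical helper: succeeds when the GPU count ($1) meets the minimum ($2).
has_enough_gpus() {
  [ "$1" -ge "$2" ]
}

# Only query the hardware when nvidia-smi is available on this machine.
if command -v nvidia-smi >/dev/null 2>&1; then
  # "nvidia-smi --list-gpus" prints one line per GPU; tr strips any padding from wc.
  gpu_count=$(nvidia-smi --list-gpus | wc -l | tr -d ' ')
  if has_enough_gpus "$gpu_count" 4; then
    echo "Found $gpu_count GPU(s): meets the 4xH100 minimum mentioned above."
  else
    echo "Found $gpu_count GPU(s): below the 4xH100 minimum mentioned above."
  fi
fi
```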

Ah shucks, yeah I looked at that page but didn’t appreciate that those are minimum numbers (makes sense).

Thanks Neal, just got a 4xH100 SXM running (and now it sees valid profiles) but hit this bug:

2024-07-25T16:21:08.921579444Z Error: Failed to initialize the TMA descriptor 1
2024-07-25T16:21:08.921582829Z TMA Desc Addr:   0x7f142535c2c0
2024-07-25T16:21:08.921586105Z format         9
2024-07-25T16:21:08.921589533Z dim            3
2024-07-25T16:21:08.921592714Z gmem_address   0
2024-07-25T16:21:08.921596218Z globalDim      (7168,1,1,1,1)
2024-07-25T16:21:08.921599419Z globalStrides  (2,0,0,0,0)
2024-07-25T16:21:08.921602459Z boxDim         (32,64,1,1,1)
2024-07-25T16:21:08.921606208Z elementStrides (1,1,1,1,1)
2024-07-25T16:21:08.921609518Z interleave     0
2024-07-25T16:21:08.921612924Z swizzle        2
2024-07-25T16:21:08.921615893Z l2Promotion    2
2024-07-25T16:21:08.921619688Z oobFill        0

Also, do you know whether there will be a 405B NIM out soon? I don’t see it just yet.

Tagging my friend @sagar.desai, also at Nvidia.

Hey @ronanmcgovern, that’s a new one to me. Any chance you can upload the full logs and any other environment info you can share?

For 405B, it should be out in a few days; we caught a last-minute bug that needed fixing.

Hi Neal, the issue solved itself and the 8B and 70B templates ran the next day (70B needs at least 4X H100 to run). Thanks for the response. BTW, you might find this video I made of interest - https://youtu.be/I0ccoL80h9Y

Hi @neal.vaidya,

I get the same error after launching NIM according to the instructions.
Specifically, I launch it with:

docker run -it --rm --name=$CONTAINER_NAME \
  --runtime=nvidia \
  --gpus all \
  --shm-size=16GB \
  -e NGC_API_KEY=$NGC_API_KEY \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -u $(id -u) \
  -p 8000:8000 \
  $IMG_NAME

and get a long list of errors like these:

TMA Desc Addr:   0x7fd0667502c0
format         9
dim            3
gmem_address   0
globalDim      (4096,1,1,1,1)
globalStrides  (2,0,0,0,0)
boxDim         (32,64,1,1,1)
elementStrides (1,1,1,1,1)
interleave     0
swizzle        2
l2Promotion    2
oobFill        0
Error: Failed to initialize the TMA descriptor 1
TMA Desc Addr:   0x7fd0667502c0
format         9
dim            3
gmem_address   0
globalDim      (14336,1,1,1,1)
globalStrides  (2,0,0,0,0)
boxDim         (32,64,1,1,1)
elementStrides (1,1,1,1,1)
interleave     0
swizzle        2
l2Promotion    2
oobFill        0
Error: Failed to initialize the TMA descriptor 1
TMA Desc Addr:   0x7fd0667502c0
format         9
dim            3
gmem_address   0
globalDim      (14336,1,1,1,1)
globalStrides  (2,0,0,0,0)
boxDim         (32,64,1,1,1)
elementStrides (1,1,1,1,1)
interleave     0
swizzle        2
l2Promotion    2
oobFill        0
Error: Failed to initialize the TMA descriptor 1
TMA Desc Addr:   0x7fd0667502c0
format         9
dim            3
gmem_address   0
globalDim      (4096,1,1,1,1)
globalStrides  (2,0,0,0,0)
boxDim         (32,64,1,1,1)
elementStrides (1,1,1,1,1)
interleave     0
swizzle        2
l2Promotion    2
oobFill        0
Error: Failed to initialize the TMA descriptor 1
TMA Desc Addr:   0x7fd0667502c0
format         9
dim            3
gmem_address   0
globalDim      (14336,1,1,1,1)
globalStrides  (2,0,0,0,0)
boxDim         (32,64,1,1,1)
elementStrides (1,1,1,1,1)
interleave     0
swizzle        2
l2Promotion    2
oobFill        0
Error: Failed to initialize the TMA descriptor 1

My image name is nvcr.io/nim/meta/llama-3.1-8b-base:1.1.2 and I’m launching it on 1xH100.
Could you please help here?

Hi @aktsvigun – if you’re able to update your driver to version 550+, that should resolve this issue.
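To confirm which driver is installed before and after updating, something like the following sketch may help. driver_at_least is a hypothetical helper, and the 550 threshold is taken from the reply above rather than from any official compatibility table.

```shell
# Hypothetical helper: succeeds when the major part of a driver version
# ($1, e.g. 550.54.15) is at least the required major version ($2).
driver_at_least() {
  major=${1%%.*}   # keep everything before the first dot
  [ "$major" -ge "$2" ]
}

# Only query the driver when nvidia-smi is available on this machine.
if command -v nvidia-smi >/dev/null 2>&1; then
  # One version line is printed per GPU; the first is enough.
  version=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
  if driver_at_least "$version" 550; then
    echo "Driver $version is 550 or newer."
  else
    echo "Driver $version is older than 550; consider updating."
  fi
fi
```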