NVLink VRAM Memory Pool

Say we have a model that's around 20 GB and we have built a system with 2x RTX 3090s, which is on the bible's top-3 Perf/$ graph. Without NVLink we would have to load the same model twice, once on each card, and use a tiny batch size, which to me sounds wasteful.
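
For concreteness, here's a rough PyTorch sketch of the no-NVLink setup I'm describing, with a tiny placeholder model standing in for the real ~20 GB one and made-up batch/layer sizes:

```python
import torch
import torch.nn as nn

def build_model() -> nn.Module:
    # Tiny placeholder standing in for the real ~20 GB model.
    return nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096))

# One full copy of the weights per GPU: each card pays the whole model-memory
# cost, so a 20 GB model on a 24 GB card leaves only ~4 GB for activations.
replicas = {dev: build_model().half().to(dev).eval() for dev in ("cuda:0", "cuda:1")}

# Each replica serves its own slice of the batch independently, which is why
# the per-card batch size ends up tiny.
batch = torch.randn(8, 4096, dtype=torch.float16)
with torch.no_grad():
    out0 = replicas["cuda:0"](batch[:4].to("cuda:0"))
    out1 = replicas["cuda:1"](batch[4:].to("cuda:1"))
```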

Hence the questions boil down to:

  1. If we used NVLink, would we be able to pool the 2x 24 GB of memory of the two cards and maximize the inference batch size?
  2. If not, it'd be wiser to pick a single card with more VRAM. Is this point of view fundamentally wrong? Would pooling be a version of model parallelism, since we'd have only one instance of the model?
  3. I guess it wouldn't be, strictly speaking, because the model could potentially fit on one GPU; but since my approach is to pool the global memory of both GPUs, it's the same idea as model parallelism (see the sketch below).
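
Here's the kind of single-instance split I have in mind, again a rough PyTorch sketch with placeholder layers and sizes rather than the actual model:

```python
import torch
import torch.nn as nn

class SplitModel(nn.Module):
    """One model instance split layer-wise across two GPUs (naive model parallelism)."""
    def __init__(self):
        super().__init__()
        # Placeholder halves standing in for the real model's layers.
        self.first_half = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU()).to("cuda:0")
        self.second_half = nn.Linear(4096, 4096).to("cuda:1")

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.first_half(x.to("cuda:0"))
        # Activations hop between GPUs here; NVLink (or PCIe) carries this copy.
        x = x.to("cuda:1")
        return self.second_half(x)

model = SplitModel().eval()
with torch.no_grad():
    # A larger batch fits because each card holds only part of the weights.
    out = model(torch.randn(64, 4096))
```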

Thank you so much :))

I'll be attaching some useful resources here by editing the post.

Hi @lemonsqueeze,
Apologies for the delay.
This forum covers issues related to TensorRT, so for your question I might not have the answer.