Can different versions of the V100 hold different numbers of tokens?


I am trying to deal with an OOM issue. It is caused by some of the inputs being too long (too many characters or tokens). I am already using a batch size of 1.

My original understanding was that the usage of GPU memory was dominated by the number of tokens in the input. So I set an upper bound on the number of input tokens.
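
For illustration, the upper bound can be sketched as below. This is a minimal sketch, assuming the input is already tokenized into a list of integer IDs; `MAX_TOKENS` and `truncate_tokens` are hypothetical names, not part of any library.

```python
from typing import List

# Hypothetical limit; the right value depends on the machine and the model.
MAX_TOKENS = 8000


def truncate_tokens(token_ids: List[int], max_tokens: int = MAX_TOKENS) -> List[int]:
    """Return at most `max_tokens` tokens from the start of the input."""
    if max_tokens < 0:
        raise ValueError("max_tokens must be non-negative")
    # Slicing never raises when the input is shorter than the limit.
    return token_ids[:max_tokens]
```

Inputs shorter than the limit pass through unchanged; longer inputs are cut at the limit.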

I then tested this on two types of virtual machines. Both have the same type of GPU: V100, 16 GB. But the same token limit worked on one VM and not on the other. For example, on one machine 8000 tokens worked as a limit (no OOM), while on the other machine 6500 tokens was the limit that avoided OOM. Both were using the same input.
After a second look, it seems there is a difference between the GPUs.
One is a Tesla V100-PCIE, with 250 W. The other is a Tesla V100-SXM2, with 300 W.
The driver versions differ as well: the first one is 525.147.05 and the second one is 525.105.17. The CUDA versions are the same.
What could be the reason for this significant difference in token limits? Could it be the GPU variant, or is it the driver version? Thanks!
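
One way to collect comparable details from both VMs is to record what `nvidia-smi` reports before the model is loaded, since any memory already in use at that point reduces what is left for the input. Below is a minimal sketch, assuming `nvidia-smi` is on the PATH; the helper names `parse_gpu_csv` and `query_gpus` are made up for illustration.

```python
import subprocess
from typing import Dict, List

# Fields requested from nvidia-smi; memory values are reported in MiB.
QUERY_FIELDS = ["name", "driver_version", "memory.total", "memory.used", "memory.free"]


def parse_gpu_csv(text: str) -> List[Dict[str, str]]:
    """Parse `nvidia-smi --format=csv,noheader,nounits` output, one dict per GPU."""
    rows = []
    for line in text.strip().splitlines():
        values = [v.strip() for v in line.split(",")]
        if len(values) != len(QUERY_FIELDS):
            continue  # skip lines that do not match the requested fields
        rows.append(dict(zip(QUERY_FIELDS, values)))
    return rows


def query_gpus() -> List[Dict[str, str]]:
    """Run nvidia-smi and return the parsed rows (requires an NVIDIA driver)."""
    result = subprocess.run(
        [
            "nvidia-smi",
            "--query-gpu=" + ",".join(QUERY_FIELDS),
            "--format=csv,noheader,nounits",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_csv(result.stdout)
```

Calling `query_gpus()` on each VM and comparing the `memory.total` and `memory.free` values would show whether the two machines start with the same amount of usable memory.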

Hi @ud_heller1989,
This forum covers issues related to TensorRT. However, sharing more details and the detailed logs may help us understand the issue better.


Thanks, @AakankshaS! Is there a better place where I could post this question? For more details: I am using Ubuntu 22.04 VMs. To summarize the question: with the same code, the same data, and the same GPU model (though apparently not the same variant; there are minor differences), each GPU seems to handle a different maximum number of tokens. I want to know the reason behind this. Thanks!