I am trying to debug an OOM issue caused by some inputs being too long (too many characters/tokens). I am already using a batch size of 1.
My original understanding was that GPU memory usage is dominated by the number of input tokens, so I set an upper bound on the input token count.
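For context, enforcing the bound looks roughly like this (a minimal sketch; `truncate_ids` and `MAX_TOKENS` are illustrative names, not my actual pipeline):

```python
MAX_TOKENS = 6500  # the lower of the two observed limits, to be safe everywhere

def truncate_ids(token_ids, max_tokens=MAX_TOKENS):
    """Keep only the first max_tokens token IDs before the forward pass."""
    return token_ids[:max_tokens]

ids = list(range(8000))        # stand-in for a tokenized long input
print(len(truncate_ids(ids)))  # 6500
```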
I then tested this on two types of VMs. Both have the same type of GPU: a V100 with 16 GB. But the same token limit worked on one VM and not on the other. For example, on one machine 8000 tokens was a safe limit (no OOM), while on the other only 6500 tokens was. They were processing the same input.
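Just to quantify the gap: if activation memory scales roughly linearly with sequence length, the two limits imply a meaningful difference in usable memory (simple arithmetic on the numbers above, not a measurement):

```python
limit_a = 8000  # safe token limit on the first machine (no OOM)
limit_b = 6500  # safe token limit on the second machine

ratio = limit_a / limit_b
print(f"linear-memory case: ~{ratio:.2f}x difference in usable memory")
# If the dominant cost were quadratic in sequence length (full attention),
# the implied gap would be even larger:
print(f"quadratic case: ~{ratio ** 2:.2f}x")
```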
On a second look, there is a difference between the GPUs: one is a Tesla V100-PCIE (250 W), the other a Tesla V100-SXM2 (300 W).
The driver versions also differ: 525.147.05 on the first machine and 525.105.17 on the second. The CUDA versions are the same.
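To compare the two machines directly, I can dump what each GPU actually reports; differences in ECC mode or driver memory reservations would show up in `memory.total` / `memory.free`. A sketch (guarded so it degrades gracefully on a machine without `nvidia-smi`):

```shell
# Run on both machines and diff the output.
if command -v nvidia-smi >/dev/null 2>&1; then
  GPU_INFO="$(nvidia-smi --query-gpu=name,driver_version,memory.total,memory.free,ecc.mode.current --format=csv)"
else
  GPU_INFO="nvidia-smi not found on this machine"
fi
echo "$GPU_INFO"
```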
What could be the reason for this significant difference in token limits? Could it be the GPU variant, or the driver version? Thanks!