A GPU is a programmable processor, so it has the flexibility to handle many different kinds of computational tasks. That is about as qualitative as making the same statement about a CPU.
It’s evident that GPUs are well suited to both neural network training and inference. Simply survey the state of the art in either of these disciplines.
Considering just a single GPU at work, the biggest single factor for capability is probably memory size. The size of your neural network (the number of weights and biases, i.e. “parameters”) and the size of your data batches, compared to your GPU memory size and allowing for overhead, will be the most proximal indicator of what your GPU is “capable” of. Indeed, you can find many posts from people who have run out of memory on their GPU when trying to run various NN codes. A common piece of advice is to either reduce the batch size or get a GPU with more memory. Very large models, such as large language models with hundreds of billions of parameters, may not fit on a single GPU and may require specialized methods to distribute the work across multiple GPUs.
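The "reduce the batch size" advice can be sketched as a simple trial-and-error loop. This is a minimal, framework-agnostic sketch: `find_workable_batch_size` and `try_batch` are hypothetical names, and the built-in `MemoryError` stands in for whatever out-of-memory exception your framework actually raises.

```python
def find_workable_batch_size(try_batch, start, oom_errors=(MemoryError,)):
    """Halve the batch size until try_batch(batch) runs without an OOM error.

    try_batch  -- hypothetical callable that runs one step at the given batch size
    start      -- batch size to try first
    oom_errors -- exception types treated as "out of memory" (an assumption;
                  substitute your framework's own out-of-memory exception)
    Returns the first batch size that worked, or None if even 1 failed.
    """
    batch = start
    while batch >= 1:
        try:
            try_batch(batch)
            return batch
        except oom_errors:
            batch //= 2  # back off and retry with half the batch
    return None


# Stand-in for a real training step: pretend anything above 48 samples
# does not fit in GPU memory.
def fake_step(batch):
    if batch > 48:
        raise MemoryError("simulated GPU out-of-memory")


print(find_workable_batch_size(fake_step, start=256))
```

Halving is just one back-off policy. The batch size it finds is workable, not necessarily the largest that fits.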
I won’t be able to give you detailed recipes, calculations, spreadsheets, or calculators to go from an abstract discussion of model parameters and data batch sizes to GPU memory consumption. The current methodology here is strongly biased towards trial and error. Nevertheless, some crude statements can be made, such as the one above about models with many billions or trillions of parameters. Likewise, if your smallest dataset batch size is 8GB, it’s unlikely to be workable on a 4GB GPU.
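As one example of such a crude statement, the memory needed just to hold the parameters can be bounded from below. The sketch below is a rough screen, not a calculator: it ignores activations, gradients, optimizer state, and data batches, and the function names and the 20% overhead allowance are arbitrary assumptions for illustration.

```python
def weights_gib(num_params, bytes_per_param=2):
    """GiB needed to store the parameters alone (2 bytes each for FP16)."""
    return num_params * bytes_per_param / 2**30


def weights_could_fit(num_params, gpu_mem_gib, bytes_per_param=2,
                      overhead_fraction=0.2):
    """Crude screen: do the weights alone fit after reserving some overhead?

    overhead_fraction is an assumed placeholder, not a measured value.
    False means "almost certainly will not fit on one GPU".
    True only means "not ruled out"; real usage will be higher.
    """
    usable_gib = gpu_mem_gib * (1 - overhead_fraction)
    return weights_gib(num_params, bytes_per_param) <= usable_gib


# 175 billion FP16 parameters against a hypothetical 80 GiB GPU
print(weights_gib(175_000_000_000))
print(weights_could_fit(175_000_000_000, gpu_mem_gib=80))
```

A "could fit" answer here still has to be confirmed by actually running the code.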
Performance is a separate issue from capability. More powerful GPUs will generally run the same workload faster.
Mainstream neural network calculations are dominated by the matrix-matrix multiply operation. NVIDIA developed the tensor core (TC) unit in large part to assist with this. When doing neural network calculations on an NVIDIA GPU, use of tensor cores should be considered the “fast path”. You can find detailed tensor core calculations here. However, these should be considered only a rough guide to what to expect, performance-wise. If your code is making use of tensor cores (i.e. it is doing the layer-wise matrix-matrix multiply operations using a suitable type like FP16), then a GPU with more tensor core throughput will likely run that code faster.
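To make the "rough guide" idea concrete, here is a minimal sketch of the arithmetic. It uses the standard count of 2·M·N·K floating-point operations for multiplying an M×K matrix by a K×N matrix (one multiply and one add per term). The function names and the peak throughput figures are made up for illustration. Dividing by peak throughput gives a lower bound on time, not a prediction.

```python
def matmul_flops(m, n, k):
    """Floating-point operations for an (m x k) @ (k x n) matrix multiply."""
    return 2 * m * n * k  # one multiply + one add per inner-product term


def ideal_matmul_seconds(m, n, k, peak_tflops):
    """Best-case time if the GPU sustained its peak tensor core throughput.

    peak_tflops is in units of 1e12 floating-point operations per second.
    Real code will be slower; treat this as a floor, not an estimate.
    """
    return matmul_flops(m, n, k) / (peak_tflops * 1e12)


# Two hypothetical GPUs with made-up peak FP16 tensor core throughput
for name, tflops in [("GPU A", 100.0), ("GPU B", 200.0)]:
    t = ideal_matmul_seconds(8192, 8192, 8192, tflops)
    print(f"{name}: at least {t * 1e3:.3f} ms per 8192^3 matmul")
```

Under this model, doubling tensor core throughput halves the floor. How close real code gets to that floor depends on many other factors.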
Regarding TOPs vs. TFLOPs: when a TC unit is computing using floating-point arithmetic, the throughput is generally quoted in TFLOPS (trillions of floating-point operations per second). When a TC unit is computing using integer arithmetic, the throughput is generally quoted in TOPS (trillions of operations per second).