Laptop GPU choice

I’m considering getting a laptop with a CUDA-enabled GPU to use as a training tool for myself, and could use some advice on choosing the GPU. Not long ago, I understood the key GPU factors to be the number of cores and the memory bus width. Is this still the case? I’d think that every generation would be “better” than the previous one, but I’m confused by the specs I see. For example, the RTX 3070 has 5120 cores and a 256-bit memory bus, but the newer-generation RTX 4070 has only 4608 cores and a 128-bit memory bus. Yet a GPU user benchmark site shows the 4070 outperforming the 3070 (in gaming). Does gaming performance correlate with CUDA computational performance, or should I expect the older 3070 to outperform the 4070 for CUDA computations?

mobile 3070 vs 4070

I would say that judging a GPU by core count and memory bus width alone is a misunderstanding, because it leaves out clock frequency and architecture. If you want to compare based on just a couple of performance metrics, you would want to look at FP32 TFLOPS and memory bandwidth. Citing from the TechPowerUp database (these are non-mobile numbers, I would think):

RTX 3070: 20.31 FP32 TFLOPS, 448.0 GB/sec
RTX 4070: 29.15 FP32 TFLOPS, 504.2 GB/sec

So depending on whether the code in question is compute bound or memory bound, one might expect a speed-up of 1.1x to 1.5x when moving from an RTX 3070 to an RTX 4070. Keep in mind that this is more of a trend indicator than an exact prediction. To make sure the code performs as desired, you would want to benchmark it on prospective target platforms.
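For what it is worth, the 1.1x to 1.5x range is just the ratio of the two peak figures quoted above. A minimal sketch of that arithmetic, assuming the TechPowerUp numbers are right (the dictionary layout and the `speedup_bounds` name are made up for illustration):

```python
# Peak figures quoted above from the TechPowerUp database (desktop parts).
SPECS = {
    "RTX 3070": {"fp32_tflops": 20.31, "bandwidth_gbs": 448.0},
    "RTX 4070": {"fp32_tflops": 29.15, "bandwidth_gbs": 504.2},
}

def speedup_bounds(old, new):
    """Return (memory-bound ratio, compute-bound ratio) of the peak specs."""
    memory = new["bandwidth_gbs"] / old["bandwidth_gbs"]
    compute = new["fp32_tflops"] / old["fp32_tflops"]
    return memory, compute

mem_ratio, compute_ratio = speedup_bounds(SPECS["RTX 3070"], SPECS["RTX 4070"])
print(f"memory bound:  {mem_ratio:.2f}x")
print(f"compute bound: {compute_ratio:.2f}x")
```

This works out to roughly 1.13x for purely memory-bound code and 1.44x for purely compute-bound code. Real kernels rarely reach either peak, so treat these as rough bounds only.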

Different games have different performance characteristics and therefore scale by different factors when moving between GPUs, and the same applies when comparing the scaling of games with that of compute apps. Games may also make use of features like ray tracing that are irrelevant to compute applications, and vice versa. I would say that for a very rough idea of relative performance between GPUs (with very large uncertainty bands to either side), looking at game performance might work if there is no other data available.

Thanks for your thoughtful reply! I didn’t realize the TechPowerUp database had TFLOPS information. Interestingly, the 3070 vs 4070 comparison looks quite different using the mobile (laptop) versions:

RTX 3070 Mobile: 15.97 FP32 TFLOPS, 448.0 GB/sec, 160 tensor cores
RTX 4070 Mobile: 15.62 FP32 TFLOPS, 256.0 GB/sec, 144 tensor cores

Unless the data are wrong, it looks like the 3070 is the better choice, especially for memory-bound computations.
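One way to sanity-check that conclusion is a simple roofline estimate: attainable throughput is the smaller of the compute peak and bandwidth times arithmetic intensity. This model is my own addition, not something from the replies above, and the intensity values below are hypothetical. A rough sketch using the mobile figures just quoted:

```python
# Mobile figures quoted above from the TechPowerUp database.
MOBILE = {
    "RTX 3070 Mobile": (15.97, 448.0),  # (FP32 TFLOPS, GB/sec)
    "RTX 4070 Mobile": (15.62, 256.0),
}

def attainable_tflops(peak_tflops, bandwidth_gbs, flops_per_byte):
    """Simple roofline: min(compute peak, bandwidth * arithmetic intensity)."""
    memory_limited = bandwidth_gbs * flops_per_byte / 1000.0  # GFLOPS -> TFLOPS
    return min(peak_tflops, memory_limited)

# Hypothetical arithmetic intensities, in FLOP per byte of memory traffic.
for intensity in (1, 10, 100):
    row = {
        name: round(attainable_tflops(peak, bw, intensity), 2)
        for name, (peak, bw) in MOBILE.items()
    }
    print(f"{intensity:>3} FLOP/byte: {row}")
```

With these numbers the 3070 Mobile comes out ahead at every intensity, by a wide margin at low intensity and only marginally once both parts hit their compute peak. It ignores caches, clocks under thermal limits, and architectural differences, so it is an indicator at best.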

From what I understand, the TechPowerUp database is crowd-sourced and a volunteer effort, like Wikipedia. Errors are possible. I never deal with laptop GPUs, but I have found the TechPowerUp data largely accurate for desktop GPUs (consumer and professional).

In this case, NVIDIA’s “spec” pages are particularly unhelpful for non-gamers:

Maybe the marketing department did not have their thinking caps on when they created those pages, or the pages were written this way on purpose to impress potential buyers with “bling” instead of “speeds & feeds”. You decide.

The best approach is always to benchmark the applications whose performance you care about. Maybe friends or colleagues can help you with getting access to relevant systems. I am not aware of laptop configurations being available in the cloud, but I am not a cloud expert.

[Later:]

The memory throughput data in the TechPowerUp database is consistent with the other memory specifications, so there does not seem to be an obvious mistake:

RTX 3070 mobile
-------------------------
Memory Clock    1750 MHz 
Memory Type     GDDR6
Memory Bus      256 bit 
Bandwidth       448.0 GB/s 

RTX 4070 mobile
-------------------------
Memory Clock    2000 MHz 
Memory Type     GDDR6
Memory Bus      128 bit 
Bandwidth       256.0 GB/s 
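The consistency check can be reproduced in a few lines. This sketch assumes that the listed GDDR6 memory clock corresponds to 8 transfers per pin per clock (so 1750 MHz means 14 Gbps per pin); that factor and the function name are my assumptions, not something stated in the database entries above:

```python
def gddr6_bandwidth_gbs(memory_clock_mhz, bus_width_bits, transfers_per_clock=8):
    """Peak bandwidth in GB/s from the listed memory clock and bus width.

    transfers_per_clock=8 is an assumption about how the listed GDDR6
    clock maps to the effective per-pin data rate.
    """
    bytes_per_transfer = bus_width_bits / 8
    transfers_per_second = memory_clock_mhz * 1e6 * transfers_per_clock
    return transfers_per_second * bytes_per_transfer / 1e9

print("RTX 3070 mobile:", gddr6_bandwidth_gbs(1750, 256), "GB/s")
print("RTX 4070 mobile:", gddr6_bandwidth_gbs(2000, 128), "GB/s")
```

Under that assumption the results match the listed 448.0 GB/s and 256.0 GB/s: the 4070 mobile's faster memory clock does not make up for the bus being half as wide.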

I also see complaints floating around the internet such as “RTX 4000 mobile series bottlenecked by poor memory bandwidth”, which would seem consistent with the above data.