TK1 vs GeForce 680

I just got a Tegra K1 dev kit. I’m a researcher and do a lot of DSP acceleration on GPUs. I’ve been using a GeForce 680 for years, and compared to the 680 the TK1’s run time is a lot slower, which I find surprising. I’d like to know if any of you are seeing a similar performance drop. For example, I run cuFFT (about a 4000-point FFT): on the GeForce 680 it takes 0.2 ms and on the TK1 it takes about 10 ms. I’m using identical code on both machines; the only difference is that I compile for the sm_32 arch on the TK1, as required. In general, different kernels run at least 10 times slower than on the 680. Both are built on Kepler, and yes, the TK1 only has 1 SMX, but I was hoping that because it’s an embedded device it might run comparably to the 680, if not slightly faster. In addition, I’m not running a large number of threads here, around 4000, and there is no register spilling, etc. Let me know what you’re getting. I’m under a time crunch due to a paper deadline and would really like to include TK1 results, but if it runs this slow I’m not so sure…

Also, the run time seems to be inconsistent. For example, on the 680 it runs in 2 ms +/- 0.1 ms, but on the TK1 it fluctuates a lot, between say 10 ms and 50 ms.
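One way to make numbers like these comparable is to repeat the measurement many times, discard the first few runs, and report the median and spread. The sketch below assumes the per-run times have already been collected (for example with CUDA event timers). `summarize_runs`, the `warmup` parameter, and the sample values are hypothetical, and whether warm-up effects explain the fluctuation on the TK1 is only a guess.

```python
import statistics

def summarize_runs(times_ms, warmup=3):
    """Drop the first `warmup` runs, then report median, min and max in ms."""
    steady = times_ms[warmup:]
    if not steady:
        raise ValueError("need more runs than warm-up iterations")
    return {
        "median_ms": statistics.median(steady),
        "min_ms": min(steady),
        "max_ms": max(steady),
        "runs": len(steady),
    }

# Made-up sample values for illustration only, not measurements.
sample_ms = [50.0, 35.0, 20.0, 10.5, 10.0, 11.0, 10.2, 12.0]
summary = summarize_runs(sample_ms)
print(summary)
```

Reporting the median rather than a single run keeps one slow outlier from dominating the comparison between the two devices.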

Doesn’t 10x sound about right? TK1 vs. 680:

  • GFLOPS: 300 vs. 3000
  • GB/sec: 14 vs. 192
  • cores: 192 vs. 1536
  • registers: 2^15 vs. 2^19 = 16x
  • shared mem (KB): 48 vs. 384
  • Watts: <10 vs. <225
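The ratios behind the list above can be checked with a few lines of arithmetic. This is only a sketch using the rounded figures quoted in the list; the dictionary keys and variable names are made up for illustration, and the wattage row treats the "<10" and "<225" bounds as plain numbers.

```python
# Rounded figures from the list above: (TK1, 680).
specs = {
    "GFLOPS": (300, 3000),
    "GB/sec": (14, 192),
    "cores": (192, 1536),
    "registers": (2**15, 2**19),
    "shared mem (KB)": (48, 384),
    "Watts (upper bound)": (10, 225),
}

# Ratio of the 680 figure to the TK1 figure for each row.
ratios = {name: big / small for name, (small, big) in specs.items()}

for name, ratio in ratios.items():
    print(f"{name:>20}: {ratio:5.1f}x")
```

Every row comes out between 8x and roughly 22x, so a 10x slowdown sits inside that range.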

Pretty sure that the K1’s sm_32 SMX has 32K registers and not the standard Kepler/Maxwell 64K registers.

Since it seems you’re interested in seeing how your work scales across the GPUs you might also want to try to get a $45 GT630/635 with a GK208 chip. It has even less device mem bandwidth than the TK1 (!) at 14.4 GB/sec. But it has 2 full sm_35 SMXs for 384 cores. It’s really cool to compare CUDA code on the 680 (which remains an utter beast of a card), the GK208 and a 750 Ti Maxwell. Can’t wait to add the TK1 to that list.

Just comparing the power consumption: 200W for the 680 vs. <10W (estimated) for the TK1 is at least a 20x power reduction. What you gain with the TK1 is portability, efficiency and cost, not performance…
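Put as performance per watt, the trade looks like this. The snippet is a rough sketch that combines the approximate GFLOPS figures from the earlier list with the power numbers above, treating "<10W" as 10 W; the names are illustrative only.

```python
# Approximate figures from the thread: (GFLOPS, watts).
tk1 = (300, 10)       # "<10W (estimated)" treated as 10 W
gtx680 = (3000, 200)

def gflops_per_watt(gflops, watts):
    """Peak throughput divided by power draw."""
    return gflops / watts

tk1_eff = gflops_per_watt(*tk1)
gtx680_eff = gflops_per_watt(*gtx680)
advantage = tk1_eff / gtx680_eff

print(f"TK1: {tk1_eff:.0f} GFLOPS/W, 680: {gtx680_eff:.0f} GFLOPS/W")
print(f"efficiency advantage: {advantage:.1f}x")
```

With these rounded numbers the TK1 gives up about 10x in throughput but comes out about 2x ahead per watt.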

I understand your disappointment though. You have to ask yourself: is the application you’re looking to use this on power-, noise- or size-constrained? If not, then stick with the 680.

Jon

Yes, great info. I did more background reading on the TK1 and yeah, it’s a different animal. I’ll keep playing around with it and see what happens. CUDA 6.0 has a lot of new features, so I have to learn that stuff as well; it’s a bit of a step up from 5.5. I’ve used older 260, 460 and 680 cards and CUDA over the years. The TK1 runs well out of the box, but make sure to connect to Ethernet when you install the toolkit. It also runs pretty slowly overall.

What I’ve noticed now is that when I include the memory transfer in my run time measurement (using the CUDA timer, that is), the run time doesn’t seem to change compared to measuring without the transfer. I assume that’s because there’s no PCI Express and it’s all on chip, so that latency may be negligible. That’s encouraging news.
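A quick back-of-envelope estimate suggests another possible reason the transfer does not show up: the data set is small. The sketch below assumes single-precision complex samples (8 bytes per point) and uses the roughly 4000-point FFT and the 14 GB/sec memory bandwidth figure quoted earlier in the thread. The variable names are just for illustration, and real copies carry fixed per-call overhead that this ignores.

```python
# Assumption: single-precision complex input, 8 bytes per point.
points = 4000
bytes_per_point = 8
bandwidth_bytes_per_s = 14e9   # 14 GB/sec figure quoted for the TK1
kernel_ms = 10.0               # approximate TK1 run time reported earlier

buffer_bytes = points * bytes_per_point
# Ideal time to move the buffer once at full bandwidth, in microseconds.
transfer_us = buffer_bytes / bandwidth_bytes_per_s * 1e6
# Fraction of the reported run time that the ideal transfer would take.
share = transfer_us / (kernel_ms * 1000.0)

print(f"buffer: {buffer_bytes} bytes, ideal transfer: {transfer_us:.2f} us")
print(f"share of a {kernel_ms:.0f} ms run: {share:.4%}")
```

Under these assumptions the ideal copy takes a couple of microseconds, far below the resolution at which a 10 ms run time would visibly change.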

I’d like to hear more feedback from others regarding desktop GPU vs. mobile GPU performance: speedup or slowdown, run time measurements, etc. I’ll share more as I go along as well. Keep ’em coming. Thanks.