I am currently trying out a Tesla K40 through the NVIDIA trial program, but the performance is not what I expected. The program I am testing consists of a kernel and a sparse matrix solver (CG) from the CUSP library. The kernel performs as expected: it is about 8x faster than on the GeForce GT 650M in my Retina MacBook Pro. However, the sparse matrix solver part (CUSP library) is 1.5x slower than on my laptop GPU.
I’ve tried turning off ECC and enabling GPU Boost, but it is still slower. I’ve checked the global memory bandwidth and the results seem fine (the Tesla K40's bandwidth is 6x that of the GT 650M). I’ve timed the CUSP solver using both omp_get_wtime and CUDA events, and both gave the same result.
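For reference, here is a minimal sketch of the kind of CUDA event timing described above. It is not my actual program: the Poisson test matrix, the grid size, the iteration limit, and the tolerance are all placeholder assumptions, and it assumes a CUSP release that provides `cusp::monitor` (older releases use `cusp::default_monitor` instead). A short warm-up solve runs first so that one-time setup cost is not counted in the timed solve.

```cuda
#include <cusp/csr_matrix.h>
#include <cusp/array1d.h>
#include <cusp/monitor.h>
#include <cusp/krylov/cg.h>
#include <cusp/gallery/poisson.h>
#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    typedef cusp::device_memory Mem;

    // Placeholder problem (assumption): 5-point Poisson matrix on a 512x512 grid.
    cusp::csr_matrix<int, float, Mem> A;
    cusp::gallery::poisson5pt(A, 512, 512);

    cusp::array1d<float, Mem> b(A.num_rows, 1.0f);

    // Warm-up solve so context creation and first-launch overhead are excluded.
    {
        cusp::array1d<float, Mem> x_warm(A.num_rows, 0.0f);
        cusp::monitor<float> warm_monitor(b, 10, 1e-6f);
        cusp::krylov::cg(A, x_warm, b, warm_monitor);
    }

    // Timed solve, starting again from a zero initial guess.
    cusp::array1d<float, Mem> x(A.num_rows, 0.0f);
    cusp::monitor<float> monitor(b, 1000, 1e-6f);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cusp::krylov::cg(A, x, b, monitor);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until all solver work has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // Report total time and time per CG iteration.
    const size_t iters = monitor.iteration_count();
    std::printf("converged: %s, iterations: %lu\n",
                monitor.converged() ? "yes" : "no", (unsigned long)iters);
    std::printf("total: %.3f ms, per iteration: %.4f ms\n",
                ms, iters ? ms / iters : 0.0f);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```

Running the same sketch on both GPUs and comparing the per-iteration time would at least give a like-for-like number for the solver part alone.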