I'm running a parabolic PDE simulation code using CUDA on a GTX 285, and everything was fine until I got a GTX 470. I was expecting a significant improvement in performance, but that was not the case. I noticed that for the task of solving the linear systems at every time step, both GPUs have almost the same performance. I use finite elements, and the resulting matrix is sparse; since my mesh is structured, I store it in the DIAGONAL (DIA) format. I'm using single precision, and when I compare the elapsed time (ET) to solve the linear systems on the 470 and on the 285, the numbers are very close…
My question is: why is the performance of the two GPUs so close?
I started to think that this is because the GTX 285 has 30 MPs with 8 SPs each, whereas the GTX 470 has only 14 MPs with 32 SPs each, and somehow one thing "compensates" for the other. Is that true?
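For context, a DIA-format SpMV kernel typically looks like the simplified sketch below (the array names and the column-major diagonal layout are assumptions, not necessarily what my code does). Each matrix entry is loaded once and used in a single multiply-add.

```cuda
// Sketch of a DIA-format sparse matrix-vector product y = A*x, single precision.
// diag_data holds the stored diagonals column-major (num_rows x num_diags);
// offsets[d] is the offset of diagonal d from the main diagonal.
__global__ void spmv_dia(const float *diag_data,
                         const int   *offsets,
                         const float *x,
                         float       *y,
                         int num_rows, int num_cols, int num_diags)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= num_rows) return;

    float sum = 0.0f;
    for (int d = 0; d < num_diags; ++d) {
        int col = row + offsets[d];
        if (col >= 0 && col < num_cols)
            sum += diag_data[d * num_rows + row] * x[col];
    }
    // One global load per matrix entry and a single multiply-add each:
    // the run time is set by memory traffic, not by the ALUs.
    y[row] = sum;
}
```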
Sparse matrix operations are generally limited by memory bandwidth. Both GPUs have roughly the same memory bandwidth (about 159 GB/s on the GTX 285 versus about 134 GB/s on the GTX 470), so it isn't surprising that their performance is similar.
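One way to confirm this is to estimate the effective bandwidth of the SpMV kernel and compare it to those peaks. A rough sketch along these lines (the sizes and the timing below are made-up placeholders, and ignoring reuse of x is only an approximation):

```cuda
#include <cstdio>

// Rough effective-bandwidth estimate for a DIA-format SpMV in single precision.
// Ignores any caching/reuse of x, so treat the result as an estimate of the
// traffic rather than an exact figure.
static double spmv_dia_effective_bandwidth(long num_rows, int num_diags,
                                           double elapsed_seconds)
{
    double bytes_moved =
          (double)num_diags * num_rows * sizeof(float)   // matrix diagonals read
        + (double)num_diags * num_rows * sizeof(float)   // x values gathered (no reuse assumed)
        + (double)num_rows * sizeof(float);              // y written
    return bytes_moved / (elapsed_seconds * 1e9);        // GB/s
}

int main()
{
    // Placeholder numbers: 1M rows, 5 diagonals, 0.5 ms per SpMV call.
    double gbps = spmv_dia_effective_bandwidth(1000000, 5, 0.5e-3);
    printf("effective bandwidth: %.1f GB/s\n", gbps);
    // If this lands near ~159 GB/s on the GTX 285 or ~134 GB/s on the GTX 470,
    // the kernel is already saturating memory and more ALUs won't speed it up.
    return 0;
}
```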
What block size and how many registers per thread do you use? You may need to tweak the application a bit for best performance on the new card. Also check the shared memory/L1 cache configuration.
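On Fermi the shared/L1 split can be requested per kernel with cudaFuncSetCacheConfig; for a kernel that uses little or no shared memory, preferring the 48 KB L1 configuration is worth trying. A minimal sketch (the kernel here is just a stand-in for yours):

```cuda
#include <cstdio>

// Stand-in for the real SpMV kernel; only the cache-config call matters here.
__global__ void spmv_kernel_placeholder(const float *x, float *y, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = x[i];
}

int main()
{
    // On Fermi (GTX 470), request the 48 KB L1 / 16 KB shared split for this
    // kernel; it can help a memory-bound kernel that barely uses shared memory.
    cudaError_t err = cudaFuncSetCacheConfig(spmv_kernel_placeholder,
                                             cudaFuncCachePreferL1);
    if (err != cudaSuccess)
        fprintf(stderr, "cudaFuncSetCacheConfig failed: %s\n",
                cudaGetErrorString(err));
    return 0;
}
```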
My block size is 256, and I believe I'm using 8 registers per thread (I just compiled with --ptxas-options=-v and checked the output).
Well, the performance of the kernel is OK for me; I find it a little tricky to tune.
I was more worried about the performance of one GPU versus the other; it was difficult to accept, and I was also afraid there was something wrong in the kernel. Now I see that this kernel (the one that does the sparse matrix-vector multiplications) is memory bound. By the way, I also have another kernel that solves thousands of systems of ODEs per time step, and that one performs much better on the GTX 470 than on the GTX 285. I believe it is compute bound, since each thread is assigned one system: it reads some values from global memory, performs a lot (really, a lot) of mathematical operations, and finally writes out the data.
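For what it's worth, that kernel follows the usual one-thread-per-system structure, roughly like the sketch below (the right-hand side and the forward-Euler sub-stepping are placeholders, not my actual model):

```cuda
#define NVARS 4   // number of state variables per system (assumption)

// Placeholder right-hand side: some arithmetic-heavy coupling between variables.
__device__ void rhs(const float *y, float *dydt)
{
    for (int i = 0; i < NVARS; ++i)
        dydt[i] = -y[i] + 0.1f * y[(i + 1) % NVARS] * y[(i + 2) % NVARS];
}

__global__ void step_ode_systems(float *state, int num_systems,
                                 float dt, int substeps)
{
    int sys = blockIdx.x * blockDim.x + threadIdx.x;
    if (sys >= num_systems) return;

    // Read this system's state from global memory once...
    float y[NVARS], dydt[NVARS];
    for (int i = 0; i < NVARS; ++i)
        y[i] = state[sys * NVARS + i];

    // ...do a large amount of arithmetic in registers
    // (this is what makes the kernel compute bound)...
    for (int s = 0; s < substeps; ++s) {
        rhs(y, dydt);
        for (int i = 0; i < NVARS; ++i)
            y[i] += dt * dydt[i];
    }

    // ...and write the result back once.
    for (int i = 0; i < NVARS; ++i)
        state[sys * NVARS + i] = y[i];
}
```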