I’m trying to implement a beamforming imaging algorithm on a GPU. Each pixel in the final image is calculated by taking a specific vector, pre-multiplying it by a (constant) matrix, and then post-multiplying the result by the same vector, giving a single scalar value.
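For reference, the per-pixel computation I’m describing is the quadratic form v^H * A * v. Here is a minimal NumPy sketch of it (illustrative only — the real code runs on the GPU, and the function name and toy sizes are mine):

```python
import numpy as np

def pixel_value(A, v):
    """Quadratic form v^H A v for one pixel: pre-multiply by the
    (constant) matrix A, then post-multiply by the same vector,
    yielding a single complex scalar."""
    return np.vdot(v, A @ v)  # np.vdot conjugates its first argument

# Toy size for illustration; the real matrix is 450 x 450 complex.
n = 4
rng = np.random.default_rng(0)
A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
v = rng.standard_normal(n) + 1j * rng.standard_normal(n)
print(pixel_value(A, v))
```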
I currently have a version running in MATLAB (which has well-optimised matrix multiplication routines) that takes around 60 seconds to complete.
Amongst other CPU versions, I’ve done a GPU version too. For this, I simply loop through all the pixels and use CUBLAS to do the multiplications. The problem is that I don’t get any speed-up: the program still takes around 55 seconds to complete on a GTX 285.
I’ve done some profiling and tried to calculate the Gflops for the vector-matrix multiplication, which is the slowest step. The profiler gave a GPU time of roughly 190 microseconds for each multiplication, which is a single-precision complex vector times a complex matrix. The matrix is 450 x 450 elements.
Each complex multiplication is four real multiplications and two real additions, making 6 flops per element of the matrix. The products then need to be summed down to a vector, adding roughly another 450 x 450 complex additions, or 450 x 450 x 2 real flops. I calculate this as 1.62e6 flops in total. That means the complex CUBLAS multiplication is achieving 8.53 Gflops/s. CUDA-Z (http://cuda-z.sourceforge.net/) reports a maximum of 705 Gflops/s for my card. While I don’t expect to reach that with a realistic algorithm, I should be able to do better than I’m doing!
Can anyone suggest what is going wrong? Is my calculation actually correct (I haven’t done this much before)? Is there any way of improving the speed? Am I correct in assuming that the CUBLAS routines are as optimised as possible?
Thanks a lot!