I’m trying to implement a beamforming imaging algorithm on a GPU. Each pixel in the final image is calculated by taking a pixel-specific vector, premultiplying it by a (constant) matrix, then postmultiplying the result by the same vector to get a single value.
I’ve currently got a version running in MATLAB (which has pretty optimised matrix multiplication routines) which takes around 60 seconds to complete.
Amongst other CPU versions, I’ve done a GPU version too. For this, I simply loop through all the pixels and use cuBLAS to do the multiplications. The problem is that I don’t get any speedup - the program still takes around 55 seconds to complete on a GTX 285.
I’ve done some profiling and tried to calculate the Gflop/s for the matrix-vector multiplication, which is the slowest step. The profiler gave a GPU time of roughly 190 microseconds for each matrix-vector multiplication, which is a single-precision complex matrix times a single-precision complex vector. The matrix is 450 x 450 elements.
Each complex multiplication is 4 real multiplications and 2 real additions, making 6 flops per element in the matrix. The results then need to be summed down to a vector, making roughly another 450 x 450 complex additions, or 450 x 450 x 2 real additions. In total that is 450 x 450 x 8 = 1.62e6 flops per multiplication. At 190 microseconds each, this means the complex cuBLAS multiplication is getting 8.53 Gflop/s. CUDA-Z (http://cuda-z.sourceforge.net/) reports I can get a maximum of 705 Gflop/s from my card. While I don’t expect to get this with a realistic algorithm, I should be able to do better than I’m doing!
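The arithmetic above can be checked with a few lines of C++. This is only a sketch: the function names `cgemv_flops` and `gflops` are made up for illustration, and the flop counts are the ones assumed in this post (6 real flops per complex multiply, 2 per complex add), not anything reported by the profiler.

```cpp
#include <cassert>

// Estimated real flops for one n x n complex matrix-vector product.
// Assumes one complex multiply (6 flops) and one complex add (2 flops)
// per matrix element, as counted in the post.
double cgemv_flops(int n) {
    assert(n > 0);
    const double elements = static_cast<double>(n) * n;
    return elements * (6.0 + 2.0);
}

// Achieved rate in Gflop/s, given a flop count and a kernel time
// in microseconds (e.g. the GPU time shown by the profiler).
double gflops(double flops, double time_us) {
    assert(time_us > 0.0);
    return flops / (time_us * 1e-6) / 1e9;
}
```

With `n = 450` and a time of 190 microseconds, `cgemv_flops(450)` gives 1.62e6 and `gflops(cgemv_flops(450), 190.0)` gives roughly 8.53.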
Can anyone suggest what is going wrong? Is my calculation actually correct (I’ve not done this much before)? Is there any way of improving the speed? Am I correct in assuming that the cuBLAS routines are as optimised as possible?
Thanks for your reply. I’ll try to address your questions below.
Yes.
The constant matrix is 450x450.
Yes. This vector is different for each pixel.
Yes.
The function I use for M*V(x,y) is cublasCgemv and for the dot product is cublasCdotu. The performance estimate in my previous post is specifically for cublasCgemv.
Both the vector and the matrix are single-precision complex floats. The vector is 450 elements long.
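For anyone wanting to check the GPU results, here is a minimal CPU sketch of what the two calls compute for one pixel. The helper names (`gemv_ref`, `dotu_ref`, `pixel_value`) are hypothetical, and the sketch assumes the matrix is stored column-major with no transpose, alpha = 1 and beta = 0 in the gemv step, and an unconjugated dot product (which is what cublasCdotu computes). It is a reference for correctness only, not a fast implementation.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using cfloat = std::complex<float>;

// w = M * v for an n x n matrix stored column-major
// (the layout cuBLAS expects). Mirrors the gemv step.
std::vector<cfloat> gemv_ref(const std::vector<cfloat>& M,
                             const std::vector<cfloat>& v) {
    const std::size_t n = v.size();
    assert(M.size() == n * n);
    std::vector<cfloat> w(n, cfloat(0.0f, 0.0f));
    for (std::size_t col = 0; col < n; ++col)
        for (std::size_t row = 0; row < n; ++row)
            w[row] += M[col * n + row] * v[col];
    return w;
}

// Unconjugated dot product: sum of a[i] * b[i]. Mirrors the dotu step.
cfloat dotu_ref(const std::vector<cfloat>& a, const std::vector<cfloat>& b) {
    assert(a.size() == b.size());
    cfloat acc(0.0f, 0.0f);
    for (std::size_t i = 0; i < a.size(); ++i)
        acc += a[i] * b[i];
    return acc;
}

// Value of one pixel: v^T * M * v (no conjugation of v).
cfloat pixel_value(const std::vector<cfloat>& M,
                   const std::vector<cfloat>& v) {
    return dotu_ref(v, gemv_ref(M, v));
}
```

In the real problem `M` would hold 450 x 450 entries and `v` 450 entries, with `pixel_value` called once per pixel using that pixel's vector.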
I don’t think you want to hijack a 7-year-old thread.
Post a new topic with a meaningful title and people will be able to find your post much more easily.
In this thread you’ll mostly find people interested in beamforming imaging.
Hint: if you google for a cublasCgemv tutorial, you should find an example for cublasSgemv, which works on single-precision floats instead of cuComplex. That code could be modified to work with complex numbers.