I am using the level-3 BLAS function ssyrk(), which multiplies a matrix by its transpose. My matrix is n x 3. I do not see any performance gain from doing this operation on the GPU; my CPU multiplication is faster. Is there a reason for this? I would greatly appreciate any help.



Depending on the size of your n, the CPU version may well be faster.

For small n, most of the overhead comes from transferring the data to the card, not from the computation itself. The CPU BLAS ssyrk is probably quite fast already, so the parallel speedup may not be realized.
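To put rough numbers on the transfer argument, here is a back-of-the-envelope sketch in Python. It is only an estimate under stated assumptions: the function names are made up for illustration, the input is assumed to be uploaded once and the full result downloaded, and the 3 GB/s bus bandwidth is a placeholder, not a measurement of your system. Kernel launch and copy latency are not modeled at all.

```python
def syrk_cost(n, k, small_result=True):
    """Rough flop and transfer byte counts for single-precision SYRK on an n x k matrix A.

    small_result=True  models C = A^T * A (k x k result);
    small_result=False models C = A * A^T (n x n result).
    """
    order, inner = (k, n) if small_result else (n, k)
    # One triangle has order*(order+1)/2 entries, each a dot product of
    # length `inner` costing about 2*inner flops.
    flops = order * (order + 1) * inner
    # Assumption: A is uploaded once and the full order x order result is
    # downloaded, at 4 bytes per float.
    bytes_moved = 4 * (n * k + order * order)
    return flops, bytes_moved


def transfer_bound_gflops(flops, bytes_moved, bandwidth_gb_per_s):
    """Upper bound on GFLOP/s if the host<->device copy were the only cost."""
    return flops / bytes_moved * bandwidth_gb_per_s


if __name__ == "__main__":
    ASSUMED_BANDWIDTH = 3.0  # GB/s, hypothetical; measure your own system
    for n in (1_000, 100_000):
        for small in (True, False):
            flops, moved = syrk_cost(n, 3, small)
            bound = transfer_bound_gflops(flops, moved, ASSUMED_BANDWIDTH)
            shape = "k x k" if small else "n x n"
            print(n, shape, round(flops / moved, 3), round(bound, 2))
```

With k = 3, either form of the product does about one flop or less per byte moved across the bus. Under this model the copy alone caps the achievable rate at roughly the bus bandwidth in GFLOP/s, no matter how fast the kernel itself runs, which is well within reach of a tuned CPU BLAS.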

Also, are you running directly on the card or in emulation mode? Do you have any debug flags set?