Can memory access patterns inside CUBLAS code be optimized?
As a concrete example, I profiled the simpleCUBLAS sample application, which multiplies two NxN matrices with the following call:
cublasSgemm('n', 'n', N, N, N, alpha, d_A, N, d_B, N, beta, d_C, N);
When N=200, the profiler reports uncoalesced stores: gst_uncoalesced is 6400 (with gst_coalesced=800). This was a surprise, because I had expected CUBLAS code to use only coalesced operations. Can this be fixed, and should I be worried about it? (For example, should I try to write my own matrix multiply kernel that uses shared memory and avoids uncoalesced accesses?)
I searched the forum but couldn’t find an answer. Any pointers would be appreciated. Thanks.