Why is my cuBLAS code extremely slow compared to TensorFlow's version?

I’m trying to profile some code for a neural network, and to get more accurate profiling results I implemented the whole network in CUDA with cuDNN + cuBLAS. The results are correct, and the execution times of all operations are similar to TensorFlow’s (measured with its built-in tracing tools), except for the matrix multiplication.

The matrices are 100x(24\*24\*32) and (24\*24\*32)x1, and the multiplication takes about 90us (microseconds) in TensorFlow, while in my code it takes 1.82 ms (1820us). Note that the second ‘matrix’ is actually a vector.

I run the gemm kernel as follows:

const float alpha = 1.0f, beta = 0.0f;
cublasSgemm(hnd_cuBLAS, CUBLAS_OP_N, CUBLAS_OP_N,
            1, 100, 24*24*32,
            &alpha,
            dev_B, 1,
            dev_A, 24*24*32,
            &beta,
            dev_C, 1);

where the matrices are passed in reverse order (computing B\*A, as suggested in the CUDA samples, to account for cuBLAS’s column-major layout). The outputs are correct (element-by-element comparison + L1 norm). Is there something else I should take into account?

UPDATE: never mind, solved by correct use of the transpose op (CUBLAS_OP_T).

I tried reordering the matrices as follows:

const float alpha = 1.0f, beta = 0.0f;
cublasSgemm(hnd_cuBLAS, CUBLAS_OP_N, CUBLAS_OP_N,
            100, 1, 24 * 24 * 32, // hA, wB, hB/wA
            &alpha,
            dev_A, 100,
            dev_B, 24 * 24 * 32,
            &beta,
            dev_C, 100);

This runs at the same speed as the TensorFlow version, but the computed result is wrong: with A and B in row-major layout, A is effectively flattened column-wise and then reshaped back into that size before the product is taken. Is there any way to use the CUBLAS_OP transpose flags to get the correct result of A\*B? I tried all 3 other combinations, but none of them works.