I’m profiling code for a neural network, and to get more accurate profiling results I reimplemented the whole network in CUDA with cuDNN + cuBLAS. The outputs are correct, and the execution times of all operations are similar to TensorFlow’s (measured with its built-in tracing tools), except for the matrix multiplication.

The matrices are 100x(24\*24\*32) and (24\*24\*32)x1. The multiplication takes about 90 µs in TensorFlow, while in my code it takes 1.82 ms (1820 µs). Note that the second ‘matrix’ is actually a vector.

I launch the GEMM as follows:

```c
// Column-major C (1 x 100) = B (1 x k) * A (k x 100), with k = 24*24*32,
// which gives the row-major product C = A * B.
cublasSgemm(hnd_cuBLAS, CUBLAS_OP_N, CUBLAS_OP_N,
            1, 100, 24 * 24 * 32,   // m, n, k
            &alpha,
            dev_B, 1,               // B and its leading dimension
            dev_A, 24 * 24 * 32,    // A and its leading dimension
            &beta,
            dev_C, 1);              // C and its leading dimension
```

where the matrices are passed in reverse order (effectively computing B\*A, as suggested in the cuBLAS samples, to account for column-major storage). The outputs are correct (element-by-element comparison + L1 norm). Is there anything else I should take into account?

UPDATE: never mind, solved by correct use of the transpose op (CUBLAS_OP_T).