I’m trying to profile some code for a neural network, and in order to get more accurate profiling results, I tried to implement the whole neural network in CUDA with cuDNN + cuBLAS. The results are correct and everything, and execution times for all operations are similar to TensorFlow’s (I use the built-in tracing tools), except for the matrix multiplication part.
The matrices are 100x(24*24*32) and (24*24*32)x1, i.e. 100x18432 and 18432x1, and the multiplication takes about 90 us (microseconds) in TensorFlow, while in my code it takes about 1.82 ms (1820 us). Note that the second 'matrix' is actually a vector.
I run the gemm kernel as follows:
cublasSgemm(hnd_cuBLAS,
            CUBLAS_OP_N, CUBLAS_OP_N,
            1, 100, 24*24*32,      // m, n, k
            &alpha,
            dev_B, 1,              // B, ldb
            dev_A, 24*24*32,       // A, lda
            &beta,
            dev_C, 1);             // C, ldc
where the operands are passed in reverse order (computing B*A, the usual trick for row-major data with column-major cuBLAS, as suggested in the CUDA samples). The outputs are correct (element-by-element comparison + L1 norm). Is there something else I should take into account?
UPDATE: never mind, solved by correct use of the T op (CUBLAS_OP_T).
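For reference, a sketch of the kind of call the T op fix amounts to (assumptions: dev_A, dev_B, dev_C, alpha, beta are as in the call above, with alpha = 1.0f and beta = 0.0f). Since A is stored row-major as 100x(24*24*32), cuBLAS sees that same buffer, column-major, as a (24*24*32)x100 matrix; asking for its transpose yields the intended 100-row operand directly, which lets cuBLAS use the much faster tall-skinny/transposed path instead of the degenerate m=1 GEMM:

```cpp
// C (100 x 1) = op(A) (100 x k) * B (k x 1), with op(A) = A^T
const int k = 24 * 24 * 32;    // 18432
cublasSgemm(hnd_cuBLAS,
            CUBLAS_OP_T, CUBLAS_OP_N,
            100, 1, k,         // m, n, k
            &alpha,
            dev_A, k,          // A as stored: k x 100 column-major, lda = k
            dev_B, k,          // ldb
            &beta,
            dev_C, 100);       // ldc
```

Since the second operand is really a vector, an equivalent (and arguably clearer) alternative is cublasSgemv with CUBLAS_OP_T on the same layout: cublasSgemv(hnd_cuBLAS, CUBLAS_OP_T, k, 100, &alpha, dev_A, k, dev_B, 1, &beta, dev_C, 1).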