Float3 Matrix Multiplication

I want to perform matrix multiplication where elements are of float3 type.

I can modify the MatrixMul.cu SDK sample (which uses shared memory) to achieve this, but the CUBLAS library is at least 4x faster than the MatrixMul.cu code. I looked at the CUBLAS API, and it doesn’t seem to accept float3 as an element type. Is there a built-in CUDA API for this kind of multiplication?
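
To make the question concrete, here is a minimal host-side sketch of the operation, assuming that the product of two float3 elements is component-wise (this is an assumption, since a float3 product could be defined in other ways). Under that assumption the float3 product splits into three independent float products, one per component, so each plane could in principle be passed to an ordinary single-precision GEMM such as cublasSgemm. The names Float3, matmulFloat3, extractPlane and matmulFloat are hypothetical plain C++ stand-ins, not part of CUDA or CUBLAS.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Plain C++ stand-in for CUDA's float3 so the sketch compiles on the host.
struct Float3 { float x, y, z; };

// Reference C = A * B for row-major n x n matrices of Float3.
// ASSUMPTION: the product of two elements is component-wise.
std::vector<Float3> matmulFloat3(const std::vector<Float3>& A,
                                 const std::vector<Float3>& B,
                                 std::size_t n) {
    assert(A.size() == n * n && B.size() == n * n);
    std::vector<Float3> C(n * n, Float3{0.0f, 0.0f, 0.0f});
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t k = 0; k < n; ++k) {
                const Float3& a = A[i * n + k];
                const Float3& b = B[k * n + j];
                Float3& c = C[i * n + j];
                c.x += a.x * b.x;
                c.y += a.y * b.y;
                c.z += a.z * b.z;
            }
    return C;
}

// Copy one component (0 = x, 1 = y, 2 = z) into a plain float matrix.
std::vector<float> extractPlane(const std::vector<Float3>& M, int component) {
    std::vector<float> plane(M.size());
    for (std::size_t i = 0; i < M.size(); ++i)
        plane[i] = component == 0 ? M[i].x
                 : component == 1 ? M[i].y
                                  : M[i].z;
    return plane;
}

// Ordinary row-major float matrix product; this is the step that a
// float GEMM routine would replace on the GPU.
std::vector<float> matmulFloat(const std::vector<float>& A,
                               const std::vector<float>& B,
                               std::size_t n) {
    assert(A.size() == n * n && B.size() == n * n);
    std::vector<float> C(n * n, 0.0f);
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            for (std::size_t k = 0; k < n; ++k)
                C[i * n + j] += A[i * n + k] * B[k * n + j];
    return C;
}
```

With this layout, matmulFloat(extractPlane(A, 0), extractPlane(B, 0), n) gives the x components of matmulFloat3(A, B, n), and likewise for y and z. The cost of this approach is the extra pass that converts the array of float3 into three separate float arrays and back.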