Lower limit of cublasSgemm


For my bachelor's thesis I want to compare the performance of cublasSgemm to my own version across several input sizes. For that I need to test a matrix-vector multiplication (input matrix sizes: A = (M, N), B = (N, 1)) computed by the Sgemm routine !!NOT SGEMV!!

My code is:

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>
#include <time.h>
#include "cublas_v2.h"

#define M 4096
#define N 1
#define K 4096

void callCuBLASKernel(const float *A, const float *B, float *C) {

	float *d_A, *d_B, *d_C;

	cudaMalloc((void**)&d_A, M*K*sizeof(float));
	cudaMalloc((void**)&d_B, N*K*sizeof(float));
	cudaMalloc((void**)&d_C, M*N*sizeof(float));

	cudaMemcpy(d_A, A, M*K*sizeof(float), cudaMemcpyHostToDevice);
	cudaMemcpy(d_B, B, N*K*sizeof(float), cudaMemcpyHostToDevice);

	const float alpha = 1.0f, beta = 0.0f;
	cublasHandle_t handle;
	cublasCreate(&handle);	// the handle must be created before any cuBLAS call

	// execute cuBLAS (T for transpose)
	cublasSgemm_v2(handle, CUBLAS_OP_T, CUBLAS_OP_T, M, N, K, &alpha, d_A, K, d_B, M, &beta, d_C, M);

	cudaMemcpy(C, d_C, M*N*sizeof(float), cudaMemcpyDeviceToHost);

	cublasDestroy(handle);
	cudaFree(d_A);
	cudaFree(d_B);
	cudaFree(d_C);
}
When I test this with N < 32 the Sgemm routine seems not to do anything (with N > 32 it works fine). I ran the binary under nvprof and the kernel isn't listed.

So my question:

Am I doing anything wrong here, or is there a lower limit on the Sgemm routine so that it isn't possible to simulate the SGEMV routine with SGEMM?



If you want to provide complete code, I'll take a look.

You don't seem to be doing any proper CUDA or cuBLAS error checking. Run your code with cuda-memcheck and test the return value of each CUDA and cuBLAS call. I suspect you'll see all sorts of errors if you do.

Also indicate which CUDA version you are using.

I don't think the leading dimension parameter of your B matrix (i.e. the vector) is correct. I think it should be 1 (i.e. N), not M.

Thank you, the problem was the wrong ldb. It works now.