cublasSgeam: Matrix Transpose Issue

Hello there,
while playing around with cublasSgeam I encountered a problem when trying to transpose the first-argument matrix A.

This is what I do: I have a vector of 256 elements in device memory, set by cublasSetVector(), and a matrix whose 5000 columns (since the format is column-major) each hold this vector, so the matrix is 256x5000 (RxC). The matrix is set by cublasSetVectorAsync() and the whole thing works fine; I have tested it by copying the matrix back to host memory with cudaMemcpy() and cublasGetMatrix(). Since the cuBLAS library uses column-major format, at some point in my code I'd like to transpose the matrix, so I'm testing that functionality.
And here's the issue: when using cublasSgeam() with CUBLAS_OP_T for matrix A, I always get the CUBLAS_STATUS_INVALID_VALUE status and nothing is executed at all, while using CUBLAS_OP_N works fine.

The cublasSgeam call with CUBLAS_OP_N (cudatest() and cublastest() just print the error and the given string):

void printBlDMatrix(const float *matrix, const int rows, const int cols)
{
	cudaDeviceSynchronize();
	cudatest(cudaPeekAtLastError());
	float *h_out, *d_out;
	const float a=1, b=0;
	h_out = new float[rows*cols];
	cudaMalloc(&d_out, sizeof(float) * rows*cols);
	cublasHandle_t handle;
	cublasCreate(&handle);
	cudaDeviceSynchronize();
	cudatest(cudaPeekAtLastError(), "printMatrix00:") ;
	cublastest(cublasSgeam(handle,
				CUBLAS_OP_N, CUBLAS_OP_N,
				rows, cols, 
				&a, 
				matrix, rows, 
				&b, 
				m, rows, 
				d_out, rows), "printMatrix01:" );
	cudaDeviceSynchronize();
	cudatest(cudaPeekAtLastError(), "printMatrix02: ");
	cublastest(cublasGetMatrix(rows, cols, 4, d_out, rows, h_out, rows), "printMatrix03: ");
	for(unsigned int i=0; i < 1024; i++) {
		if(i%r == 0) std::cout << std::endl;
		std::cout << i << ":" << h_out[i] << " ";
	}
	cudaDeviceSynchronize();
	cudatest(cudaPeekAtLastError());
	cublasDestroy(handle);
	delete[] h_out;
	cudaFree(d_out);
}

This works fine!

The cublasSgeam call with CUBLAS_OP_T:

void printBlDMatrix(const float *matrix, const int rows, const int cols)
{
	cudaDeviceSynchronize();
	cudatest(cudaPeekAtLastError());
	float *h_out, *d_out;
	const float a=1, b=0;
	h_out = new float[rows*cols];
	cudaMalloc(&d_out, sizeof(float) * rows*cols);
	cublasHandle_t handle;
	cublasCreate(&handle);
	cudaDeviceSynchronize();
	cudatest(cudaPeekAtLastError(), "printMatrix00: ");
	cublastest(cublasSgeam(handle,
				CUBLAS_OP_T, CUBLAS_OP_N,
				cols, rows, 
				&a, 
				matrix, rows, 
				&b, 
				m, rows, 
				d_out, cols), "printMatrix01: ");
	cudaDeviceSynchronize();
	cudatest(cudaPeekAtLastError(), "printMatrix02: ");
	cublastest(cublasGetMatrix(cols, rows, 4, d_out, cols, h_out, cols), "printMatrix03: ");
	for(unsigned int i=0; i< 1024; i++) {
		if(i%r == 0) std::cout << std::endl;
		std::cout << i << ":" << h_out[i] << " ";
	}
	cudaDeviceSynchronize();
	cudatest(cudaPeekAtLastError());
	cublasDestroy(handle);
	delete[] h_out;
	cudaFree(d_out);
}

This gives me:

printMatrix01: CUBLAS_STATUS_INVALID_VALUE,
MATRIX-OUTPUT,
** On entry to SGEAM parameter number 6 had an illegal value

I don't care whether the call to cublasGetMatrix() is correct or not, since the error occurs earlier, but I have tried all permutations of rows & cols, e.g.:

cublastest(cublasSgeam(handle,
	CUBLAS_OP_T, CUBLAS_OP_N,
	cols, rows,
	&a,
	matrix, rows,
	&b,
	m, rows,
	d_out, rows), "printMatrix01: ");

printMatrix01: CUBLAS_STATUS_INVALID_VALUE,
MATRIX-OUTPUT,
** On entry to SGEAM parameter number 6 had an illegal value

or

cublastest(cublasSgeam(handle,
	CUBLAS_OP_T, CUBLAS_OP_N,
	rows, cols,
	&a,
	matrix, rows,
	&b,
	m, rows,
	d_out, rows), "printMatrix01: ");

printMatrix01: CUBLAS_STATUS_INVALID_VALUE,
MATRIX-OUTPUT,
** On entry to SGEAM parameter number 5 had an illegal value

As you can see in the output of the last example, the “illegal” value changes its position as I permute the rows & cols arguments. One thing that drives me crazy is that I don't know whether it is counted from 0 or 1; is the matrix the illegal argument?

Could you please help me? I'd like to know whether I'm just overlooking something or whether this is a bug, so that I know I'm not crazy :)

I'm using Visual Studio 10 Premium on Windows 7 x64 with CUDA Toolkit 5.0.

Thanks in anticipation,
hanneshansen

The parameter numbers in the error messages are incorrect, but you are indeed passing an invalid value. We will fix that in the next release. “Parameter number 5” should be “Parameter number 7” (lda), and “Parameter number 6” should be “Parameter number 10” (ldb).

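When you set transpose mode, you need to make sure that the leading dimension (lda or ldb) is >= the number of rows of the matrix A or B as it is stored. For a transposed operand that is n, the number of columns of op(A) or op(B); for a non-transposed operand it is m.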

SOLVED!!
Wrong USE of CUBLAS_OP_T, CUBLAS_OP_N because of stupidity :)
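
For later readers: the index arithmetic behind the lda requirement can be sketched on the host. This is only a CPU reference for the transpose case with beta = 0, not how cuBLAS implements it; `idx()` and `transposeColMajor()` are names made up for this example.

```cpp
#include <cstddef>

// Offset of element (r, c) in a column-major matrix with leading dimension ld.
inline std::size_t idx(int r, int c, int ld)
{
    return static_cast<std::size_t>(c) * ld + r;
}

// Hypothetical CPU reference for C = alpha * A^T in column-major storage.
// C is m x n with leading dimension ldc (ldc >= m).
// A is stored as n x m with leading dimension lda (lda >= n).
void transposeColMajor(int m, int n, float alpha,
                       const float* A, int lda,
                       float* C, int ldc)
{
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            // Element (i, j) of the result comes from element (j, i) of A,
            // so the row index into A runs up to n - 1.
            C[idx(i, j, ldc)] = alpha * A[idx(j, i, lda)];
        }
    }
}
```

For the matrix in this thread that means m = cols, n = rows, lda = rows and ldc = cols. The row index into A goes up to n - 1, which is why lda has to be at least n when A is transposed.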