Multiply large matrices with cublasSgemm

Hi all,

I want to multiply two large matrices. I have GeForce GTX 950 and OS Windows. This is my attempt with cublasSgemm from cublas_v2.h :

#include <cstdlib>
#include <cuda_runtime.h>
#include <cstdio>
#include <curand_kernel.h>
#include <ctime>
#include <cublas_v2.h>
#include <iostream>
#include <string>

#define MAX 10

// Fill the array A(nr_rows_A, nr_cols_A) with random numbers on GPU
void GPU_fill_rand(float *A, int nr_rows_A, int nr_cols_A) {
	// Create a pseudo-random number generator
	curandGenerator_t prng;
	curandCreateGenerator(&prng, CURAND_RNG_PSEUDO_DEFAULT);

	// Set the seed for the random number generator using the system clock
	curandSetPseudoRandomGeneratorSeed(prng, 1234ULL);

	// Fill the array with random numbers on the device
	curandGenerateUniform(prng, A, nr_rows_A * nr_cols_A);

__global__ void randomMatrix(unsigned int seed, float* result_M, int row, int col) {
	curandState_t st;
	curand_init(seed, /* the seed controls the sequence of random values that are produced */
		0, /* the sequence number is only important with multiple cores */
		0, /* the offset is how much extra we advance in the sequence for each call, can be 0 */

	for (int i = 0; i < row; ++i) {
		for (int j = 0; j < col; ++j) {
			result_M[i*col + j] = curand(&st) % MAX;


// Multiply the arrays A and B on GPU and save the result in C
// C(m,n) = A(m,k) * B(k,n)
void gpu_blas_mmul(const float *A, const float *B, float *C, const int m, const int k, const int n) {
	int lda = m, ldb = k, ldc = m;
	const float alf = 1;
	const float bet = 0;
	const float *alpha = &alf;
	const float *beta = &bet;

	// Create a handle for CUBLAS
	cublasHandle_t handle;

	// Do the actual multiplication 
		m, n, k,
		A, lda,
		B, ldb,
		C, ldc);
	// Destroy the handle

int main(int argc, char* argv[])
	int dim = std::stoi(argv[1]); 
	int nr_rows_A, nr_cols_A, nr_rows_B, nr_cols_B, nr_rows_C, nr_cols_C;

	nr_rows_A = nr_cols_A = nr_rows_B = nr_cols_B = nr_rows_C = nr_cols_C = dim;
	float *h_A = (float *)malloc(nr_rows_A * nr_cols_A * sizeof(float));
	float *h_B = (float *)malloc(nr_rows_B * nr_cols_B * sizeof(float));
	float *h_C = (float *)malloc(nr_rows_C * nr_cols_C * sizeof(float));

	float *d_A, *d_B, *d_C;
	cudaMalloc(&d_A, nr_rows_A * nr_cols_A * sizeof(float));
	cudaMalloc(&d_B, nr_rows_B * nr_cols_B * sizeof(float));
	cudaMalloc(&d_C, nr_rows_C * nr_cols_C * sizeof(float));

	randomMatrix<<<1,1>>>(time(NULL), d_A, nr_rows_A, nr_cols_A);
	randomMatrix<<<1, 1>>>(time(NULL) + 100000, d_B, nr_rows_B, nr_cols_B);

	// Multiply A and B on GPU
	gpu_blas_mmul(d_A, d_B, d_C, nr_rows_A, nr_cols_A, nr_cols_B);

	// Copy (and print) the result on host memory
	cudaMemcpy(h_C, d_C, nr_rows_C * nr_cols_C * sizeof(float), cudaMemcpyDeviceToHost);

	//Free GPU memory

	// Free CPU memory

	return 0;

But this code works only for matrices not more that 64 x 64. After cuda-memckeck run, I get next messages:

========= Program hit cudaErrorInvalidDeviceFunction (error 8) due to "invalid device function" on CUDA API call to cudaLaunch.
=========     Saved host backtrace up to driver entry point at error
========= Program hit cudaErrorInvalidDeviceFunction (error 8) due to "invalid device function" on CUDA API call to cudaGetLastError.
=========     Saved host backtrace up to driver entry point at error
========= ERROR SUMMARY: 2 errors

I can’t understand why and how to multiply matrices more large? Please, help me!

I don’t really believe this is the exact code you are running. What you’ve pasted here will not compile. It’s really a lot better if you verify what you’re pasting, if you’re going to claim that it doesn’t work.

Which CUDA version are you using?

Those randomMatrix kernels are going to take a long time to run as the matrix sizes get larger. On windows, that could easily lead to a WDDM TDR timeout (google that).

What happens if you comment those kernel calls out for test purposes? Do you still get runtime errors?

What device(s) are you compiling for? That is, in your windows project settings, what are the arguments for compute_XX and sm_XX ?

Tell me please, why this code will not be compile? Before posting, I compiled and ran it, and it work but only for small matrices.

I use CUDA 8.0.

Nothing change after commenting kernel calls.

It uses parameters for my device: compute_52,sm_52

Same problem I get when run from CUDA Samples (Samples_vs2015)

I compiled the code, as posted, with CUDA 8.0, for sm_52, and got: error: no operator “<<” matches these operands
operand types are: cublasStatus_t << const char [2]

Oops, fixed. Now it compile

If you are serious about wanting help, then:

  1. Copy the code out of your posting above
  2. Start a new, empty CUDA project
  3. Paste the code into that new empty cuda project
  4. Don’t make any other changes
  5. try to compile the code

Then you will be able to answer your own question instead of asking others to tell you that which is obvious, using the tools available to you.

Did you think those who were trying to help you would do something different?

larger Sgemm calls will likely hit the WDDM TDR timeout on windows

Have you modified or disabled the WDDM TDR timeout on your system? (hint: google “WDDM TDR Timeout”)

Indeed, this is my mistake, next time I will be more attentive.

I change value TdrDelay in registry from 2 seconds to 60 seconds and reboot computer but nothing changed.

Is the GPU in question a GTX 950 or a GTX 950M? Note: GTX 950M is not an sm_52 device. GeForce GTX 950M is an sm_50 device. Might be worthwhile to build for sm_50. “invalid device function” error often indicates code that is not compatible with the GPU it is supposed to run on.

On a GTX 950, SGEMM calls would have to operate on very large matrices before they hit the watchdog timer limit. As a guess, around 16K x 16K. Far away from 64 x 64.

Since bugs in CUBLAS are very unlikely (it is one of the most heavily used CUDA libraries, any bugs would be noticed quite quickly), it stands to reason that there is a bug in the OP’s code. Try simplifying the code (e.g. get rid of the random number part) until the problem is pin-pointed.