sgemm - crashing at 1024x1024

Hey guys…

I am using the CUBLAS function sgemm and wanted to do a little speed test. I can run tens of thousands of iterations with 512x512 matrices, but I cannot run even 10 with 1024x1024 matrices: I get a display driver error and my program exits. I really do not understand why I can do 1 iteration with 1024x1024 but not 20 or 30. I think it has problems copying the data to GPU memory, but I don’t understand how the number of iterations affects this.

#include <stdio.h>
#include <cuda.h>
#include <cublas.h>
#include <stdlib.h>
#include <time.h>
#include <cutil_inline.h>

#define QM 1024
#define W_A QM
#define H_A QM
#define W_B QM
#define H_B W_A
#define W_C W_B
#define H_C H_A
#define ITERS 20

int main(int argc, char **argv) {
	clock_t start, end;
	double diff, diffs, speed;
	cublasStatus stat;

	printf("|=================================|\n");
	printf("|CUBLAS FUNCTION SGEMM COMPUTATION|\n");
	printf("|=================================|\n\n");

	stat = cublasInit();
	if(stat != CUBLAS_STATUS_SUCCESS) {
		printf("Cublas not initialized ..\n");
		return 1;
	}
	printf("Cublas successfully initialized ..\n");

	// allocate device memory
	float* d_A;
	stat = cublasAlloc(W_A * H_A, sizeof(float), (void**) &d_A);
	if(stat != CUBLAS_STATUS_SUCCESS) {
		printf("Mem for m A not get ..\n");
		return 1;
	}

	float* d_B;
	stat = cublasAlloc(W_B * H_B, sizeof(float), (void**) &d_B);
	if(stat != CUBLAS_STATUS_SUCCESS) {
		printf("Mem for m B not get ..\n");
		return 1;
	}

	// allocate device memory for result
	size_t size_C = W_C * H_C;
	size_t mem_size_C = sizeof(float) * size_C;
	float* d_C;
	stat = cublasAlloc(H_C*W_C, sizeof(float), (void**) &d_C);
	if(stat != CUBLAS_STATUS_SUCCESS) {
		printf("Mem for m C not get ..\n");
		return 1;
	}
	// zero C: the SGEMM calls below use beta = 1.0f, so they read C before writing it
	cutilSafeCall( cudaMemset(d_C, 0, mem_size_C) );
	printf("Device Memory successfully allocated ..\n");

	// allocate host memory for matrices A and B and for the result
	size_t size_A = W_A * H_A;
	size_t mem_size_A = sizeof(float) * size_A;
	float* h_A = (float*) malloc(mem_size_A);
	size_t size_B = W_B * H_B;
	size_t mem_size_B = sizeof(float) * size_B;
	float* h_B = (float*) malloc(mem_size_B);
	float* h_C = (float*) malloc(mem_size_C);
	if(h_A == NULL || h_B == NULL || h_C == NULL) {
		printf("Host Memory not allocated ..\n");
		return 1;
	}
	printf("Host Memory successfully allocated ..\n");

	for(int i = 0; i < W_A * H_A; i++) {
		h_A[i] = 1.0f;
	}
	for(int i = 0; i < W_B * H_B; i++) {
		h_B[i] = 2.0f;
	}

	// copy A and B to the device
	stat = cublasSetMatrix(H_A, W_A, sizeof(float), h_A, H_A, d_A, H_A);
	if(stat == CUBLAS_STATUS_SUCCESS) {
		stat = cublasSetMatrix(H_B, W_B, sizeof(float), h_B, H_B, d_B, H_B);
	}
	if(stat != CUBLAS_STATUS_SUCCESS) {
		printf("Data not copied to GPU-Memory ..\n");
		return 1;
	}
	printf("Data successfully copied to GPU-Memory ..\n");

	//Warmup
	printf("\nPerforming Warmup ..\n");
	for(int it = 0; it < ITERS; ++it) {
		cublasSgemm('n', 'n', H_C, W_C, W_A, 1.0f, d_A, H_A, d_B, H_B, 1.0f, d_C, H_C);
	}
	// the SGEMM calls only queue work; errors from them surface here
	cutilSafeCall( cudaThreadSynchronize() );
	printf("Done.\n\n");

	printf("Performing Computation ..\n");
	start = clock();
	for(int it = 0; it < ITERS; ++it) {
		cublasSgemm('n', 'n', H_C, W_C, W_A, 1.0f, d_A, H_A, d_B, H_B, 1.0f, d_C, H_C);
	}
	cutilSafeCall( cudaThreadSynchronize() );
	end = clock();
	printf("Done.\n\n");

	//cublasGetMatrix(H_C, W_C, sizeof(float), d_C, H_C, h_C, H_C);

	diff = end - start;
	diffs = diff / CLOCKS_PER_SEC;
	// 2.0 and 3.0 force double arithmetic: 2*QM*QM*QM overflows a 32-bit int at QM = 1024
	speed = (ITERS * 1E-9 * ((2.0*QM*QM*QM) + (3.0*QM*QM))) / diffs;

	printf("\nRESULTS OF COMPUTING:\n\n");
	printf("Matrix size: %dx%d\n", H_A, W_A);
	printf("Num Iterations: %d\n", ITERS);
	printf("Elapsed Time [s]: %3f\n", diffs);
	printf("Speed [GFLOP/s]: %2f\n", speed);

	cublasFree(d_A);
	cublasFree(d_B);
	cublasFree(d_C);

	free(h_A);
	free(h_B);
	free(h_C);

	cublasShutdown();
	return 0;
}

Regards.

Runs correctly on my Ubuntu system with 2000 iterations. How much time does it take on yours? Maybe it times out and you need to disable the watchdog timer in Windows.

N.

If you are running on a relatively modest GPU, I am going to take a wild guess that you are hitting the watchdog timer limit. And I am going to further guess that while you might be launching 10 SGEMM kernels, you probably aren’t actually getting any to run to completion. The first one is getting killed off by the watchdog timer, then your program loses its context and exits.

Hmm, it’s not even running for 3 seconds before crashing. It says “the display driver nvlddmkm is not responding…” … Can there be any negative side effects from disabling the watchdog timer?

First of all, thanks for the fast responses … you might be right, my GeForce 8400M GS is not the newest on the market. Sometimes it reaches the warmup loop and does some work before crashing; sometimes it does not even reach the warmup loop.

You have only got 16 cores sitting on a 64-bit memory bus with that part. I can easily imagine a 1024x1024 SGEMM taking 5 seconds on your GPU.
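As a rough back-of-the-envelope check of that kind of estimate, here is a minimal sketch in plain C (no CUDA needed). It uses the same operation count as the benchmark in the first post (2n^3 + 3n^2). The function names are made up for illustration, and the sustained rate and the watchdog limit are inputs you have to supply yourself: neither is a measured figure for the 8400M GS.

```c
#include <assert.h>

/* Floating-point operations in one n x n SGEMM (C = alpha*A*B + beta*C),
   counted as 2n^3 + 3n^2 like the benchmark above. Computed in double
   because 2*1024*1024*1024 does not fit in a 32-bit int. */
double sgemm_flops(int n) {
	assert(n > 0);
	double d = (double) n;
	return 2.0 * d * d * d + 3.0 * d * d;
}

/* Estimated run time in seconds of one SGEMM at an assumed sustained rate
   (in GFLOP/s). The rate is an assumption supplied by the caller. */
double sgemm_seconds(int n, double sustained_gflops) {
	assert(sustained_gflops > 0.0);
	return sgemm_flops(n) / (sustained_gflops * 1e9);
}

/* Returns 1 if the estimated run time of a single SGEMM is longer than
   the watchdog limit (in seconds), 0 otherwise. */
int exceeds_watchdog(int n, double sustained_gflops, double limit_s) {
	assert(limit_s > 0.0);
	return sgemm_seconds(n, sustained_gflops) > limit_s;
}
```

For example, with a purely hypothetical sustained rate of 0.5 GFLOP/s, sgemm_seconds(1024, 0.5) comes out at about 4.3 s while sgemm_seconds(512, 0.5) is about 0.54 s. Against an assumed 2-second limit, the 512x512 case passes and the 1024x1024 case does not, which matches the pattern described in the first post. Measuring the real rate at a size that still runs is the only way to get a meaningful number.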

I don’t understand how that code can fail in the way you describe before the first SGEMM call, however. Either the device memory allocations fail because of a lack of available memory (and you should see your memory diagnostic messages in such a case), or they don’t and the failure comes during the warm up kernel launches. Either way the malloc call should only take a few tens of microseconds and shouldn’t interact with the active display driver. It sounds like either your diagnosis is a little off, or you have some serious underlying operating system instabilities. What OS are you using?

I am using Windows Vista Home Premium

Error:
Performing Warmup …
cudaSafeCall() Runtime API error in file <C:.…\sgemm.cu> line 103: the launch was timed out and terminated.

Line 103: cutilSafeCall( cudaThreadSynchronize() ); just after the warmup…

After this, a message pops up: the display driver nvlddmkm is not responding and was successfully restarted …

Exactly as I suggested - the first warm up kernel is timing out and getting killed by the display watchdog timer.

The cudaThreadSynchronize call is returning the last error, which is coming from the first running warm up kernel. CUBLAS calls are asynchronous. All that happens is that ITERS SGEMM calls get queued up. The first one runs while your cudaThreadSynchronize() sits in a spinlock waiting for the kernel(s) to finish; the first one is killed, and you see the error at line 103.
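To make that sequence concrete, here is a toy model in plain C. It does not call CUDA or CUBLAS at all: toy_stream, toy_launch and toy_synchronize are invented names, and the kernel durations and the limit are whatever the caller passes in. It only mimics the behaviour described above, where launches just queue work and the failure shows up at the synchronisation point.

```c
#include <assert.h>

#define TOY_MAX_QUEUE 64

/* Toy model only: none of these names exist in CUDA or CUBLAS. */
typedef struct {
	double queued_s[TOY_MAX_QUEUE]; /* run time of each queued kernel, in seconds */
	int    count;                   /* kernels currently queued */
	int    completed;               /* kernels that ran to completion */
} toy_stream;

/* "Launch": only enqueues the work and returns at once.
   It has no way of reporting that the kernel will later be killed. */
void toy_launch(toy_stream *s, double kernel_seconds) {
	assert(s->count < TOY_MAX_QUEUE);
	s->queued_s[s->count++] = kernel_seconds;
}

/* "Synchronize": runs the queue in order. The first kernel that runs longer
   than the watchdog limit is killed, the context is lost, and the kernels
   queued behind it never run. Returns 0 on success, 1 for a timed-out launch. */
int toy_synchronize(toy_stream *s, double watchdog_limit_s) {
	for (int i = 0; i < s->count; ++i) {
		if (s->queued_s[i] > watchdog_limit_s) {
			s->count = 0;
			return 1;
		}
		s->completed++;
	}
	s->count = 0;
	return 0;
}
```

Queueing 20 kernels of 5 seconds each and then calling toy_synchronize with a 2-second limit returns 1 with completed still at 0: the error is only seen at the synchronize call, and none of the 20 launches actually finished.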

You are either going to have to work out how to disable the watchdog timer, or content yourself with using smaller kernels.
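One way to get smaller kernels without shrinking the problem is to split the product into column panels: each call computes a block of columns of C from the matching columns of B, and the pieces together give the same result as the single large call. The sketch below shows the indexing in plain C. sgemm_ref is a hypothetical CPU stand-in for cublasSgemm('n', 'n', ...) with the same column-major argument order, and sgemm_panels and the panel parameter are made-up names. On the GPU each panel would be its own cublasSgemm call followed by a synchronisation; whether a given panel width actually stays under the watchdog limit depends on the hardware and has to be tried.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical CPU stand-in for cublasSgemm('n', 'n', ...): column-major
   C = alpha*A*B + beta*C, with A m x k, B k x n and C m x n. */
void sgemm_ref(int m, int n, int k, float alpha,
               const float *A, int lda, const float *B, int ldb,
               float beta, float *C, int ldc) {
	for (int j = 0; j < n; ++j) {
		for (int i = 0; i < m; ++i) {
			float acc = 0.0f;
			for (int p = 0; p < k; ++p) {
				acc += A[i + (size_t) p * lda] * B[p + (size_t) j * ldb];
			}
			C[i + (size_t) j * ldc] = alpha * acc + beta * C[i + (size_t) j * ldc];
		}
	}
}

/* Same product issued as several smaller calls. Each call handles at most
   `panel` columns of C, using the matching columns of B and all of A.
   Returns the number of calls made. */
int sgemm_panels(int m, int n, int k, float alpha,
                 const float *A, int lda, const float *B, int ldb,
                 float beta, float *C, int ldc, int panel) {
	assert(panel > 0);
	int calls = 0;
	for (int j0 = 0; j0 < n; j0 += panel) {
		int nj = (n - j0 < panel) ? (n - j0) : panel;
		sgemm_ref(m, nj, k, alpha, A, lda,
		          B + (size_t) j0 * ldb, ldb,
		          beta, C + (size_t) j0 * ldc, ldc);
		++calls; /* on the GPU, a synchronisation point would go here */
	}
	return calls;
}
```

With the 1024x1024 matrices from the first post and panel = 256, this makes 4 calls, each doing roughly a quarter of the work of the original one.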