Problems of matrix multiplication With and without CUDA

Hello

I develop a project with CUDA for my studies, and I encounter some troubles.

  1. With the same matrix and the same calculations, results with CUDA are different of results without CUDA, with a margin of error of one thousandth, but just for a part of the results.

  2. I can calculate the multiplication of 2 square matrix of size 1000*1000, but above the nVidia driver crashes, and I can’t find where is the limit of memory or calculations I can have with my card Quadro FX 880M.

My program :

// Test.cpp : entry point

//

#include "stdafx.h"

#include <time.h>

#include <cuda.h>

#include <cuda_runtime_api.h>

#include "timer.h"

// Multiply Md by Nd = Pd, with "col1d" number of columns of Md, "col2d" number of columns of Nd, 

// "max_thread_for_last" number of used threads in the last block

__global__ void MatrixMulKernel(float *Md, float *Nd, float *Pd, int col1d, int col2d, int max_thread_for_last)

{

	if(blockIdx.x != gridDim.x || threadIdx.x < max_thread_for_last) // If this is not the last block or an existing calculation

	{

		int tId = threadIdx.x + blockIdx.x * blockDim.x;

		float sum = 0.0;

		for(int i = 0; i < col1d; i++) // Matrix calculation

			sum += Md[(tId/col2d)*col1d + i] * Nd[(tId%col2d)+(i*col2d)];

		Pd[tId] = sum;

	}

}

int _tmain(int argc, _TCHAR* argv[])

{

	timer temps; // Timer for comparison

	int affichage = 99999, amplitude = 67, nb_float_stockes = 0; // Variables for display and initialisation of matrix

	/* MULTIPLICATION OF MATRIX WITH CUDA */

	temps.Start(); // Timer start

	

	int lig1=1000, col1=1000, taille1 = lig1*col1; // First matrix : lig1*col1

	int lig2=col1, col2=1000, taille2 = lig2*col2; // Second matrix : lig2*col2

	int taille3 = lig1 * col2; // Result : lig1 * col2

	printf("\n - Nombre de threads : %i", taille3); // Number of used threads

	int size1 = taille1*sizeof(float), size2 = taille2*sizeof(float), size3 = taille3*sizeof(float); // Memory size of each matrix

	float *m, *n, *p; // Matrix

	m = new float[taille1];

	n = new float[taille2];

	p = new float[taille3];

	for(int i = 0; i < taille1; i++) // Fill first matrix

		m[i] = (float)(i%amplitude)/(float)amplitude + 0.5;

	for(int i = 0; i < taille2; i++) // Fill second matrix

		n[i] = (float)(i%amplitude)/(float)amplitude + 0.5;

	nb_float_stockes += taille1 + taille2 + taille3; // Display number of floats to stock in memory

	printf("\n - Nombre d'float stockes : %i", nb_float_stockes);

	float *p_m, *p_n, *p_p; // Pointers of matrix

	

	temps.Loop(); // Intermediate time

		printf("Copie..."); // Copy the matrix into GPU memory

	cudaMalloc((void**) &p_m, size1);

	cudaMalloc((void**) &p_n, size2);

	cudaMalloc((void**) &p_p, size3);

	cudaMemcpy(p_m, m, size1, cudaMemcpyHostToDevice);

	cudaMemcpy(p_n, n, size2, cudaMemcpyHostToDevice);

	temps.Loop();

	int nbThreads = 256, nbBlocks = (taille3 - 1)/nbThreads; // Calculation of distribution of blocks into the grid and threads into a block

	dim3 dimGrid(nbBlocks + 1, 1, 1);

	dim3 dimBlock(nbThreads, 1, 1);

		printf("\nNombre de blocks : %i, Nombre de threads par block : %i", nbBlocks + 1, nbThreads);

	

	temps.Loop();

		printf("Calcul...");

	MatrixMulKernel<<<dimGrid,dimBlock>>>(p_m, p_n, p_p, col1, col2, taille3%nbThreads + 1); // Execution

	cudaStreamSynchronize(0); // Wait for results

	temps.Loop();

	

		printf("Copie...");

	cudaMemcpy(p, p_p, size3, cudaMemcpyDeviceToHost); // Copy into main memory

	temps.Loop();

	

	cudaFree(p_m);

	cudaFree(p_n);

	cudaFree(p_p);

	for(int i=0; i<taille3; i++) // Display of some of the results

		if(i%affichage == 0)

			printf("\nResultat %i : %f", i, p[i]);

	

	temps.End();

	system("PAUSE");

	

	

	/* SAME MULTIPLICATION OF MATRIX WITHOUT CUDA */

	temps.Start();

	

	float *r; // Matrix

	r = new float[taille3];

	temps.Loop();

		printf("Calcul...");

	for(int i = 0; i < lig1; i++)

	{

		for(int j = 0; j < col2; j++)

		{

			r[i*col2 + j] = 0;

			for(int k = 0; k < col1; k++)

				r[i*col2 + j] += m[i*col1 + k] * n[k*col2 + j];

		}

	}

	temps.Loop();

	for(int i=0; i<taille3; i++) // Display of some of the results

		if(i%affichage == 0)

			printf("\nResultat %i : %f", i, r[i]);

	delete r;

	temps.End();

	system("PAUSE");

	return 0;

}

At left : results with CUDA. At right : results without CUDA.

If the image doesn’t display : http://www.hostingpics.net/viewer.php?id=212555CUDA.png

Thank you

Hello

I develop a project with CUDA for my studies, and I encounter some troubles.

  1. With the same matrix and the same calculations, results with CUDA are different of results without CUDA, with a margin of error of one thousandth, but just for a part of the results.

  2. I can calculate the multiplication of 2 square matrix of size 1000*1000, but above the nVidia driver crashes, and I can’t find where is the limit of memory or calculations I can have with my card Quadro FX 880M.

My program :

// Test.cpp : entry point

//

#include "stdafx.h"

#include <time.h>

#include <cuda.h>

#include <cuda_runtime_api.h>

#include "timer.h"

// Multiply Md by Nd = Pd, with "col1d" number of columns of Md, "col2d" number of columns of Nd, 

// "max_thread_for_last" number of used threads in the last block

__global__ void MatrixMulKernel(float *Md, float *Nd, float *Pd, int col1d, int col2d, int max_thread_for_last)

{

	if(blockIdx.x != gridDim.x || threadIdx.x < max_thread_for_last) // If this is not the last block or an existing calculation

	{

		int tId = threadIdx.x + blockIdx.x * blockDim.x;

		float sum = 0.0;

		for(int i = 0; i < col1d; i++) // Matrix calculation

			sum += Md[(tId/col2d)*col1d + i] * Nd[(tId%col2d)+(i*col2d)];

		Pd[tId] = sum;

	}

}

int _tmain(int argc, _TCHAR* argv[])

{

	timer temps; // Timer for comparison

	int affichage = 99999, amplitude = 67, nb_float_stockes = 0; // Variables for display and initialisation of matrix

	/* MULTIPLICATION OF MATRIX WITH CUDA */

	temps.Start(); // Timer start

	

	int lig1=1000, col1=1000, taille1 = lig1*col1; // First matrix : lig1*col1

	int lig2=col1, col2=1000, taille2 = lig2*col2; // Second matrix : lig2*col2

	int taille3 = lig1 * col2; // Result : lig1 * col2

	printf("\n - Nombre de threads : %i", taille3); // Number of used threads

	int size1 = taille1*sizeof(float), size2 = taille2*sizeof(float), size3 = taille3*sizeof(float); // Memory size of each matrix

	float *m, *n, *p; // Matrix

	m = new float[taille1];

	n = new float[taille2];

	p = new float[taille3];

	for(int i = 0; i < taille1; i++) // Fill first matrix

		m[i] = (float)(i%amplitude)/(float)amplitude + 0.5;

	for(int i = 0; i < taille2; i++) // Fill second matrix

		n[i] = (float)(i%amplitude)/(float)amplitude + 0.5;

	nb_float_stockes += taille1 + taille2 + taille3; // Display number of floats to stock in memory

	printf("\n - Nombre d'float stockes : %i", nb_float_stockes);

	float *p_m, *p_n, *p_p; // Pointers of matrix

	

	temps.Loop(); // Intermediate time

		printf("Copie..."); // Copy the matrix into GPU memory

	cudaMalloc((void**) &p_m, size1);

	cudaMalloc((void**) &p_n, size2);

	cudaMalloc((void**) &p_p, size3);

	cudaMemcpy(p_m, m, size1, cudaMemcpyHostToDevice);

	cudaMemcpy(p_n, n, size2, cudaMemcpyHostToDevice);

	temps.Loop();

	int nbThreads = 256, nbBlocks = (taille3 - 1)/nbThreads; // Calculation of distribution of blocks into the grid and threads into a block

	dim3 dimGrid(nbBlocks + 1, 1, 1);

	dim3 dimBlock(nbThreads, 1, 1);

		printf("\nNombre de blocks : %i, Nombre de threads par block : %i", nbBlocks + 1, nbThreads);

	

	temps.Loop();

		printf("Calcul...");

	MatrixMulKernel<<<dimGrid,dimBlock>>>(p_m, p_n, p_p, col1, col2, taille3%nbThreads + 1); // Execution

	cudaStreamSynchronize(0); // Wait for results

	temps.Loop();

	

		printf("Copie...");

	cudaMemcpy(p, p_p, size3, cudaMemcpyDeviceToHost); // Copy into main memory

	temps.Loop();

	

	cudaFree(p_m);

	cudaFree(p_n);

	cudaFree(p_p);

	for(int i=0; i<taille3; i++) // Display of some of the results

		if(i%affichage == 0)

			printf("\nResultat %i : %f", i, p[i]);

	

	temps.End();

	system("PAUSE");

	

	

	/* SAME MULTIPLICATION OF MATRIX WITHOUT CUDA */

	temps.Start();

	

	float *r; // Matrix

	r = new float[taille3];

	temps.Loop();

		printf("Calcul...");

	for(int i = 0; i < lig1; i++)

	{

		for(int j = 0; j < col2; j++)

		{

			r[i*col2 + j] = 0;

			for(int k = 0; k < col1; k++)

				r[i*col2 + j] += m[i*col1 + k] * n[k*col2 + j];

		}

	}

	temps.Loop();

	for(int i=0; i<taille3; i++) // Display of some of the results

		if(i%affichage == 0)

			printf("\nResultat %i : %f", i, r[i]);

	delete r;

	temps.End();

	system("PAUSE");

	return 0;

}

At left : results with CUDA. At right : results without CUDA.

If the image doesn’t display : http://www.hostingpics.net/viewer.php?id=212555CUDA.png

Thank you

Did you check that the memory access is withing the bounds? If you are running also the operating system on the same card it will use some memory as well. There is a function to check how much free memory is on the device:

size_t free, total;

   printf("\n");

   cudaMemGetInfo(&free,&total);   

   printf("%d KB free of total %d KB at the beginning\n",free/1024,total/1024);

Did you check that the memory access is withing the bounds? If you are running also the operating system on the same card it will use some memory as well. There is a function to check how much free memory is on the device:

size_t free, total;

   printf("\n");

   cudaMemGetInfo(&free,&total);   

   printf("%d KB free of total %d KB at the beginning\n",free/1024,total/1024);

cuda compute on single cpu on double so for me it s ok
even if cuda compute in double it s not the same double than cpu

but 6 or 7 x time better it s not good :) you can have 100x better

on http://developer.nvidia.com/gpu-computing-sdk

http://developer.download.nvidia.com/compute/DevZone/C/Projects/matrixMul.zip

you have a matrix mul you can easy change dimension

cuda compute on single cpu on double so for me it s ok
even if cuda compute in double it s not the same double than cpu

but 6 or 7 x time better it s not good :) you can have 100x better

on http://developer.nvidia.com/gpu-computing-sdk

http://developer.download.nvidia.com/compute/DevZone/C/Projects/matrixMul.zip

you have a matrix mul you can easy change dimension

Let us try with smaller examples. Your matrix has 10x10 elements, your nbThreads = 64. Then you have only two blocks. In the second block, 36 threads should compute, the rest nothing. Threads with index from 0 to 35 should compute something.

From code, the argument for the second block, max_thread_for_last = taille3%nbThreads + 1 = 100%64 + 1 = 37

The second block, all threads with index strictly less than max_thread_for_last, 37, compute. That is threads with index from 0 to 36. So you compute one product + summation more than you should be.

Regards,
Elan.

Thank you for your answers !

I tried with 15001500 matrix, the driver crashed (and results were just wrong), but the program said “914212 KB free of total 1002048 KB”.
So if a float takes 4 bytes = 32 bits, I can stock (914 212 * 1024)/32 = 29254784 floats.
I have 3 matrix of 1000
1000 floats, so 3 000 000 floats. It’s still less than 29 254 784, I don’t undestand why it can’t make the operations properly.

I’ve corrected the mistake by: [font=“Courier New”]max_thread_for_last = taille3%nbThreads[/font]. Thank you for that, even if it doesn’t affect the results, is it?

I still have the problem of divergent results, must I check the operations again? I don’t think it comes from here, because the outcome with CUDA is nearly the same than without.

I hope I can improve my program to have 100 times better performances, but I don’t see if it can help me to make this one work. I will optimize it when I understood the current troubles.

Thank you again !

Is the pgrogram crashing at the allocations or after? What is the meminfo function giving after the allocations?Did you check the memory access with cuda-memcheck?

read that

http://craftac2.free.fr/2067_GTC2010.pdf

i run your pgm take 1180 ms for cuda only the call

i change a little take 43 ms i change only cuda( i take the code of nvidia) the over change its because d ont work on my computer

so now it s100x better than cpu

look at probleme with memory not aligned etc

the idea read all 1 time in shared memory

// Test.cpp : entry point

//

#include <stdio.h>

#include <math.h>

#include <cuda.h>

#include <cuda_runtime.h>

#include <cufft.h>

#include <stdio.h>

#include <math.h>

#include <cuda.h>

#include <cuda_runtime.h>

#include <cufft.h>

#define BLOCK_SIZE 16

#define CHECK_BANK_CONFLICTS 0

#if CHECK_BANK_CONFLICTS

#define AS(i, j) cutilBankChecker(((float*)&As[0][0]), (BLOCK_SIZE * i + j))

#define BS(i, j) cutilBankChecker(((float*)&Bs[0][0]), (BLOCK_SIZE * i + j))

#else

#define AS(i, j) As[i][j]

#define BS(i, j) Bs[i][j]

#endif

#include "cutil_inline.h"

#include <time.h>

#include <cuda.h>

#include <cuda_runtime_api.h>

// Multiply Md by Nd = Pd, with "col1d" number of columns of Md, "col2d" number of columns of Nd, 

// "max_thread_for_last" number of used threads in the last block

__global__ void

MatrixMulKernel( float* C, float* A, float* B, int wA, int wB)

{

    // Block index

    int bx = blockIdx.x;

    int by = blockIdx.y;

// Thread index

    int tx = threadIdx.x;

    int ty = threadIdx.y;

// Index of the first sub-matrix of A processed by the block

    int aBegin = wA * BLOCK_SIZE * by;

// Index of the last sub-matrix of A processed by the block

    int aEnd   = aBegin + wA - 1;

// Step size used to iterate through the sub-matrices of A

    int aStep  = BLOCK_SIZE;

// Index of the first sub-matrix of B processed by the block

    int bBegin = BLOCK_SIZE * bx;

// Step size used to iterate through the sub-matrices of B

    int bStep  = BLOCK_SIZE * wB;

// Csub is used to store the element of the block sub-matrix

    // that is computed by the thread

    float Csub = 0;

// Loop over all the sub-matrices of A and B

    // required to compute the block sub-matrix

    for (int a = aBegin, b = bBegin;

             a <= aEnd;

             a += aStep, b += bStep) {

// Declaration of the shared memory array As used to

        // store the sub-matrix of A

        __shared__ float As[BLOCK_SIZE][BLOCK_SIZE];

// Declaration of the shared memory array Bs used to

        // store the sub-matrix of B

        __shared__ float Bs[BLOCK_SIZE][BLOCK_SIZE];

// Load the matrices from device memory

        // to shared memory; each thread loads

        // one element of each matrix

        AS(ty, tx) = A[a + wA * ty + tx];

        BS(ty, tx) = B[b + wB * ty + tx];

// Synchronize to make sure the matrices are loaded

        __syncthreads();

// Multiply the two matrices together;

        // each thread computes one element

        // of the block sub-matrix

        for (int k = 0; k < BLOCK_SIZE; ++k)

            Csub += AS(ty, k) * BS(k, tx);

// Synchronize to make sure that the preceding

        // computation is done before loading two new

        // sub-matrices of A and B in the next iteration

        __syncthreads();

    }

// Write the block sub-matrix to device memory;

    // each thread writes one element

    int c = wB * BLOCK_SIZE * by + BLOCK_SIZE * bx;

    C[c + wB * ty + tx] = Csub;

}

int main(int argc, char* argv[])

{

        int affichage = 99999, amplitude = 67, nb_float_stockes = 0; // Variables for display and initialisation of matrix

/* MULTIPLICATION OF MATRIX WITH CUDA */

int lig1=1000, col1=1000, taille1 = lig1*col1; // First matrix : lig1*col1

        int lig2=col1, col2=1000, taille2 = lig2*col2; // Second matrix : lig2*col2

        int taille3 = lig1 * col2; // Result : lig1 * col2

        printf("\n - Nombre de threads : %i", taille3); // Number of used threads

        int size1 = taille1*sizeof(float), size2 = taille2*sizeof(float), size3 = taille3*sizeof(float); // Memory size of each matrix

float *m, *n, *p; // Matrix

        m = new float[taille1];

        n = new float[taille2];

        p = new float[taille3];

        for(int i = 0; i < taille1; i++) // Fill first matrix

                m[i] = (float)(i%amplitude)/(float)amplitude + 0.5;

        for(int i = 0; i < taille2; i++) // Fill second matrix

                n[i] = (float)(i%amplitude)/(float)amplitude + 0.5;

nb_float_stockes += taille1 + taille2 + taille3; // Display number of floats to stock in memory

        printf("\n - Nombre d'float stockes : %i", nb_float_stockes);

float *p_m, *p_n, *p_p; // Pointers of matrix

printf("Copie..."); // Copy the matrix into GPU memory

        cudaMalloc((void**) &p_m, size1);

        cudaMalloc((void**) &p_n, size2);

        cudaMalloc((void**) &p_p, size3);

        cudaMemcpy(p_m, m, size1, cudaMemcpyHostToDevice);

        cudaMemcpy(p_n, n, size2, cudaMemcpyHostToDevice);

int nbThreads = 256, nbBlocks = (taille3 - 1)/nbThreads; // Calculation of distribution of blocks into the grid and threads into a block

        dim3 dimGrid(nbBlocks + 1, 1, 1);

        dim3 dimBlock(nbThreads, 1, 1);

                printf("\nNombre de blocks : %i, Nombre de threads par block : %i", nbBlocks + 1, nbThreads);

printf("Calcul...");

float gpu_time = 0.0f;

cudaEvent_t start, stop;

    cutilSafeCall( cudaEventCreate(&start) );

    cutilSafeCall( cudaEventCreate(&stop)  );

 cudaEventRecord(start, 0);

dim3 threads(BLOCK_SIZE, BLOCK_SIZE);

    dim3 grid(col1 / threads.x, col2 / threads.y);

// kernel warmup

              MatrixMulKernel<<< grid, threads >>>(p_m, p_n, p_p, col1, col2);

cudaEventRecord(stop, 0);

    unsigned long int counter=0;

    while( cudaEventQuery(stop) == cudaErrorNotReady )

    {

        counter++;

    }

    cutilSafeCall( cudaEventElapsedTime(&gpu_time, start, stop) );        

    printf("time spent executing by the GPU: %.6f\n", gpu_time);

printf("Copie...");

        cudaMemcpy(p, p_p, size3, cudaMemcpyDeviceToHost); // Copy into main memory

cudaFree(p_m);

        cudaFree(p_n);

        cudaFree(p_p);

        for(int i=0; i<taille3; i++) // Display of some of the results

                if(i%affichage == 0)

                        printf("\nResultat %i : %f", i, p[i]);

system("PAUSE");

/* SAME MULTIPLICATION OF MATRIX WITHOUT CUDA */

float *r; // Matrix

        r = new float[taille3];

printf("Calcul...");

        for(int i = 0; i < lig1; i++)

        {

                for(int j = 0; j < col2; j++)

                {

                        r[i*col2 + j] = 0;

                        for(int k = 0; k < col1; k++)

                                r[i*col2 + j] += m[i*col1 + k] * n[k*col2 + j];

                }

        }

for(int i=0; i<taille3; i++) // Display of some of the results

                if(i%affichage == 0)

                        printf("\nResultat %i : %f", i, r[i]);

        delete r;

system("PAUSE");

        return 0;

}

Thank you for your answers.

Cricri1, your program doesn’t work, even if I correct [font=“Courier New”]MatrixMulKernel( float* C, float* A, float* B, int wA, int wB)[/font] in [font=“Courier New”]MatrixMulKernel( float* A, float* B, float* C, int wA, int wB)[/font].

If I want a program with an optimized multiplication of matrix, I can check the example in the SDK. I search why my program doesn’t work before improve it.

Pasoleatis, the program isn’t crashing, just the display driver, apparently at this line between 2 outputs :

for(int i = 0; i < taille2; i++) // Fill second matrix

	n[i] = (float)(i%amplitude)/(float)amplitude + 0.5;

However the results are all equal to -431 602 080.000 000 ! With a try of 1200*1200.

cudaMemGetInfo diplays that there is at least 900 000 KB available of over 1 000 000 KB after the CUDA part (with a size of 10001000). If the driver crashs (with a size of 12001200 by example), there is 0 KB free of 0 KB…

-431 602 080.000 000 I ususally get this kind of numbers when I do an error and access out of bounds or uninitialized parts of the memory.

  1. If you wish to debug, you should test with smaller and smaller problems, not bigger ones that are tough to see.

  2. If cricric1’s solution does not work, you should try debugging it. That is how development is done outside academia. You take good code as model, incorporate it in yours, and thereby learn how to code. Using lot of % and / operators to compute array indices is neat and shows the mathematical thought process of the brain. But it is not the best way to implement an algorithm on a computer. Once you get a simpler and straightforward code really working, you can compare with your index calculations to see why it does not work.

  3. Reimplementing the whole stuff again removes bugs often. For example, remove the index checking as in
    if(blockIdx.x != gridDim.x || threadIdx.x < max_thread_for_last)
    Try something else like
    int nThread = blockDim.x*blockIdx.x + threadIdx.x;
    if ( nThread >= nCol1 ) return;

Pasoleatis, I agree with you. I didn’t say it but I knew.
I still have the problem of memory, but I don’t understand why. The numbers are short enough to be in integer variables, and there are enough memory space available in theory.

Elan

  1. I have the problems only with big matrix, with 22 matrix the results are the same, and the driver don’t crash. Not 8080 on some of the results…
    New : on another PC (with a GT 555M) the results are always equal, with the same program, even in 10001000. Does it come from my video card? However the video driver still crashes over 12001200 on the another PC.

  2. I tried debugging cricri1’s program by modifying some lines but I don’t see the interest of debugging another program before mine ? I am in a project to understand CUDA mechanisms, not to have a full-optimized program.

  3. Thank you for this improvement, I will implement it as soon as possible.

Thank you

Hi Natrem,

  1. Ok. I ran both your program and cricri1’s program on my PC. I have a GTX 480. I get identical results in all 4 cases, namely, yours/his on GPU/on CPU. I did not even correct max_thread_for_last, or still worse, the argument order in cricri1’s code. Still, all seem to work. The results are the same as the left side window in the picture you posted. I am stumped. One possible suspect would be that there is not enough shared memory. But you are not using any of it. You must have enough mainmemory, from the device query you posted.

  2. Debugging cricri1’s program is just an exercise. You get some practice, and may find the bug in your program when you analyze his. Of course, getting right result is the higher priority.

  3. Mine is not an improvement. Just using different logic may remove bugs that are not obvious in the first look.

I will list some things you can try. I would have done them myself, if only the problem had appeared in my program too.

  1. Check for cuda error after memory copy, kernel call etc. The function you can use call is:
void checkCUDAError(const char* msg) {

cudaError_t err = cudaGetLastError();

  if (cudaSuccess != err) {

    fprintf(stderr, "Cuda error at or before: %s: %s.\n", msg, cudaGetErrorString(err));

    exit(EXIT_FAILURE);

    getchar();

  }

}

The call can be checkCUDAError( “just copied matrix 1 to GPU” ) etc.

  1. See if the path to the SDK points to SDK version 4. In windows, the default setting pointed to an older version. Check the compute capability. In Visual Studio, you will have to use compute_20,sm_20 in Project Settings->Configuration Properties->CUDA C/C+±> Device-> Code generation. The default value was an older one, compute_10,sm_10

  2. Use the device query example in SDK to find out how many thready you can run maximum per block. If you can run your kernel with just one thread, one block. Just

MatrixMulKernel<<< 1, 1 >>>(…

  1. Look into SDK examples to find functions like cudaInitDevice() and cudaCloseDevice() . The first one has a cudaSetDevice() call, and the later cudaDeviceReset() call. Include them in your program too.

  2. You might implement just another loop the calculate the L2 norms of both the vectors, and the differences. This is just a sanity check, to see if the difference is not too big at some positions you are not looking at. But this can wait till the crash is fixed.

All the best,

Elan.

as i say in my first reply 699993 974,52063 its good

i have the same result because my card graphique can compute only float simple precision like your s i thinks
when copy gpu to memory it s become float double precision
the cpu do all in double precision

single precision have only 6 or 7 number so %g and not %f