Matrix Multiplication Inconsistency: different values output in every run of the matrixMul program

I have been having problems with the matrix multiplication program in the Programming Guide. I enclose the code in this posting:

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <time.h>

// Matrices will be stored in row major order
// M(row, col) = *(M.elements + row * M.width + col)
typedef struct {
int width;
int height;
float *elements;
} Matrix;

// Thread block size
#define BLOCK_SIZE 16

// Forward declaration of the matrix multiplication kernel
__global__ void MatMulKernel(const Matrix, const Matrix, Matrix);

// Matrix Multiplication - Host code
// Matrix dimensions are assumed to be multiples of BLOCK_SIZE

void matrixMulGPU(const Matrix A, const Matrix B, const Matrix C)
{
// Load A and B to device memory
Matrix d_A;
d_A.width = A.width; d_A.height = A.height;
size_t size = A.width * A.height * sizeof(float);
cudaMalloc((void**)&d_A.elements, size);
cudaMemcpy(d_A.elements, A.elements, size, cudaMemcpyHostToDevice);

Matrix d_B;
d_B.width = B.width; d_B.height = B.height;    
size = B.width * B.height * sizeof(float);
cudaMalloc((void**)&d_B.elements, size);
cudaMemcpy(d_B.elements, B.elements, size, cudaMemcpyHostToDevice);

// Allocate C in device memory
Matrix d_C;
d_C.width = C.width; d_C.height = C.height;    
size = C.width * C.height * sizeof(float);
cudaMalloc((void**)&d_C.elements, size);

//Invoke kernel 
dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
dim3 dimGrid(B.width / dimBlock.x + 1, A.height / dimBlock.y + 1);
MatMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C); 


// Read C from device memory
cudaMemcpy(C.elements, d_C.elements, size, cudaMemcpyDeviceToHost);

// Free device memory
cudaFree(d_A.elements);
cudaFree(d_B.elements);
cudaFree(d_C.elements);

cudaThreadExit();

}

// Matrix multiplication kernel called by MatMul()
__global__ void MatMulKernel(Matrix A, Matrix B, Matrix C)
{
// Each thread computes one element of C
// by accumulating results into Cvalue
float Cvalue = 0;
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;
for (int e = 0; e < A.width; ++e) {
Cvalue += A.elements[row * A.width + e]
* B.elements[e * B.width + col];
}
__syncthreads();
C.elements[row * C.width + col] = Cvalue;

}

float floatRand(float MaxVal) {
return ( (float)rand() / (float)RAND_MAX ) * MaxVal;
}

void timedate() {
time_t temps_act;
time(&temps_act);
printf("%s\n", ctime(&temps_act));
}

void matrixMulCPU(Matrix M, Matrix N, Matrix P) {

int i, j, k;

for(i=0; i<M.height; i++) {
    for(j=0; j<N.width; j++) { 
        for(k=0; k<M.width; k++) {
            P.elements[i*N.width + j] += M.elements[i*M.width + k]
                                       * N.elements[k*N.width + j];
        }
    }
}

}

int main() {

int size = 10;
int i, j;
int print = true;
Matrix A;
Matrix B;
Matrix C;
Matrix D;

A.width = size;
A.height = size;
A.elements = (float*) malloc(A.width*A.height*sizeof(float));

B.width = size;
B.height = size;
B.elements = (float*) malloc(B.width*B.height*sizeof(float));

C.width = size;
C.height = size;
C.elements = (float*) malloc(C.width*C.height*sizeof(float));

D.width = size;
D.height = size;
D.elements = (float*) malloc(D.width*D.height*sizeof(float));

printf("matrix generation\n");

srand ( time(NULL) );

for(i=0; i<size*size; i++) {
A.elements[i] = 2; //floatRand(2);
B.elements[i] = 2; //floatRand(2);
// A.elements[i] = 3;
// B.elements[i] = 3;
C.elements[i] = 0;
D.elements[i] = 0;

}

if(print) {
printf("A =\n");
for(i=0; i<A.height; i++) {
for(j=0; j<A.width; j++) {
printf("%f ",A.elements[i*size + j]);
}

     printf("\n");
  }

printf("\n\n");
printf("B =\n");
for(i=0; i<B.height; i++) {
for(j=0; j<B.width; j++) {
printf("%f ",B.elements[i*size + j]);
}
printf("\n");
}

}

 timedate();
 printf("GPU matrix multiplication\n");
 matrixMulGPU(A, B, C);
 timedate();

 timedate();
 printf("CPU matrix multiplication\n");
 matrixMulCPU(A, B, D);
 timedate();

 //print result
 if(print) {
     printf("C =\n");
     for(i=0; i<C.height; i++)  {
     for(j=0; j<C.width; j++)  {
         printf("%f ",C.elements[i*size + j]);
      }   
      printf("\n"); 
    }

    printf("\n\n");
    printf("D =\n");
    for(i=0; i<D.height; i++) {
    for(j=0; j<D.width; j++) {
        printf("%f  ",D.elements[i*size + j]);
     }
      printf("\n");

}

}

}

This program gives me different output every time that I run it. The first time it gave the correct output; the second time something different appeared in the C matrix, i.e. the result of the matrix multiplication carried out by the GPU. Those first two runs were carried out on Friday afternoon. This Monday morning I ran the program and got yet another output, again only in the C matrix. The wrong values are always in the first six columns and can appear in one, two or three rows. Every element of the result should be 40, which is what the D matrix (the result calculated by the CPU) always contains.

Why does the C matrix change every time that I run the program? I will now show the outputs from last Friday and today.

GPU matrix multiplication
Fri Dec 4 15:24:28 2009

Fri Dec 4 15:24:28 2009

CPU matrix multiplication
Fri Dec 4 15:24:28 2009

C =
40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000
40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000
40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000
40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000
40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000
40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000
40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000
40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000
40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000
40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000

CPU matrix multiplication
Fri Dec 4 15:24:47 2009

C =
40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000
40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000
116.000000 116.000000 116.000000 116.000000 116.000000 116.000000 40.000000 40.000000 40.000000 40.000000
40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000
40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000
40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000
40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000
40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000
40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000
40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000

./example
matrix generation

Mon Dec 7 08:58:56 2009

GPU matrix multiplication
Mon Dec 7 08:58:57 2009

Mon Dec 7 08:58:57 2009

CPU matrix multiplication
Mon Dec 7 08:58:57 2009

C =
40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000
40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000
40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000
40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000
36.000000 36.000000 36.000000 36.000000 36.000000 36.000000 40.000000 40.000000 40.000000 40.000000
40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000
40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000
40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000
36.000000 36.000000 36.000000 36.000000 36.000000 36.000000 40.000000 40.000000 40.000000 40.000000
40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000

Mon Dec 7 08:59:02 2009

GPU matrix multiplication
Mon Dec 7 08:59:03 2009

Mon Dec 7 08:59:03 2009

CPU matrix multiplication
Mon Dec 7 08:59:03 2009

C =
40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000
40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000
36.000000 36.000000 36.000000 36.000000 36.000000 36.000000 40.000000 40.000000 40.000000 40.000000
40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000
36.000000 36.000000 36.000000 36.000000 36.000000 36.000000 40.000000 40.000000 40.000000 40.000000
40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000
36.000000 36.000000 36.000000 36.000000 36.000000 36.000000 40.000000 40.000000 40.000000 40.000000
40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000
40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000
40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000 40.000000

Now please understand that no line of code has been changed between these runs, yet the C matrix comes out differently in each run of the program. As I stated before, the first six columns are the ones affected, and the affected rows appear at random; sometimes it is two rows, sometimes three, and sometimes only one. The erroneous values themselves also appear to be random.

The source for the program is enclosed above, but I believe the issue is with the subprogram MatMulKernel, which is, of course, the GPU kernel. Compiling it for debugging and stepping through the code works for a few lines, then you get:

(cuda-gdb) n
[CUDA execution terminated]
0x00007f72e0b731e0 in ?? () from /usr/lib/libcuda.so

Now, if you compile the program in debugging mode and emulation mode, the program hangs in the function MatMulKernel; it also hangs when you simply run it in emulation mode.

I am using Ubuntu 9.04 in 64 bit mode. The debugger says that it cannot access libcuda.so. I did not set up the software, but I do know that something about libcuda.so is mentioned in the release notes for the 32 bit version of CUDA.

What is going on here?

Newport_j

That code you have posted is full of intermingled debugger output. I doubt anybody could compile it even if they wanted to. If you want to post code, post it in code boxes, because as it stands, it is completely unreadable, both for machine and human.

I have the code in my attached message folder. How do I get it to display so that others can see it? That copy of the code has no debugger info in it.

Newport_j

Just copy and paste it into a code box.

OK. How do I open a code box? I think that I can handle everything past that.

Respectfully,

newport_j

Just reply and paste the code into the reply. Then put <code> at the top and </code> at the bottom of the code, but replace the < and > with square brackets (I can’t post the actual markup because the software will turn it into a code box). Like this one:

A code box.
#include <stdio.h>

#include <stdlib.h>

#include <cuda.h>

#include <cuda_runtime.h>

#include <time.h>

// Matrices will be stored in row major order

// M(row, col) = *(M.elements + row * M.width + col)

typedef struct  {

	   int width;

	   int height;

	   float *elements;

} Matrix;

// Thread block size

#define BLOCK_SIZE 16

// Forward declaration of the matrix multiplication kernel

__global__ void MatMulKernel(const Matrix, const Matrix, Matrix);

// Matrix Multiplication - Host code

// Matrix dimensions are assumed to be multiples of BLOCK_SIZE

void matrixMulGPU(const Matrix A, const Matrix B, const Matrix C)

{

	// Load A and B to device memory

	Matrix d_A;

	d_A.width = A.width; d_A.height = A.height;	

	size_t size = A.width * A.height * sizeof(float);

	cudaMalloc((void**)&d_A.elements, size);

	cudaMemcpy(d_A.elements, A.elements, size, cudaMemcpyHostToDevice);

	Matrix d_B;

	d_B.width = B.width; d_B.height = B.height;	

	size = B.width * B.height * sizeof(float);

	cudaMalloc((void**)&d_B.elements, size);

	cudaMemcpy(d_B.elements, B.elements, size, cudaMemcpyHostToDevice);

	// Allocate C in device memory

	Matrix d_C;

	d_C.width = C.width; d_C.height = C.height;	

	size = C.width * C.height * sizeof(float);

	cudaMalloc((void**)&d_C.elements, size);

	//Invoke kernel 

	dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);

	dim3 dimGrid(B.width / dimBlock.x + 1, A.height / dimBlock.y + 1);

	MatMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C); 

	// Read C from device memory

	cudaMemcpy(C.elements, d_C.elements, size, cudaMemcpyDeviceToHost);

	// Free device memory

	cudaFree(d_A.elements);

	cudaFree(d_B.elements);

	cudaFree(d_C.elements);

	cudaThreadExit();

}

// Matrix multiplication kernel called by MatMul()

__global__ void MatMulKernel(Matrix A, Matrix B, Matrix C)

{

	// Each thread computes one element of C

	// by accumulating results into Cvalue

	float Cvalue = 0;

	int row = blockIdx.y * blockDim.y + threadIdx.y;

	int col = blockIdx.x * blockDim.x + threadIdx.x;

	for (int e = 0; e < A.width; ++e) {

			Cvalue += A.elements[row * A.width + e]

					* B.elements[e * B.width + col];

	} 

	__syncthreads();

	C.elements[row * C.width + col] = Cvalue;

}

float floatRand(float MaxVal) {

	return ( (float)rand()  / (float)RAND_MAX ) * MaxVal;

}

void timedate() {

	time_t temps_act;

	time(&temps_act);

	printf("%s\n", ctime(&temps_act));

}

void matrixMulCPU(Matrix M, Matrix N, Matrix P) {

	int i, j, k;

	for(i=0; i<M.height; i++) {

		for(j=0; j<N.width; j++) { 

			for(k=0; k<M.width; k++) {

				P.elements[i*N.width + j] += M.elements[i*M.width + k]

										   * N.elements[k*N.width + j];

			}

		}

	}

}

int main()  {

int size = 10;

   int i, j;

   int print = true;

   Matrix A;

   Matrix B;

   Matrix C;

   Matrix D;

A.width = size;

   A.height = size;

   A.elements = (float*) malloc(A.width*A.height*sizeof(float));

B.width = size;

   B.height = size;

   B.elements = (float*) malloc(B.width*B.height*sizeof(float));

C.width = size;

   C.height = size;

   C.elements = (float*) malloc(C.width*C.height*sizeof(float));

D.width = size;

   D.height = size;

D.elements = (float*) malloc(D.width*D.height*sizeof(float));

printf("matrix generation\n");

srand ( time(NULL)  );

for(i=0; i<size*size; i++)  {

	   A.elements[i] = 2; //floatRand(2);

	   B.elements[i] = 2; //floatRand(2);

//	   A.elements[i] = 3;

//	   B.elements[i] = 3;

	   C.elements[i] = 0;

	   D.elements[i] = 0;

	}

if(print) {

	  printf("A =\n");

	  for(i=0; i<A.height; i++) {

		  for(j=0; j<A.width; j++) {

			printf("%f ",A.elements[i*size + j]);

		  }

		 printf("\n");

	  }

	  printf("\n\n");

	  printf("B =\n");

	  for(i=0; i<B.height; i++) {

		  for(j=0; j<B.width; j++) {

			printf("%f ",B.elements[i*size + j]);

		  }

		 printf("\n");

	  }

	}   

	 timedate();

	 printf("GPU matrix multiplication\n");

	 matrixMulGPU(A, B, C);

	 timedate();

	 timedate();

	 printf("CPU matrix multiplication\n");

	 matrixMulCPU(A, B, D);

	 timedate();

	 //print result

	 if(print) {

		 printf("C =\n");

		 for(i=0; i<C.height; i++)  {

		 for(j=0; j<C.width; j++)  {

			 printf("%f ",C.elements[i*size + j]);

		  }   

		  printf("\n"); 

		}

	   

		printf("\n\n");

		printf("D =\n");

		for(i=0; i<D.height; i++) {

		for(j=0; j<D.width; j++) {

			printf("%f  ",D.elements[i*size + j]);

		 }

		  printf("\n");

	}

  }

}

Now, compiling and running this code causes problems, as described in the first message in this thread. The C matrix will be correct some of the time (each element equals 40) and incorrect most of the time; in the bad cases the first 6 elements in certain rows are wrong, and I am not sure why. When I step through MatMulKernel it crashes part of the way through. If I compile for emulation, the program hangs in MatMulKernel; if I step through the emulated code in the CUDA debugger, it again hangs in MatMulKernel. I just do not know what is causing the problem. I expect a program to run and give the same answer every time if its inputs are the same. It makes little sense to have a program behave like this one, yet it does.

Newport_j

Almost certainly an out of bounds memory access problem in the GPU. It is very easy to get random results when code reads from memory whose contents might be random or uninitialised.

Edit: And that is exactly what is happening here. Ask yourself what the kernel will do with width=10, height=10 and blocksize(16,16)…

I do not see it. I have tried walking through the code and whatever it is, it is not obvious. So just where is the error? I cannot afford the time to look for it any longer.

Newport_j

You have reserved 100 floats (10x10) for your matrices. You are telling the kernel to use a 16x16 block. That means that the kernel is reading and writing 256 floats from/to memory containing only 100 valid values.
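Spelling out the launch configuration from the posted code makes the numbers concrete:

// dimBlock = (16, 16)                          -> 256 threads per block
// dimGrid  = (10/16 + 1, 10/16 + 1) = (1, 1)   -> 1 block launched
// 256 threads run, but only 10 x 10 = 100 of them index valid elements;
// the other 156 compute row/col values up to 15 and read and write well
// past the end of the 100-float allocations for A, B and C.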

To expand a little further: to make this work for arbitrary matrix sizes, rather than only for even multiples of the block size, you need to either add guarding logic to prevent threads which fall outside the matrix dimensions from participating in the calculations, or zero pad the input matrices to even multiples of the block size. The latter will be faster than the former for non-trivial sizes, but it will use more storage and require an additional zeroing operation on the padding before the kernel is called. A sketch of the guarding approach follows.
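For illustration only, here is an untested sketch of what the guarded version of the kernel posted in this thread might look like; the only change is the early return for threads that fall outside the matrix:

__global__ void MatMulKernel(Matrix A, Matrix B, Matrix C)
{
	int row = blockIdx.y * blockDim.y + threadIdx.y;
	int col = blockIdx.x * blockDim.x + threadIdx.x;

	// Guard: with 10x10 matrices and a 16x16 block, 256 threads run but only
	// 100 of them map to real elements; the rest must not touch memory at all.
	if (row >= C.height || col >= C.width)
		return;

	float Cvalue = 0;
	for (int e = 0; e < A.width; ++e)
		Cvalue += A.elements[row * A.width + e] * B.elements[e * B.width + col];

	C.elements[row * C.width + col] = Cvalue;
}

(The __syncthreads() in the original kernel is not needed here; no shared memory is involved, and a barrier after a divergent early return would be unsafe anyway.)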

This sort of detail is absolutely critical if you are going to write your own kernel code. I would suggest you continue looking at this until you understand how to fix it, because otherwise I doubt you will make much headway.

What you say makes sense. So how do I change my code so that it outputs correctly and consistently?

I am reading the David Kirk and Wen-mei Hwu draft of CUDA programming for a course at U of I. In Chapter 3 they discuss this. What is the relationship between grids and blocks? Is there a direct one? It seems that I have implemented code that I found online, but it is also in the Programming Guide 2.3.1.

How does one select the block size? It seems the argument is circular; I think the independent choice here is the block size.

Newport_j

That code contains perfectly valid block and grid size selection criteria already. There is no need to do anything to change it. In general, block size is set to optimize hardware resource usage (registers, shared memory, occupancy), and grid size is chosen to cover the total amount of work that needs to be done. The “corner” case is what happens when your problem size doesn’t fit into round multiples of the block/grid size.
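For what it is worth, the usual idiom for "enough blocks to cover the work" is a rounded-up division; the width/BLOCK_SIZE + 1 in the posted code covers the work too, it just launches one extra row/column of blocks when the size is already an exact multiple. A sketch:

dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
dim3 dimGrid((B.width  + dimBlock.x - 1) / dimBlock.x,   // ceil(width  / block width)
             (A.height + dimBlock.y - 1) / dimBlock.y);  // ceil(height / block height)
MatMulKernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C);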

As I explained above, you must do one of two things to fix this: either modify the kernel to prevent threads which would operate outside the dimensions of the matrix from doing anything (ie. enclose the calculations in a suitable if statement) or pad the input matrix with zeros so that the matrix size is a round multiple of the block size. The choice is entirely up to you. Both are trivial to implement.
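A sketch of the padding alternative, using the Matrix struct from the posted code (the cudaMemcpy2D call copies the original rows into the wider, zeroed device rows; only the handling of d_A is shown):

// Round the dimensions up to the next multiple of BLOCK_SIZE
int padW = ((A.width  + BLOCK_SIZE - 1) / BLOCK_SIZE) * BLOCK_SIZE;
int padH = ((A.height + BLOCK_SIZE - 1) / BLOCK_SIZE) * BLOCK_SIZE;

Matrix d_A;
d_A.width = padW; d_A.height = padH;
cudaMalloc((void**)&d_A.elements, padW * padH * sizeof(float));
cudaMemset(d_A.elements, 0, padW * padH * sizeof(float));   // zero the padding
cudaMemcpy2D(d_A.elements, padW * sizeof(float),            // device row pitch (bytes)
             A.elements,   A.width * sizeof(float),         // host row pitch (bytes)
             A.width * sizeof(float), A.height,             // bytes per row, number of rows
             cudaMemcpyHostToDevice);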

I am just curious: I got this code (most of it) off the internet, and I am wondering just what the dimensions of matrices A, B, C, and D are.

At first I thought they were two dimensional, but now I believe they are one dimensional…

Newport_j

If it wasn’t obvious from the code itself, it contains a plain English comment that answers your question. They are matrices (which are, by definition, two dimensional) stored in row major order (i.e. in linear one dimensional memory):

// Matrices will be stored in row major order

// M(row, col) = *(M.elements + row * M.width + col)

typedef struct  {

	   int width;

	   int height;

	   float *elements;

} Matrix;
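In other words, element (row, col) of M lives at offset row * M.width + col in the one dimensional elements array. A pair of purely illustrative helpers (not part of the posted code) spells the layout out:

float getElement(const Matrix M, int row, int col) {
	return M.elements[row * M.width + col];
}

void setElement(Matrix M, int row, int col, float value) {
	M.elements[row * M.width + col] = value;
}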

I noticed that if I run the above code several times (certainly more than once), then when I step through the CUDA kernel on a second run, the values in the C matrix are still there from the previous run. Thus, when I initialize everything in main, C.elements still holds the values from the previous run; they are not the new initial values which I have set to 0 in main. I believe this must have something to do with why it gives me different values in each run: the C element values (call them residual values) are being carried over from a previous run. That would explain why the first run is right and why errors are seen only on the second or third or later runs.

How does one initialize variables in the CUDA kernel? I take it they must be initialized there if they are used there.

Any help appreciated.

Respectfully,

James M. Yunker

Have you actually tried to read the code?

If you look at matrixMulGPU(), you will see that only the contents of the A and B matrices are ever copied to the device. The C matrix isn’t; it is only copied back after the kernel has executed. So nothing you do to the contents of the C matrix on the host has any effect on the contents of the C matrix on the device. malloc and cudaMalloc only reserve memory; they don’t zero it or otherwise initialise it. It is absolutely normal C language behaviour to find random garbage, including the contents of previous memory allocations or previous program runs, in that memory. If you don’t want that, then you need to explicitly initialise it.
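If you want the device-side C to start from a known state, something along these lines (a sketch against the matrixMulGPU() code above) would do it:

cudaMalloc((void**)&d_C.elements, size);
cudaMemset(d_C.elements, 0, size);   // zero the freshly allocated device buffer
// ...or mirror what is done for A and B and copy the host-initialised C down:
// cudaMemcpy(d_C.elements, C.elements, size, cudaMemcpyHostToDevice);

With a correctly guarded kernel every element of d_C is overwritten anyway, so this mostly matters while debugging.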

It has been a while since I took compilers, but I remember from somewhere in there that a good compiler initializes every variable (usually to zero) before program execution begins. Still, it is good programming practice to initialize all variables that will be used in a program, just in case.

Now, whenever I initialized variables in a program, I did it to get rid of random junk that might still be in memory. In the case we have here, there is no random junk; there is specific junk, namely the old value of the very same variable showing up again in the next run. That seems very odd. I have looked through some old programs (mine and others') and yes, they do initialize variables, though it is done in a wholesale manner.

It seems very odd that the new variable has the old value here. That is my only question, nothing more.

Why is it so hard to believe that successive runs of a program on a device which is otherwise completely unoccupied will wind up allocating variables out of the same memory and permit you to see the contents from previous runs? Imagine a hypothetical program that requires 10 pages of memory, and a driver/memory manager that always allocates from the bottom of the address space. Run the program multiple times, and if there is no other program requesting memory, logic dictates that the program will be allocated those same 10 pages on successive runs.

The C language takes a very minimal approach to memory initialization. If the programmer wants memory initialized, zeroed, or anything else at runtime, it must be initialized explicitly by the programmer. Neither the compiler nor the runtime will do it for you. Similarly, if the programmer doesn’t want the next user of some memory to be able to see its contents, then it must be explicitly erased prior to release. This isn’t a matter of "good" compilers versus "bad" compilers; it is behaviour enshrined in the ANSI/ISO C standards, clearly spelled out in the first edition of the K&R book, and specifically mentioned in some of the Bell Laboratories language documentation and design reports dating from the early 1970s. It is done for efficiency reasons, because memory initialization brings with it a very large overhead, and it is often not necessary anyway. For example, why bother wasting cycles zeroing a large block of memory when the first program action is to use I/O to copy from disk (or, perhaps in the CUDA case, data from the PCI-e bus) into that memory?
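A small plain C illustration of that last point (an illustrative fragment, not taken from the posted program):

#include <stdlib.h>
#include <string.h>

void example(void) {
	float *p = (float*) malloc(100 * sizeof(float));  /* contents indeterminate */
	float *q = (float*) calloc(100, sizeof(float));   /* zeroed by the library  */
	memset(p, 0, 100 * sizeof(float));                /* or zero it explicitly  */
	free(p);
	free(q);
}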

In your example, you are using a source debugger to inspect memory that has not been initialised, either by the compiler/runtime or by the programmer. The behaviour in such a situation is undefined. By design.