Matrix Multiplication between CPU and GPU are different(Jcuda, Jcublas)

Hello, I’m using jcuda and jcublas (version 9.2) for matrix operation.
(Also my cuda version is 9.2)
Jama library uses cpu,
Jcublas library uses gpu

Theoretically, those matrix multiplication results are same,
but, They are different.

Do you know reason why those are different,
And how to fix(making same results) it?

Thank you,

here is test code (I used 3x3 matrix)

import Jama.Matrix;
import jcuda.*;
import jcuda.jcublas.*;

public class MatrixMultiplicationTest {

	public static void main(String[] args) {

		double alpha = 1.0;
		double beta = 0.0;
		
		double[][] dummy = { { 9, 8, 7 }, 
							 { 6, 5, 4 }, 
							 { 3, 2, 1 } };

		// Unit Matrix
		double[][] e = { { 1, 0, 0 }, 
						 { 0, 1, 0 }, 
						 { 0, 0, 1 } };

		int row = dummy.length;
		int col = dummy[0].length;

		double h_A[] = new double[row * col];
		// Column major
		h_A[0] = 1;
		h_A[1] = 0;
		h_A[2] = 0;
		h_A[3] = 0;
		h_A[4] = 1;
		h_A[5] = 0;
		h_A[6] = 0;
		h_A[7] = 0;
		h_A[8] = 1;

		double h_C[] = new double[row * col];
		// Column major
		h_C[0] = 0;
		h_C[1] = 0;
		h_C[2] = 0;
		h_C[3] = 0;
		h_C[4] = 0;
		h_C[5] = 0;
		h_C[6] = 0;
		h_C[7] = 0;
		h_C[8] = 0;

		/* Initialize JCublas */
		JCublas.cublasInit();

		
		double h_M[] = new double[row * col];
		// Column major
		h_M[0] = 9;
		h_M[1] = 6;
		h_M[2] = 3;
		h_M[3] = 8;
		h_M[4] = 5;
		h_M[5] = 2;
		h_M[6] = 7;
		h_M[7] = 4;
		h_M[8] = 1;

		JCublas.cublasInit();
		
		/* Allocate host memory for the matrices */
		Pointer d_A = new Pointer();
		Pointer d_M = new Pointer();
		Pointer d_C = new Pointer();
		
		/* Allocate device memory for the matrices */
		JCublas.cublasAlloc(row * col, Sizeof.DOUBLE, d_M);
		JCublas.cublasAlloc(row * col, Sizeof.DOUBLE, d_C);
		JCublas.cublasAlloc(row * col, Sizeof.DOUBLE, d_A);

		/* Initialize the device matrices with the host matrices */
		JCublas.cublasSetVector(row * col, Sizeof.DOUBLE, Pointer.to(h_M), 1, d_M, 1);
		JCublas.cublasSetVector(row * col, Sizeof.DOUBLE, Pointer.to(h_C), 1, d_C, 1);
		JCublas.cublasSetVector(row * col, Sizeof.DOUBLE, Pointer.to(h_A), 1, d_A, 1);
		
		Matrix matrixM = new Matrix(dummy);
		Matrix matrixC = new Matrix(e);

		int count = 0;
		
		JCublas.printMatrix(3, d_C, 3);
		
		while (count < 34) {

			JCublas.cublasDgemm('n', 'n', 3, 3, 3, alpha, d_A, 3, d_M, 3, beta, d_C, 3);
			JCublas.cublasDcopy(row*col, d_C, 1, d_A, 1);
			matrixC = matrixC.times(matrixM);
			count++;

		}
		/* Read the result back */
		JCublas.cublasGetVector(row * col, Sizeof.DOUBLE, d_C, 1, Pointer.to(h_C), 1);

		double[] x = matrixC.getColumnPackedCopy();
    
		System.out.println(x[0] - h_C[0] + " ");
		System.out.println(x[3] - h_C[3] + " ");
		System.out.println(x[6] - h_C[6] + " ");

		System.out.println(x[1] - h_C[1] + " ");
		System.out.println(x[4] - h_C[4] + " ");
		System.out.println(x[7] - h_C[7] + " ");

		System.out.println(x[2] - h_C[2] + " ");
		System.out.println(x[5] - h_C[5] + " ");
		System.out.println(x[8] - h_C[8] + " ");

		/* Memory clean up */
		JCublas.cublasFree(d_M);
		JCublas.cublasFree(d_C);

		/* Shutdown */
		JCublas.cublasShutdown();
	}
}

JAMA : JAMA: Java Matrix Package
I refered to : Floating Point and IEEE 754 :: CUDA Toolkit Documentation

cublas Dgemm doesn’t permit the input matrix to also be the output matrix.

1 Like

Thanks for your comment. I changed my code, but cpu and gpu result was also different…

You’ve changed your code in other ways, as well. You’ve now got a loop doing it 34 times, and copying the output to the input. I have no idea what you are doing. Good luck!

I updated my comment Please Check it. Thanks…!

From your post on Stackoverflow, the results do seem to be matching.

You seem to interpret “matching” as bit-wise identical results. That is not a realistic expectation. Floating-point arithmetic is not associative, so any differences in the order of evaluation will likely lead to differences in the final results. The order of execution created by common parallelization schemes (be it a GPU with CUDA, be it AVX2) will with near certainty cause differences from the execution order of a serial implementation, and the same applies to different parallelization strategies.

Also, the potential contraction of FMUL with dependent FADD into FMA (fused multiply-add) will give rise to such changes. This caveat applies equally to GPUs and modern CPUs.

NVIDIA’s floating-point whitepaper contains a worked example of a dot-product evaluation (as might be used during matrix multiplication):

https://docs.nvidia.com/cuda/floating-point/index.html

1 Like

What is meaning of “Floating-point arithmetic is not associative, so any differences in the order of evaluation will likely lead to differences in the final results.”
Do you mean value difference is not related to Floating-point arithmetic?

And how to change or set FMUL options mode in jcuda(java)?

for example in cuda toolkit documentation there are
image

If you know, Please tell me. Thank you very much :)

1 Like