cublasZgemm() gives incorrect results for large data, and a potential bug

Hi guys, my test case is computing y = x'*x, where x is a 56000 x 1 column vector of complex elements and x' is its conjugate transpose. The result should be a real number, with zero imaginary part. The test works fine for a small example vector. However, with a large vector produced from MATLAB, MATLAB gives the correct result while the CUDA code gives the correct real part but an incorrect imaginary part (not just != 0, but far greater than 0). For verification, I exported this vector from MATLAB to two files, one for the CUDA code and one to read back into MATLAB, then processed the same data set in both; the result is still as described above. Can someone spot an error in my code, or is this caused by a difference between the GPU's and CPU's handling of double-precision arithmetic? Below are the C++ and MATLAB codes for testing:
C++:

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>
#include <cublas_v2.h>
#include <math.h>
#include <stdlib.h>
#include <string.h>

bool ReadFile(char* MyFile, cuDoubleComplex *array){
	FILE* fi = fopen(MyFile, "rt");
	if (fi == NULL)
		return 1;
	int count = 0;
	// Loop on fscanf's return value; a feof() loop would process the
	// last line twice, because EOF is only set after a failed read.
	while (count < 56000 &&
		fscanf(fi, "%lf %lf", &array[count].x, &array[count].y) == 2){
		++count;
	}
	fclose(fi);
	// Return 0 (success) only if all 56000 elements were read.
	return count != 56000;
}

int main()
{
	cuDoubleComplex *h_x = new cuDoubleComplex[56000];
	// Stop program if any read op error
	if (ReadFile("myFile6.txt", h_x)){
		printf("Read file fail!");
		return 1;
	}
	
	cuDoubleComplex h_y;
	cuDoubleComplex alpha;
	alpha.x = 1; alpha.y = 0;
	cuDoubleComplex beta;
	beta.x = 0; beta.y = 0;

	cuDoubleComplex *d_x = NULL;	// initialized so cudaFree() in Error: is safe
	cuDoubleComplex *d_y = NULL;

	cudaError_t cudaStatus;
	cublasStatus_t stat;
	cublasHandle_t handle;

	// Choose which GPU to run on, change this on a multi-GPU system.
	cudaStatus = cudaSetDevice(0);
	if (cudaStatus != cudaSuccess) {
		fprintf(stderr, "cudaSetDevice failed!  Do you have a CUDA-capable GPU installed?");
		goto Error;
	}

	// Allocate GPU buffers for the input vector and the scalar result.
	cudaStatus = cudaMalloc((void**)&d_x, 56000 * sizeof(cuDoubleComplex));
	if (cudaStatus != cudaSuccess) {
		fprintf(stderr, "cudaMalloc failed!");
		goto Error;
	}
	cudaStatus = cudaMalloc((void**)&d_y, sizeof(cuDoubleComplex));
	if (cudaStatus != cudaSuccess) {
		fprintf(stderr, "cudaMalloc failed!");
		goto Error;
	}
	// cuBLAS handler
	stat = cublasCreate(&handle);
	if (stat != CUBLAS_STATUS_SUCCESS) {
		printf("CUBLAS initialization failed\n");
		return EXIT_FAILURE;
	}

	// Copy input vectors from host memory to GPU buffers.
	cudaStatus = cudaMemcpy(d_x, h_x, 56000 * sizeof(cuDoubleComplex), cudaMemcpyHostToDevice);
	if (cudaStatus != cudaSuccess) {
		fprintf(stderr, "cudaMemcpy failed!");
		goto Error;
	}

	stat = cublasZgemm_v2(handle, CUBLAS_OP_C, CUBLAS_OP_N,
		1, 1, 56000,
		&alpha, d_x, 56000,
		d_x, 56000, &beta,
		d_y, 1);

	// Copy output vector from GPU buffer to host memory.
	cudaStatus = cudaMemcpy(&h_y, d_y, sizeof(cuDoubleComplex), cudaMemcpyDeviceToHost);
	if (cudaStatus != cudaSuccess) {
		fprintf(stderr, "cudaMemcpy failed!");
		goto Error;
	}

	printf("y = x'*x = %f %fi\n",
		h_y.x, h_y.y);

Error:
	cudaFree(d_x);
	cudaFree(d_y);

	// cudaDeviceReset must be called before exiting in order for profiling and
	// tracing tools such as Nsight and Visual Profiler to show complete traces.
	cudaStatus = cudaDeviceReset();
	if (cudaStatus != cudaSuccess) {
		fprintf(stderr, "cudaDeviceReset failed!");
		return 1;
	}

	return 0;
}

Matlab:

fid = fopen ('myFile7.txt','r'); 
ii  = 1;
while ~feof(fid)
    x(ii, :) = str2num(fgets(fid));
    ii = ii + 1;
end
fclose(fid);
y = x'*x;

Thanks so much for your support!
myFile7.txt (2.92 MB)
myFile6.txt (2.9 MB)

what compile command are you using to compile the CUDA code?

what GPU are you running on?

which CUDA version are you using?

what is the OS you are running on?

I’m not familiar with compiling a program via the command prompt. I simply followed the instructions here: https://www.youtube.com/watch?v=2EbHSCvGFM0
to create a project --> modified/added some attributes in the project properties (Additional Include Directories, Code Generation, Additional Dependencies…) to make it the same as Samples*.sln

Mine is a Quadro K4200 and a GeForce GT740M (for an alternate comparison)

I installed CUDA Toolkit 9.1.85 (update 3) and am working in Visual Studio 2013 Ultimate Update 5

Windows 7 x64 for the K4200 and Windows 8.1 x64 for the GT740M

Hi everyone!
I have tested it with a different function and this time it gave another result (the real part is still correct and the imaginary part still wrong, but slightly more accurate).
Instead of treating this as a multiplication of a 1-by-56000 matrix with a 56000-by-1 matrix using cublasZgemm(), I use cublasZgemv() and treat it as a multiplication of a 1-by-56000 matrix with a vector. Here is my code for that:

stat = cublasZgemv_v2(handle, CUBLAS_OP_C,
		56000, 1,
		&alpha, d_x, 56000,
		d_x, 1, &beta,
		d_y, 1);

I attach both captured results below.
On the other hand, I may have discovered a serious bug in most of the provided functions in the cuBLAS library. It could be that my usage of those functions is wrong, so let me explain and give an example:
For the functions whose scalar parameters alpha/beta can reportedly be passed from either host or device memory (as described in the cuBLAS library doc), they only work if these parameters are passed from host memory (e.g., &alpha/&beta in my code). If I transfer them to device memory and pass them as normal pointers d_alpha/d_beta, they give the error shown in my captured pic below. I’ve tested this with Zgemm(), Zgemv() and Zaxpy(); all of them show the same issue.
I post it here so that everyone can confirm my speculation before I file a bug. Here is a simple example code to verify my finding:

#include "cuda_runtime.h"
#include "device_launch_parameters.h"

#include <stdio.h>
#include <cublas_v2.h>
#include <math.h>
#include <stdlib.h>


int main()
{
	cuDoubleComplex *h_x = new cuDoubleComplex[4];
	for (int i = 0; i < 4; i++){
		h_x[i].x = i; h_x[i].y = i + 1;
	}
	cuDoubleComplex *h_y = new cuDoubleComplex[4];
	for (int i = 0; i < 4; i++){
		h_y[i].x = 0; h_y[i].y = 0;
	}
	cuDoubleComplex h_alpha;
	h_alpha.x = 5; h_alpha.y = 0;
	cuDoubleComplex *d_alpha = NULL;	// initialized so cudaFree() in Error: is safe

	cuDoubleComplex *d_x = NULL;
	cuDoubleComplex *d_y = NULL;

	cudaError_t cudaStatus;
	cublasStatus_t stat;
	cublasHandle_t handle;

	// Choose which GPU to run on, change this on a multi-GPU system.
	cudaStatus = cudaSetDevice(0);
	if (cudaStatus != cudaSuccess) {
		fprintf(stderr, "cudaSetDevice failed!  Do you have a CUDA-capable GPU installed?");
		goto Error;
	}

	// Allocate GPU buffers for the two vectors and the device-side scalar.
	cudaStatus = cudaMalloc((void**)&d_x, 4 * sizeof(cuDoubleComplex));
	if (cudaStatus != cudaSuccess) {
		fprintf(stderr, "cudaMalloc failed!");
		goto Error;
	}
	cudaStatus = cudaMalloc((void**)&d_y, 4 * sizeof(cuDoubleComplex));
	if (cudaStatus != cudaSuccess) {
		fprintf(stderr, "cudaMalloc failed!");
		goto Error;
	}
	cudaStatus = cudaMalloc((void**)&d_alpha, sizeof(cuDoubleComplex));
	if (cudaStatus != cudaSuccess) {
		fprintf(stderr, "cudaMalloc failed!");
		goto Error;
	}
	// cuBLAS handler
	stat = cublasCreate(&handle);
	if (stat != CUBLAS_STATUS_SUCCESS) {
		printf("CUBLAS initialization failed\n");
		return EXIT_FAILURE;
	}

	// Copy input vectors from host memory to GPU buffers.
	cudaStatus = cudaMemcpy(d_x, h_x, 4 * sizeof(cuDoubleComplex), cudaMemcpyHostToDevice);
	if (cudaStatus != cudaSuccess) {
		fprintf(stderr, "cudaMemcpy failed!");
		goto Error;
	}
	cudaStatus = cudaMemcpy(d_y, h_y, 4 * sizeof(cuDoubleComplex), cudaMemcpyHostToDevice);
	if (cudaStatus != cudaSuccess) {
		fprintf(stderr, "cudaMemcpy failed!");
		goto Error;
	}
	cudaStatus = cudaMemcpy(d_alpha, &h_alpha, sizeof(cuDoubleComplex), cudaMemcpyHostToDevice);
	if (cudaStatus != cudaSuccess) {
		fprintf(stderr, "cudaMemcpy failed!");
		goto Error;
	}

	// Changing &h_alpha to d_alpha produces an error here!
	stat = cublasZaxpy_v2(handle, 4,
		&h_alpha, d_x, 1,
		d_y, 1);

	// Copy output vector from GPU buffer to host memory.
	cudaStatus = cudaMemcpy(h_y, d_y, 4 * sizeof(cuDoubleComplex), cudaMemcpyDeviceToHost);
	if (cudaStatus != cudaSuccess) {
		fprintf(stderr, "cudaMemcpy failed!");
		goto Error;
	}

	printf("y = alpha*x + y = [%f %fi, %f %fi, %f %fi, %f %fi]\n",
		h_y[0].x, h_y[0].y, h_y[1].x, h_y[1].y, h_y[2].x, h_y[2].y, h_y[3].x, h_y[3].y);

Error:
	cudaFree(d_x);
	cudaFree(d_y);
	cudaFree(d_alpha);

	// cudaDeviceReset must be called before exiting in order for profiling and
	// tracing tools such as Nsight and Visual Profiler to show complete traces.
	cudaStatus = cudaDeviceReset();
	if (cudaStatus != cudaSuccess) {
		fprintf(stderr, "cudaDeviceReset failed!");
		return 1;
	}

	return 0;
}

Zgemv.PNG
Zgemm.PNG
BigBug.PNG

Yes, by default, scalar parameters in CUBLAS are expected to be resident in host memory.

You need to make some adjustments if you want them to originate in device memory:

http://docs.nvidia.com/cuda/cublas/index.html#scalar-parameters
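For reference, the switch is cublasSetPointerMode(). A minimal sketch, reusing the handle and the d_alpha, d_x, d_y buffers from the example above (this fragment assumes they are already allocated and filled, and needs a CUDA-capable machine to run):

```cpp
// Tell cuBLAS to read scalar parameters (alpha/beta) from device memory.
cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_DEVICE);

// A device pointer for the scalar is now legal:
stat = cublasZaxpy_v2(handle, 4, d_alpha, d_x, 1, d_y, 1);

// Restore the default before any call that passes host-side scalars
// such as &alpha / &h_alpha.
cublasSetPointerMode(handle, CUBLAS_POINTER_MODE_HOST);
```

Note that the pointer mode is a property of the handle, so it applies to every subsequent call until it is changed again.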

Regarding the original issue, I am starting to think this is just testing the limits of double precision. The result should be exactly zero for the imaginary part, but since the scale of operations here is producing a result that is on the order of 1e24:

octave/matlab:

y = 1.4835e+24

CUBLAS:

y = x’*x = 1483539671550920342110208.000000 4353670.387162i

The “residual” in the imaginary part of the CUBLAS result is on the order of 1e6, i.e. about 18 orders of magnitude smaller than the real component.

I suspect this is due to the limits of double precision arithmetic, along with order-of-operations, which will be different between GPU and CPU, as well as different between differing GPU implementations.

You may wish to file a bug at developer.nvidia.com

However I’m not sure this is a bug.

Since this is effectively a complex vector dot-product, we are multiplying each element by its complex conjugate, and then summing the results over a vector length of 56000. The average “error” in this case for the imaginary part is ~77 per element (4353670.387162/56000)

An element in your input file looks like this:
172882035.799001870000000 -153707310.242315410000000

In my experience, double-precision arithmetic is good for about 10-12 digits of precision. Doing a complex multiply on the above element involves 2 sets of 2 partial products, added together. Each partial product will be on the order of 1e8*1e8, i.e. 1e16. If we have an average error of 77 per element, this is completely plausible in my view, when summing partial products on the order of 1e16.

Hi txbob,
thanks so much for your explanation! I think it would be better if the documentation included information about this “switch” in the function descriptions (at least a note or a link) to warn newbies who might jump in in the middle, like me! :)

About the error from double-precision arithmetic, I agree that it is to be expected for such a large data set. It actually causes a variation of 1e-3 to 1e-2 in my final result compared with MATLAB.
I wonder if there is a workaround for this besides explicitly assigning zero to the imaginary part?

Oh, and I don’t understand what the line “Notice that the last (two) equations reflect 1-based indexing used for compatibility with Fortran.” in some function descriptions of the cuBLAS lib doc means (e.g. asum() or axpy()…). Which/what equation(s)?