Strange "unspecified launch error" from a call to cublas gemm

lyskule · January 17, 2019, 8:19pm

Hi there!

I encountered a really strange error with a host-to-device memcpy.

There’s no error thrown on cudaMalloc, and the size of the device memory is big enough to hold the data being transferred.

The real strange part is, it encounters no error in the memory checker (VS2015, enabled memory checking, ran in the CUDA debugging mode).

I did once see an error like “grid launch failure” from a cuBLAS function, but that happened only once.

The TDR is set to 3600, so it shouldn’t cause a problem here.

Some system info FWIW, Win10, CUDA10, VS2015, 780Ti card.

Thanks a lot for the help!

njuffa · January 17, 2019, 8:36pm

An “unspecified launch failure” is an error triggered by a CUDA kernel that was terminated abnormally. Errors are sticky until cleared, so if you encounter this error on a cudaMemcpy() that means you are missing proper error checking for a kernel launch upstream from the copy call.

“unspecified launch failure” could be due to an access out of bounds, hitting a GUI watchdog time limit, or any other error condition (thus “unspecified”). An error condition could be transient, such as a cosmic ray knocking out a bit in GPU memory. It could also involve indeterminate behavior, e.g. when there is a race condition.

The key to debugging is reproducibility. It sounds like you are unable to reproduce the error. You could still beef up the error checking for kernel launches and check for race conditions. Note that cuda-memcheck cannot detect all race conditions, and that code behavior may differ between debug and release builds.

saulocpp · January 17, 2019, 9:00pm

In before @tera leaves his signature here.

tera · January 17, 2019, 9:16pm

Why would I want to leave my signature here?

saulocpp · January 17, 2019, 9:45pm

How are your cudaMallocs going?

lyskule · January 17, 2019, 10:01pm

I indeed checked all cuda launches, but there seemed to be no error whatsoever. Let me briefly describe what I’m trying to do here. (some pseudo code here)

cudaMalloc(a, size1);
cudaMalloc(b, size2);

while(iter-- > 0) {
  cudaMemcpyHost2Device(h_data, a, size1);
  cudaMemcpyHost2Device(h_data, b, size2);
  
  cublas_sgemm(a, b);
  
  cublas_sgemm(b, a);   <--------- THIS ONE
  cudaPeekAtLastError();
  cudaDeviceSync();
}

I don’t think there’s a race, as everything runs on the default stream, so memcpy and compute shouldn’t overlap.

cublas_sgemm(b, a) (on the fourth last line) is doing “good” as I checked its return code. BUT, everything runs smoothly with this line commented out, so this must be the issue.

I guess it might be something with the matrix sizes, but not quite sure.

Any suggestion welcome!

lyskule · January 17, 2019, 10:19pm

On the second gemm, size (m,k,n) 17408 x 32768 x 128 works, but 17408 x 16384 x 128 doesn’t…

Robert_Crovella · January 17, 2019, 10:27pm

cublas gemm is an asynchronous call. Therefore checking its return code doesn’t tell you if there was an execution error.

Run your code with cuda-memcheck

njuffa · January 17, 2019, 10:30pm

Is this observation solidly reproducible, i.e. for a given set of parameters, does it always fail at the same iteration? After how many iterations? What are the SGEMM parameters used? Can you show your actual code? The issue may originate in code that you haven’t shown. Maybe your TDR control doesn’t work as intended. You might want to plot matrix size and execution time for increasing size and see where it fails.

I think it is fairly safe to exclude a bug in cublasSgemm. This is one of the most heavily used functions across all of NVIDIA’s CUDA libraries, so it sees extensive testing both prior to release and through practical use in the field. What version of CUDA do you use?

If the failures occur kind of randomly, one might hypothesize about any number of failure mechanisms. What kind of GPU is this? GPUs age like all electronics, and eventually they become flaky, then die. Usually the DRAM on the board is the first component to fail, but I have seen a few failures of the actual GPU chip (in big farms of GPUs that operate continuously for years).

Is there adequate cooling and power supply for the GPU? Insufficient power supply can lead to temporary local voltage drops (“brown outs”) which cause transistors to slow down and processors to malfunction. Is the GPU overclocked, that is, running at higher clocks than NVIDIA’s reference design? Some GPUs are already massively overclocked by the vendor. My recommendation is to avoid these models for CUDA work.

Failures due to overclocking can be very difficult to diagnose: I once operated an overclocked math co-processor which seemed to be doing fine, except that it occasionally produced a wrong square root result. And best I could tell, it was only the square root operation that was affected.

Note that overclocking and aging can work in combination. As the electronic components age (e.g. through electromigration, hot carrier injection), they operate more slowly. For this reason electronic components are manufactured with an engineering margin, often 15%-20%, so that they work flawlessly over the intended life span, say five years. Overclocking exploits that engineering margin, leaving razor-thin margins that are eaten up after a year or two of use: at that point the part fails.

[Later:] I notice belatedly that the OP already stated that the GPU here is a Geforce 780 Ti. This jogged my memory and I recalled that the AMBER folks found that particular model to be flaky even when new. For example, discussed here:

http://archive.ambermd.org/201411/0374.html

My memory is hazy, but I think the issue was eventually chalked up to bad GPU memory timing parameters.

lyskule · January 17, 2019, 10:40pm

It’s indeed the cudaDeviceSync() that errored out.
Strangest thing, it runs fine in cuda-memcheck (VS2015, memcheck enabled, pressed “start CUDA debugging”).

TDR is behaving as expected; I set it a long time ago and it has worked well so far (I’ve had things run for over 10 minutes). And yes, I agree this should not be cuBLAS’s problem…

It fails on the first iteration, at the second blas call, always. But you are right, it’s an old card, 780Ti, but it’s not overclocked as such.

My code is a bit messy, let me clean it up before posting it here.

lyskule · January 17, 2019, 11:25pm

Here’s the wrapper for cublas, if it helps. Definition of float_tensor is made up, please don’t mind. This function deals with row major matrices. That is, I have normal row-major matrices sitting in host memory, after I do the computation on GPU and copy it back, it should still be row-major on CPU.

struct float_tensor {
	float_tensor(int rows, int cols): size({rows, cols}) {
		checkedCall(cudaMalloc(&addr, size[0] * size[1] * sizeof(float)));
	}

	float *addr;
	std::vector<int> size;

	int first_dim(void) {return size[0];}
	int second_dim(void) {return size[1];}
	float *begin() { return addr;}
};

template<bool AT, bool BT>
__inline__ void gemm_rm(cublasHandle_t handle, float *C, float *A, float *B, float alpha=1.0f, float beta=0.0f) {

	float gemm_alpha = alpha;
	float gemm_beta = beta;

	if (!AT && !BT) {
		// no transpose
		assert(A.second_dim() == B.first_dim() && "Inner dimension doesn't match");
	}
	else if (AT && !BT) {
		// transpose A only
		assert(A.first_dim() == B.first_dim() && "Inner dimension doesn't match");
	}
	else if (!AT && BT) {
		assert(A.second_dim() == B.second_dim() && "Inner dimension doesn't match");
	}
	else {
		assert("Why not do ba instead of aTbT?" && false);
	}
	
	blasErrchk(	cublasSgemm(handle,
		(BT) ? CUBLAS_OP_T : CUBLAS_OP_N,
		(AT) ? CUBLAS_OP_T : CUBLAS_OP_N,
		C.second_dim(),	// m -> c.second = k
		C.first_dim(),	// n -> c.first = m
		A.second_dim(),	// k -> a.second = n
		&gemm_alpha,
		B.begin(), B.second_dim(),
		A.begin(), A.second_dim(),
		&gemm_beta,
		C.begin(), C.second_dim()));
}
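
For readers following along, the row-major trick this wrapper relies on can be sketched on the CPU. A row-major buffer read as column-major is the transpose, so row-major C = A * B can be obtained from a column-major routine as C^T = B^T * A^T by swapping the operands. The names gemm_colmajor, gemm_rowmajor and demo_product below are made up for illustration; the sketch does not call cuBLAS and leaves out the transpose flags.

```cpp
#include <cassert>
#include <vector>

// Reference column-major GEMM without transposes: C (m x n) = A (m x k) * B (k x n).
// Element (i, j) of a column-major matrix with leading dimension ld is at [i + j * ld].
void gemm_colmajor(int m, int n, int k,
                   const float* A, int lda,
                   const float* B, int ldb,
                   float* C, int ldc) {
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            float acc = 0.0f;
            for (int p = 0; p < k; ++p) {
                acc += A[i + p * lda] * B[p + j * ldb];
            }
            C[i + j * ldc] = acc;
        }
    }
}

// Row-major C (M x N) = A (M x K) * B (K x N), computed with the column-major
// routine by swapping the operands: the same memory, read column-major, holds
// C^T (N x M), B^T (N x K) and A^T (K x M), and C^T = B^T * A^T.
void gemm_rowmajor(int M, int N, int K,
                   const float* A, const float* B, float* C) {
    gemm_colmajor(/*m=*/N, /*n=*/M, /*k=*/K,
                  B, /*lda=*/N,
                  A, /*ldb=*/K,
                  C, /*ldc=*/N);
}

// Hypothetical helper for the usage example: multiplies a 2 x 3 by a 3 x 2 matrix.
std::vector<float> demo_product() {
    const std::vector<float> A = {1, 2, 3,
                                  4, 5, 6};   // 2 x 3, row-major
    const std::vector<float> B = {7, 8,
                                  9, 10,
                                  11, 12};    // 3 x 2, row-major
    std::vector<float> C(4, 0.0f);            // 2 x 2, row-major
    gemm_rowmajor(2, 2, 3, A.data(), B.data(), C.data());
    return C;
}
```

demo_product() returns {58, 64, 139, 154}, which is the row-major product of the two small matrices. Note that the k argument is the shared inner dimension K, whichever operand it is read from.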

Robert_Crovella · January 17, 2019, 11:59pm

I guess this isn’t actually the code you’re running.

you can’t possibly do

C.second_dim()

when C is a pointer to float. Even if C is a pointer to float_tensor, you can’t use C. (dot access); you could use C-> instead. Maybe you’re just sketching out ideas here? Not sure why code sketches would be relevant to the problem at hand. Or, if you have real code like this, why you wouldn’t just copy-paste that.

As an aside, the notion that ATBT = BA only applies to square matrices.

lyskule · January 18, 2019, 2:47am

Sorry that the title of this thread is misleading, but the error is actually from a call to cublasSgemm.

And you are right, this isn’t the code actually being used. I just want to provide a cleaner code here, rather than the messy actual code. The signature of this function is really

template<bool AT, bool BT>
__inline__ void gemm_rm(cublasHandle_t handle, float_tensor &C, float_tensor &A, float_tensor &B, float alpha=1.0f, float beta=0.0f);

But I still have this question: the same function works with matrices of size (m,k,n) 17408 x 32768 x 128, but not with 17408 x 16384 x 128 (only the middle dimension is different). And if I run it in memcheck mode, everything turns out fine.

Robert_Crovella · January 18, 2019, 4:33am

I suggest providing a short, complete, self-contained code that demonstrates the problem.

lyskule · January 18, 2019, 7:55pm

Thanks for your reply. I’ll try to describe it more clearly.

First of all, the code.

template <typename T>
class Tensor {
public:
	std::vector<int> dims_;
	size_t size_;
	std::shared_ptr<T> ptr_;

	struct deleteCudaPtr {
		void operator()(T *p) const {
			cudaFree(p);
		}
	};

	Tensor() {}

	Tensor(std::vector<int> dims) : dims_(dims) {
		T* tmp_ptr;
		size_ = std::accumulate(dims_.begin(), dims_.end(), 1, std::multiplies<int>());
		gpuErrchk( cudaMalloc(&tmp_ptr, sizeof(T) * size_) );
		ptr_.reset(tmp_ptr, deleteCudaPtr());
	}

	T* begin() const { return ptr_.get(); }
	T* end()   const { return ptr_.get() + size_; }
	size_t size() const { return size_; }
	int first_dim() const { return dims_[0]; }
	int second_dim() const { return dims_[1]; }
	size_t byte_size() const { return size_ * sizeof(T); }
	std::vector<int> dims() const { return dims_; }
	void fill_float(float num) {
		thrust::fill(thrust::device_ptr<T>(this->begin()),
			thrust::device_ptr<T>(this->end()), num);
	}

	void copy_from_host(T *h_tensor) {
		gpuErrchk(
			cudaMemcpy((void *)ptr_.get(), (const void *) h_tensor,
				byte_size(), cudaMemcpyHostToDevice)
		);
	}
};

typedef Tensor<float> float_tensor;

template<bool AT, bool BT>
__inline__ void gemm_rm(cublasHandle_t handle, float_tensor &C, float_tensor &A, float_tensor &B, float alpha=1.0f, float beta=0.0f) {

	float gemm_alpha = alpha;
	float gemm_beta = beta;

	if (!AT && !BT) {
		// no transpose
		assert(A.second_dim() == B.first_dim() && "Inner dimension doesn't match");
	}
	else if (AT && !BT) {
		// transpose A only
		assert(A.first_dim() == B.first_dim() && "Inner dimension doesn't match");
	}
	else if (!AT && BT) {
		// transpose B only
		assert(A.second_dim() == B.second_dim() && "Inner dimension doesn't match");
	}
	else {
		assert("Why not do ba instead of aTbT?" && false);
	}
	
	blasErrchk(	cublasSgemm(handle,
		(BT) ? CUBLAS_OP_T : CUBLAS_OP_N,
		(AT) ? CUBLAS_OP_T : CUBLAS_OP_N,
		C.second_dim(),	// m -> c.second = k
		C.first_dim(),	// n -> c.first = m
		A.second_dim(),	// k -> a.second = n
		&gemm_alpha,
		B.begin(), B.second_dim(),
		A.begin(), A.second_dim(),
		&gemm_beta,
		C.begin(), C.second_dim()));
}

Basically the float_tensor (a typedef for Tensor<float>) is a wrapper around CUDA device memory. The blas call is a wrapper around cuBLAS.
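
A side note on the Tensor constructor: std::accumulate with an initial value of 1 and std::multiplies<int> computes the element count in int before it is stored in size_. For the shapes in this thread the product fits (16384 * 17408 = 285212672), so this is not offered as the cause of the failure, only as a latent hazard for larger tensors. A minimal sketch of a variant that multiplies in size_t, with element_count as a made-up name:

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Multiply the dimensions in size_t so the running product is never held in an int.
std::size_t element_count(const std::vector<int>& dims) {
    return std::accumulate(dims.begin(), dims.end(), std::size_t{1},
                           [](std::size_t acc, int d) {
                               return acc * static_cast<std::size_t>(d);
                           });
}
```

With this helper, a 65536 x 32768 shape yields 2147483648 elements, a value that does not fit in a 32-bit int.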

The problem is that, I have two sets of matrices, they are

A: (32768, 17408), B: (32768, 128)
A: (16384, 17408), B: (16384, 128)

With everything the same, cublas won’t work on the second case. The way I call is:

blasErrchk(gemm_rm<true, false>(A, B)); // true to transpose A
gpuErrchk( cudaPeekAtLastError() );
gpuErrchk(cudaDeviceSynchronize());  // <---- error here, unspecified launch error

And I get an unspecified launch error at the cudaDeviceSynchronize() call. I’m positive that this blas call is the cause of the problem: if I remove this line, everything’s fine.

Another thing is, if I run the full code (posted below) in the CUDA debugger, with or without the memory checker, it runs fine.

Here’s the more complete code:

iter = 0
while (iter < 2) {
	X.copy_from_host(h_data);
	y.copy_from_host(h_label);
	float gradient_multiplier = lr / b;

	gpuErrchk( cudaPeekAtLastError() );
	gpuErrchk(cudaDeviceSynchronize());

	gemm_rm<false, false>(handle, y, X, weight, -1.0f, 1.0f);
	gpuErrchk( cudaPeekAtLastError() );
	gpuErrchk(cudaDeviceSynchronize());

	gemm_rm<true, false>(handle, weight, X, y, -gradient_multiplier, 1.0f);
	gpuErrchk( cudaPeekAtLastError() );
	gpuErrchk( cudaDeviceSynchronize()); // <-------- this is the line that errors out
	iter++;
}

Robert_Crovella · January 18, 2019, 8:06pm

Actually, I was asking for a complete code. Something I could copy, paste, compile, and run, without having to add anything or change anything, and see the issue (the kernel failure).

Robert_Crovella · January 18, 2019, 8:53pm

I have a suspicion that you have mixed up dimensions on your matrices.

What are the sizes of X, y and weight?
I’d like to see specifically the lines that create (with dimensions) those float_tensor objects.

As a diagnostic, you could simply print out the m,n,k values you are passing to the gemm call.
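
To have something to compare the printed values against, the expected m, n, k can be worked out from the row-major shapes. The sketch below assumes the swapped-operand scheme used by gemm_rm (row-major C = op(A) * op(B) computed as C^T = op(B)^T * op(A)^T); Shape, GemmDims, expected_dims and print_dims are hypothetical names, not part of the posted code.

```cpp
#include <cassert>
#include <cstdio>

// Shape of a row-major matrix: rows x cols.
struct Shape {
    int rows;
    int cols;
};

// The m, n, k values of a column-major GEMM call.
struct GemmDims {
    int m;
    int n;
    int k;
};

// Dimensions for computing row-major C = op(A) * op(B) through a column-major
// GEMM with swapped operands. k is the shared inner dimension, i.e. the
// column count of op(A).
GemmDims expected_dims(Shape C, Shape A, bool transpose_a) {
    GemmDims d;
    d.m = C.cols;                         // rows of C^T
    d.n = C.rows;                         // columns of C^T
    d.k = transpose_a ? A.rows : A.cols;  // columns of op(A)
    return d;
}

// Print the values so they can be set next to what the wrapper actually passes.
void print_dims(const char* label, GemmDims d) {
    std::printf("%s: m=%d n=%d k=%d\n", label, d.m, d.n, d.k);
}
```

For the failing case described earlier (A: 16384 x 17408 transposed, B: 16384 x 128, hence C: 17408 x 128), expected_dims gives m=128, n=17408, k=16384. The posted wrapper passes A.second_dim() as k in every branch, which agrees with this sketch only when A is not transposed.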

lyskule · January 18, 2019, 9:06pm

The reason I didn’t post the complete code is that it depends on a large data file.

Here’s the zip file: https://drive.google.com/file/d/1PAGIY4Atgow6E5xvpUnUEJGNeqPUFbOP/view?usp=sharing

Go to the Regression_proj/Regression folder and you’ll see the vs solution.

One thing you might have to change is the working directory, from Project->Regression Properties->Debugging

Thanks for your help!

lyskule · January 18, 2019, 9:13pm

This is the print out I currently have. If you look inside the code, it’s under Regression.h where the float_tensors are created.

Memory read complete
Allocated 16384 x 17408 tensor at 0000000704360000
Allocated 16384 x 128 tensor at 0000000748360000
Allocated 17408 x 128 tensor at 0000000748B60000
cuBLAS Success

Regression configuration:
  - batch:       16384
  - duplication: 2
  - reshaping:   64
  - GEMM1 size:  (16384, 17408, 128)
  - GEMM2 size:  (17408, 16384, 128)

Iteration 0

-- Launching Memcpy X
|-- Copying 16384 x 17408 (1.06) GB data to 0000000704360000

-- Launching Memcpy y
|-- Copying 16384 x 128 (0.01) GB data to 0000000748360000
-- CP finished

--Launching GEMM 1
-----------
NN GEMM
-- A = 16384 x 17408
-- B = 17408 x   128
-- C = 16384 x   128
cuBLAS Success

|-- Peek error success, syncing device...
-- GEMM 1 finished

-- Launching GEMM 2
-----------
TN GEMM
-- AT = 17408 x 16384
-- B  = 16384 x   128
-- C  = 17408 x   128
cuBLAS Success

GPUassert: unspecified launch failure c:\users\luyuan\developer\thesis\misc writing\regression\Regression.h 108
|-- Sync device success, syncing device...
GPUassert: unspecified launch failure c:\users\luyuan\developer\thesis\misc writing\regression\Regression.h 110
-- GEMM 2 finished

Robert_Crovella · January 18, 2019, 9:25pm

I don’t want to wade through your code, nor do I want to download zip files from the internet.

If you want help from me, you’ll need to respond to my requests more directly. If not, perhaps someone else will be able to help you.

I took what you have shown so far, and attempted to create a test case around it. I’m looking for something that is self-contained like this, from you, that demonstrates the issue. Feel free to modify mine if you like.

Note that the following code definitely has a problem, but I don’t know if the matrices that I am feeding in match yours:

$ cat t2.cu
#include <iostream>
#include <cublas_v2.h>
#include <vector>
#include <memory>
#include <numeric>
#include <assert.h>
#include <thrust/device_ptr.h>
#include <thrust/fill.h>

#define gpuErrchk(x) assert((x) == cudaSuccess)
#define blasErrchk(x) assert((x) == CUBLAS_STATUS_SUCCESS)

template <typename T>
class Tensor {
public:
        std::vector<int> dims_;
        size_t size_;
        std::shared_ptr<T> ptr_;

        struct deleteCudaPtr {
                void operator()(T *p) const {
                        cudaFree(p);
                }
        };

        Tensor() {}

        Tensor(std::vector<int> dims) : dims_(dims) {
                T* tmp_ptr;
                size_ = std::accumulate(dims_.begin(), dims_.end(), 1, std::multiplies<int>());
                gpuErrchk( cudaMalloc(&tmp_ptr, sizeof(T) * size_) );
                ptr_.reset(tmp_ptr, deleteCudaPtr());
        }

        T* begin() const { return ptr_.get(); }
        T* end()   const { return ptr_.get() + size_; }
        size_t size() const { return size_; }
        int first_dim() const { return dims_[0]; }
        int second_dim() const { return dims_[1]; }
        size_t byte_size() const { return size_ * sizeof(T); }
        std::vector<int> dims() const { return dims_; }
        void fill_float(float num) {
                thrust::fill(thrust::device_ptr<T>(this->begin()),
                        thrust::device_ptr<T>(this->end()), num);
        }

        void copy_from_host(T *h_tensor) {
                gpuErrchk(
                        cudaMemcpy((void *)ptr_.get(), (const void *) h_tensor,
                                byte_size(), cudaMemcpyHostToDevice)
                );
        }
};

typedef Tensor<float> float_tensor;

template<bool AT, bool BT>
__inline__ void gemm_rm(cublasHandle_t handle, float_tensor &C, float_tensor &A, float_tensor &B, float alpha=1.0f, float beta=0.0f) {

        float gemm_alpha = alpha;
        float gemm_beta = beta;

        if (!AT && !BT) {
                // no transpose
                assert(A.second_dim() == B.first_dim() && "Inner dimension doesn't match");
        }
        else if (AT && !BT) {
                // transpose A only
                assert(A.first_dim() == B.first_dim() && "Inner dimension doesn't match");
        }
        else if (!AT && BT) {
                // transpose B only
                assert(A.second_dim() == B.second_dim() && "Inner dimension doesn't match");
        }
        else {
                assert("Why not do ba instead of aTbT?" && false);
        }
        std::cout << "m: " << C.second_dim() << " n: " << C.first_dim() << " k: " << A.second_dim() << std::endl;
        blasErrchk(     cublasSgemm(handle,
                (BT) ? CUBLAS_OP_T : CUBLAS_OP_N,
                (AT) ? CUBLAS_OP_T : CUBLAS_OP_N,
                C.second_dim(), // m -> c.second = k
                C.first_dim(),  // n -> c.first = m
                A.second_dim(), // k -> a.second = n
                &gemm_alpha,
                B.begin(), B.second_dim(),
                A.begin(), A.second_dim(),
                &gemm_beta,
                C.begin(), C.second_dim()));
}

int main(){
  // A=X, B=y ?
  float_tensor X({16384,17408});
  float_tensor y({16384,128});
  float_tensor weight({17408, 128});
  cublasHandle_t handle;
  cublasStatus_t stat = cublasCreate(&handle);
  float *h_data  = new float[16384*17408];
  float *h_label = new float[16384*128];
  int iter = 0;
  float lr = 1.0f;
  float b = 1.0f;
  while (iter < 1) {
        X.copy_from_host(h_data);
        y.copy_from_host(h_label);
        float gradient_multiplier = lr / b;

        gpuErrchk( cudaPeekAtLastError() );
        gpuErrchk(cudaDeviceSynchronize());
        std::cout << "1" << std::endl;
        gemm_rm<false, false>(handle, y, X, weight, -1.0f, 1.0f);
        gpuErrchk( cudaPeekAtLastError() );
        gpuErrchk(cudaDeviceSynchronize());
        std::cout << "2" << std::endl;

        gemm_rm<true, false>(handle, weight, X, y, -gradient_multiplier, 1.0f);
        gpuErrchk( cudaPeekAtLastError() );
        gpuErrchk( cudaDeviceSynchronize()); // <-------- this is the line that errors out
        iter++;
  }
}
$ nvcc -o t2 t2.cu -std=c++11 -lcublas
$ ./t2
1
m: 128 n: 16384 k: 17408
2
m: 128 n: 17408 k: 17408
$
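
One possible reading of the second printed line (k: 17408 for the transposed call) is that the GEMM is asked to read more elements than the operands hold. This is an interpretation of the output above, not something confirmed in the thread. The sketch below does the arithmetic on the host, assuming the argument order of gemm_rm as posted: y is the first operand (not transposed, leading dimension 128) and X is the second (CUBLAS_OP_T, leading dimension 17408). The names span_elements, second_gemm_in_bounds, features and outputs are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Number of elements spanned by a column-major matrix of rows x cols with
// leading dimension ld (index of the last element, plus one).
std::size_t span_elements(std::size_t rows, std::size_t cols, std::size_t ld) {
    return cols == 0 ? 0 : (cols - 1) * ld + rows;
}

// Checks whether the second GEMM stays inside both input allocations when
// `k_passed` is given as k. `batch` is the first dimension of X and y.
bool second_gemm_in_bounds(std::size_t batch, std::size_t k_passed) {
    const std::size_t features = 17408;  // second dimension of X (hypothetical label)
    const std::size_t outputs = 128;     // second dimension of y (hypothetical label)

    const std::size_t y_alloc = batch * outputs;   // elements allocated for y
    const std::size_t x_alloc = batch * features;  // elements allocated for X

    // First operand: read as an (outputs x k) column-major matrix.
    const std::size_t y_read = span_elements(outputs, k_passed, outputs);
    // Second operand: stored as (features x k) because it is passed transposed.
    const std::size_t x_read = span_elements(features, k_passed, features);

    return y_read <= y_alloc && x_read <= x_alloc;
}
```

Under these assumptions, second_gemm_in_bounds(16384, 17408) is false (y holds 2097152 elements but 2228224 would be spanned), while second_gemm_in_bounds(32768, 17408) is true. That would be consistent with the 32768 case running without a launch failure while still computing with the wrong k, and with the 16384 case failing. Passing k = 16384 for the 16384 case keeps both reads in bounds.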
