Strange "unspecified launch error" from a call to cublas gemm

Hi there!

I encountered some really strange error, with host2device memcpy.

There’s no error thrown on cudaMalloc, and the size of the device memory is big enough to hold the data being transferred.

The real strange part is, it encounters no error in the memory checker (VS2015, enabled memory checking, ran in the CUDA debugging mode).

But once I did see some error like “grid launch failure” for one CuBLAS function, this happened only once.

The TDR is set to 3600, so it shouldn’t cause a problem here.

Some system info FWIW, Win10, CUDA10, VS2015, 780Ti card.

Thanks a lot for the help!

An “unspecified launch failure” is an error triggered by a CUDA kernel that was terminated abnormally. Errors are sticky until cleared, so if you encounter this error on a cudaMemcpy() that means you are missing proper error checking for a kernel launch upstream from the copy call.

“unspecified launch failure” could be due to an access out of bounds, hitting a GUI watchdog time limit, or any other error condition (thus “unspecified”). An error condition could be transient, such as a cosmic ray knocking out a bit in GPU memory. It could also involve indeterminate behavior, e.g. when there is a race condition.

The key to debugging is reproducibility. It sounds like you are unable to reproduce the error. You could still beef up the error checking for kernel launches and check for race conditions. Note that cuda-memcheck cannot detect all race conditions, and that code behavior may differ between debug and release builds.

In before @tera leaves his signature here.

Why would I want to leave my signature here?

How are your cudaMallocs going?

I indeed checked all cuda launches, but there seemed to be no error whatsoever. Let me briefly describe what I’m trying to do here. (some pseudo code here)

cudaMalloc(a, size1);
cudaMalloc(b, size2);

while(iter-- > 0) {
  cudaMemcpyHost2Device(h_data, a, size1);
  cudaMemcpyHost2Device(h_data, b, size2);
  
  cublas_sgemm(a, b);
  
  cublas_sgemm(b, a);   <--------- THIS ONE
  cudaPeekAtLastError();
  cudaDeviceSync();
}

I don’t think there’s racing, as they all happen on default stream, memcpy and compute shouldn’t have any overlapping.

cublas_sgemm(b, a) (on the fourth last line) is doing “good” as I checked its return code. BUT, everything runs smoothly with this line commented out, so this must be the issue.

I guess it might be something with the matrix sizes, but not quite sure.

Any suggestion welcome!

On the second gemm, size (m,k,n) 17408 x 32768 x 128 works, but 17408 x 16384 x 128 doesn’t…

cublas gemm is an asynchronous call. Therefore checking its return code doesn’t tell you if there was an execution error.

Run your code with cuda-memcheck

Is this observation solidly reproducible, i.e. for a given set of parameter, it always fails at the same iteration? After how many iterations? What are the SGEMM parameters used? Can you show your actual code? The issue may originate in code that you haven’t shown. Maybe your TDR control doesn’t work as intended. You might want to plot matrix size and execution time for increasing size and see where it fails.

I think it fairly safe to exclude a bug in cublasSgemm. This is one of the most heavily used functions across all of NVIDIA’s CUDA libraries, so it sees extensive testing both prior to release and through practical use in the field. What version of CUDA do you use?

If the failures occur kind of randomly, one might hypothesize about any number of failure mechanisms. What kind of GPU is this? GPUs age like all electronics, and eventually they become flaky, then die. Usually the DRAM on the board is the first component to fail, but I have seen a few failures of the actual GPU chip (in big farms of GPU that operate continuously for years).

Is there adequate cooling and power supply for the GPU? Insufficient power supply can lead to temporary local voltage drops (“brown outs”) which cause transistors to slow down and processors to malfunction. Is the GPU overclocked, that is, running at higher clocks than NVIDIA’s reference design? Some GPUs are already massively overclocked by the vendor. My recommendation is to avoid these models for CUDA work.

Failures due to overclocking can be very difficult to diagnose: I once operated an overclocked math co-processor which seemed to be doing fine, except that it occasionally produced a wrong square root result. And best I could tell, it was only the square root operation that was affected.

Note that overclocking and aging can work in combination. As the electronic components age (e.g. through electro migration, hot carrier injection), they operate more slowlt. For this reason electronic components are manufactured with an engineering margin, often 15%-20%, so that they work flawlessly over the intended life span, say five years. Overclocking exploits that engineering margin, leaving razor thin margins that are eaten up after a year or two of use: at that point the part fails.

[Later:] I notice belatedly that the OP already stated that the GPU here is a Geforce 780 Ti. This jogged my memory and I recalled that the AMBER folks found that particular model to be flaky even when new. For example, discussed here:

[url]http://archive.ambermd.org/201411/0374.html[/url]

My memory is hazy, but I think the issue was eventually chalked up to bad GPU memory timings parameters.

It’s indeed the cudaDeviceSync() who errored out.
Strangest thing, it runs fine in cuda-memcheck (VS2015, memcheck enabled, pressed “start CUDA debugging”).

TDR is doing as expected, I set that a long time ago and it’s doing well so far (had things ran for over 10 minutes). And yes, I agree this should not be CUBLAS’s problem…

It fails on the first iteration, at the second blas call, always. But you are right, it’s an old card, 780Ti, but it’s not overclocked as such.

My code is a bit messy, let me clean it up before posting it here.

Here’s the wrapper for cublas, if it helps. Definition of float_tensor is made up, please don’t mind. This function deals with row major matrices. That is, I have normal row-major matrices sitting in host memory, after I do the computation on GPU and copy it back, it should still be row-major on CPU.

struct float_tensor {
	float_tensor(int rows, int cols): size({rows, cols}) {
		checkedCall(cudaMalloc(addr, size[0] * size[1] * sizeof(float)));
	}

	float *addr;
	std::vector<int> size;

	int first_dim(void) {return size[0];}
	int second_dim(void) {return size[1];}
	float *begin() { return addr;}
};

template<bool AT, bool BT>
__inline__ void gemm_rm(cublasHandle_t handle, float *C, float *A, float *B, float alpha=1.0f, float beta=0.0f) {

	float gemm_alpha = alpha;
	float gemm_beta = beta;

	if (!AT && !BT) {
		// no transpose
		assert(A.second_dim() == B.first_dim() && "Inner dimension doesn't match");
	}
	else if (AT && !BT) {
		// transpose A only
		assert(A.first_dim() == B.first_dim() && "Inner dimension doesn't match");
	}
	else if (!AT && BT) {
		assert(A.second_dim() == B.second_dim() && "Inner dimension doesn't match");
	}
	else {
		assert("Why not do ba instead of aTbT?" && false);
	}
	
	blasErrchk(	cublasSgemm(handle,
		(BT) ? CUBLAS_OP_T : CUBLAS_OP_N,
		(AT) ? CUBLAS_OP_T : CUBLAS_OP_N,
		C.second_dim(),	// m -> c.second = k
		C.first_dim(),	// n -> c.first = m
		A.second_dim(),	// k -> a.second = n
		&gemm_alpha,
		B.begin(), B.second_dim(),
		A.begin(), A.second_dim(),
		&gemm_beta,
		C.begin(), C.second_dim()));
}

I guess this isn’t actually the code you’re running.

you can’t possibly do

C.second_dim()

when C is a pointer to float. Even if C is a pointer to float_tensor, you can’t do C. You could do C-> Maybe you’re just sketching out ideas here? Not sure why code sketches would be relevant to the problem at hand. Or if you have real code like this, why you wouldn’t just copy-paste that.

As an aside, the notion that ATBT = BA only applies to square matrices.

Sorry that the title of this thread is misleading, but the error is actually from a call to cublasSgemm.

And you are right, this isn’t the code actually being used. I just want to provide a cleaner code here, rather than the messy actual code. The signature of this function is really

template<bool AT, bool BT>
__inline__ void gemm_rm(cublasHandle_t handle, float_tensor &C, float_tensor &A, float_tensor &B, float alpha=1.0f, float beta=0.0f);

But I still has this question, the same function works with matrices of size (m,k,n) 17408 x 32768 x 128 works, but not with 17408 x 16384 x 128 (just the middle dimension is different). And if I run it in memcheck mode, everything turned out fine.

I suggest provided a short, complete, self-contained code that demonstrates the problem.

Thanks for your reply. I’ll try to describe it more clearly.

First of all, the code.

template <typename T>
class Tensor {
public:
	std::vector<int> dims_;
	size_t size_;
	std::shared_ptr<T> ptr_;

	struct deleteCudaPtr {
		void operator()(T *p) const {
			cudaFree(p);
		}
	};

	Tensor() {}

	Tensor(std::vector<int> dims) : dims_(dims) {
		T* tmp_ptr;
		size_ = std::accumulate(dims_.begin(), dims_.end(), 1, std::multiplies<int>());
		gpuErrchk( cudaMalloc(&tmp_ptr, sizeof(T) * size_) );
		ptr_.reset(tmp_ptr, deleteCudaPtr());
	}

	T* begin() const { return ptr_.get(); }
	T* end()   const { return ptr_.get() + size_; }
	size_t size() const { return size_; }
	int first_dim() const { return dims_[0]; }
	int second_dim() const { return dims_[1]; }
	size_t byte_size() const { return size_ * sizeof(T); }
	std::vector<int> dims() const { return dims_; }
	void fill_float(float num) {
		thrust::fill(thrust::device_ptr<T>(this->begin()),
			thrust::device_ptr<T>(this->end()), num);
	}

	void copy_from_host(T *h_tensor) {
		gpuErrchk(
			cudaMemcpy((void *)ptr_.get(), (const void *) h_tensor,
				byte_size(), cudaMemcpyHostToDevice)
		);
	}
};

typedef Tensor<float> float_tensor;

template<bool AT, bool BT>
__inline__ void gemm_rm(cublasHandle_t handle, float_tensor &C, float_tensor &A, float_tensor &B, float alpha=1.0f, float beta=0.0f) {

	float gemm_alpha = alpha;
	float gemm_beta = beta;

	if (!AT && !BT) {
		// no transpose
		assert(A.second_dim() == B.first_dim() && "Inner dimension doesn't match");
	}
	else if (AT && !BT) {
		// transpose A only
		assert(A.first_dim() == B.first_dim() && "Inner dimension doesn't match");
	}
	else if (!AT && BT) {
		// transpose B only
		assert(A.second_dim() == B.second_dim() && "Inner dimension doesn't match");
	}
	else {
		assert("Why not do ba instead of aTbT?" && false);
	}
	
	blasErrchk(	cublasSgemm(handle,
		(BT) ? CUBLAS_OP_T : CUBLAS_OP_N,
		(AT) ? CUBLAS_OP_T : CUBLAS_OP_N,
		C.second_dim(),	// m -> c.second = k
		C.first_dim(),	// n -> c.first = m
		A.second_dim(),	// k -> a.second = n
		&gemm_alpha,
		B.begin(), B.second_dim(),
		A.begin(), A.second_dim(),
		&gemm_beta,
		C.begin(), C.second_dim()));
}

Basically the float_tensor (typedef Tensor) is a wrapper around cuda device memory. The blas call is a wrapper around cublas.

The problem is that, I have two sets of matrices, they are

  1. A: (32768, 17408), B: (32768, 128)
  2. A: (16384, 17408), B: (16384, 128)

With everything the same, cublas won’t work on the second case. The way I call is:

blasErrchk(gemm_rm<true, false>(A, B)); // true to transpose A
gpuErrchk( cudaPeekAtLastError() );
gpuErrchk(cudaDeviceSynchronize());  // <---- error here, unspecified launch error

And I get an unspecified launch error at the cudaDeviceSynchronize() call. I’m positive that this blas call is the cause of the problem: if I remove this line, everything’s fine.

Another thing is, if I run the full code (posted below) in the cuda debugger, with or withoug memchecker, it runs fine.

Here’s the more complete code:

iter = 0
while (iter < 2) {
	X.copy_from_host(h_data);
	y.copy_from_host(h_label);
	float gradient_multiplier = lr / b;

	gpuErrchk( cudaPeekAtLastError() );
	gpuErrchk(cudaDeviceSynchronize());

	gemm_rm<false, false>(handle, y, X, weight, -1.0f, 1.0f);
	gpuErrchk( cudaPeekAtLastError() );
	gpuErrchk(cudaDeviceSynchronize());

	gemm_rm<true, false>(handle, weight, X, y, -gradient_multiplier, 1.0f);
	gpuErrchk( cudaPeekAtLastError() );
	gpuErrchk( cudaDeviceSynchronize()); // <-------- this is line line that errors out
	iter++;
}

Actually, I was asking for a complete code. Something I could copy, paste, compile, and run, without having to add anything or change anything, and see the issue (the kernel failure).

I have a suspicion that you have mixed up dimensions on your matrices.

What are the sizes of X, y and weight?
I’d like to see specifically the lines that create (with dimensions) those float_tensor objects.

As a diagnostic, you could simply print out the m,n,k values you are passing to the gemm call.

The reason I’m not posting is it dealt with a large data file.

Here’s the zip file. [url]https://drive.google.com/file/d/1PAGIY4Atgow6E5xvpUnUEJGNeqPUFbOP/view?usp=sharing[/url]

Go to the Regression_proj/Regression folder and you’ll see the vs solution.

One thing you might have to change is the working directory, from Project->Regression Properties->Debugging

Thanks for your help!

This is the print out I currently have. If you look inside the code, it’s under Regression.h where the float_tensors are created.

Memory read complete
Allocated 16384 x 17408 tensor at 0000000704360000
Allocated 16384 x 128 tensor at 0000000748360000
Allocated 17408 x 128 tensor at 0000000748B60000
cuBLAS Success

Regression configuration:
  - batch:       16384
  - duplication: 2
  - reshaping:   64
  - GEMM1 size:  (16384, 17408, 128)
  - GEMM2 size:  (17408, 16384, 128)

Iteration 0

-- Launching Memcpy X
|-- Copying 16384 x 17408 (1.06) GB data to 0000000704360000

-- Launching Memcpy y
|-- Copying 16384 x 128 (0.01) GB data to 0000000748360000
-- CP finished

--Launching GEMM 1
-----------
NN GEMM
-- A = 16384 x 17408
-- B = 17408 x   128
-- C = 16384 x   128
cuBLAS Success

|-- Peek error success, syncing device...
-- GEMM 1 finished

-- Launching GEMM 2
-----------
TN GEMM
-- AT = 17408 x 16384
-- B  = 16384 x   128
-- C  = 17408 x   128
cuBLAS Success

GPUassert: unspecified launch failure c:\users\luyuan\developer\thesis\misc writing\regression\Regression.h 108
|-- Sync device success, syncing device...
GPUassert: unspecified launch failure c:\users\luyuan\developer\thesis\misc writing\regression\Regression.h 110
-- GEMM 2 finished

I don’t want to wade through your code, nor do I want to download zip files from the internet.

If you want help from me, you’ll need to respond to my requests more directly. If not, perhaps someone else will be able to help you.

I took what you have shown so far, and attempted to create a test case around it. I’m looking for something that is self-contained like this, from you, that demonstrates the issue. Feel free to modify mine if you like.

Note that the following code definitely has a problem, but I don’t know if the matrices that I am feeding in match yours:

$ cat t2.cu
#include <iostream>
#include <cublas_v2.h>
#include <vector>
#include <memory>
#include <numeric>
#include <assert.h>
#include <thrust/device_ptr.h>
#include <thrust/fill.h>

#define gpuErrchk(x) assert((x) == cudaSuccess)
#define blasErrchk(x) assert((x) == CUBLAS_STATUS_SUCCESS)

template <typename T>
class Tensor {
public:
        std::vector<int> dims_;
        size_t size_;
        std::shared_ptr<T> ptr_;

        struct deleteCudaPtr {
                void operator()(T *p) const {
                        cudaFree(p);
                }
        };

        Tensor() {}

        Tensor(std::vector<int> dims) : dims_(dims) {
                T* tmp_ptr;
                size_ = std::accumulate(dims_.begin(), dims_.end(), 1, std::multiplies<int>());
                gpuErrchk( cudaMalloc(&tmp_ptr, sizeof(T) * size_) );
                ptr_.reset(tmp_ptr, deleteCudaPtr());
        }

        T* begin() const { return ptr_.get(); }
        T* end()   const { return ptr_.get() + size_; }
        size_t size() const { return size_; }
        int first_dim() const { return dims_[0]; }
        int second_dim() const { return dims_[1]; }
        size_t byte_size() const { return size_ * sizeof(T); }
        std::vector<int> dims() const { return dims_; }
        void fill_float(float num) {
                thrust::fill(thrust::device_ptr<T>(this->begin()),
                        thrust::device_ptr<T>(this->end()), num);
        }

        void copy_from_host(T *h_tensor) {
                gpuErrchk(
                        cudaMemcpy((void *)ptr_.get(), (const void *) h_tensor,
                                byte_size(), cudaMemcpyHostToDevice)
                );
        }
};

typedef Tensor<float> float_tensor;

template<bool AT, bool BT>
__inline__ void gemm_rm(cublasHandle_t handle, float_tensor &C, float_tensor &A, float_tensor &B, float alpha=1.0f, float beta=0.0f) {

        float gemm_alpha = alpha;
        float gemm_beta = beta;

        if (!AT && !BT) {
                // no transpose
                assert(A.second_dim() == B.first_dim() && "Inner dimension doesn't match");
        }
        else if (AT && !BT) {
                // transpose A only
                assert(A.first_dim() == B.first_dim() && "Inner dimension doesn't match");
        }
        else if (!AT && BT) {
                // transpose B only
                assert(A.second_dim() == B.second_dim() && "Inner dimension doesn't match");
        }
        else {
                assert("Why not do ba instead of aTbT?" && false);
        }
        std::cout << "m: " << C.second_dim() << " n: " << C.first_dim() << " k: " << A.second_dim() << std::endl;
        blasErrchk(     cublasSgemm(handle,
                (BT) ? CUBLAS_OP_T : CUBLAS_OP_N,
                (AT) ? CUBLAS_OP_T : CUBLAS_OP_N,
                C.second_dim(), // m -> c.second = k
                C.first_dim(),  // n -> c.first = m
                A.second_dim(), // k -> a.second = n
                &gemm_alpha,
                B.begin(), B.second_dim(),
                A.begin(), A.second_dim(),
                &gemm_beta,
                C.begin(), C.second_dim()));
}

int main(){
  // A=X, B=y ?
  float_tensor X({16384,17408});
  float_tensor y({16384,128});
  float_tensor weight({17408, 128});
  cublasHandle_t handle;
  cublasStatus_t stat = cublasCreate(&handle);
  float *h_data  = new float[16384*17408];
  float *h_label = new float[16384*128];
  int iter = 0;
  float lr = 1.0f;
  float b = 1.0f;
  while (iter < 1) {
        X.copy_from_host(h_data);
        y.copy_from_host(h_label);
        float gradient_multiplier = lr / b;

        gpuErrchk( cudaPeekAtLastError() );
        gpuErrchk(cudaDeviceSynchronize());
        std::cout << "1" << std::endl;
        gemm_rm<false, false>(handle, y, X, weight, -1.0f, 1.0f);
        gpuErrchk( cudaPeekAtLastError() );
        gpuErrchk(cudaDeviceSynchronize());
        std::cout << "2" << std::endl;

        gemm_rm<true, false>(handle, weight, X, y, -gradient_multiplier, 1.0f);
        gpuErrchk( cudaPeekAtLastError() );
        gpuErrchk( cudaDeviceSynchronize()); // <-------- this is line line that errors out
        iter++;
  }
}
$ nvcc -o t2 t2.cu -std=c++11 -lcublas
$ ./t2
1
m: 128 n: 16384 k: 17408
2
m: 128 n: 17408 k: 17408
$