Theano ConvNet + NVIDIA Tesla-K80 - CuBLAS errors -> memcpyAsync

Hello NVIDIA-forum,

I am running into some problems with my Tesla K80s and have no idea how to start debugging this. It seems as if something is wrong with the memory access.

Please see my gist for all outputs (versions, cuda-memcheck, nvidia-smi, etc.). Note that it runs perfectly fine on Grid K2s.

The exception thrown by theano is:
Traceback (most recent call last):
File "/home/net/nn/", line 281, in
File "/home/net/nn/", line 72, in train
File "/home/net/nn/NN/", line 399, in train
cost = self._train(batch_offset[0], batch_offset[1])
File "/home/net/anaconda/lib/python2.7/site-packages/theano/compile/", line 606, in call
File "/home/net/anaconda/lib/python2.7/site-packages/theano/compile/", line 595, in call
outputs = self.fn()
RuntimeError: GpuCorrMM encountered a CUBLAS error: an internal operation failed
This could be a known bug in CUDA, please see the GpuCorrMM() documentation.

Apply node that caused the error: GpuCorrMM{valid, (1, 1)}(GpuContiguous.0, GpuContiguous.0)
Inputs types: [CudaNdarrayType(float32, 4D), CudaNdarrayType(float32, 4D)]
Inputs shapes: [(3, 5, 128, 128), (3, 5, 118, 118)]
Inputs strides: [(81920, 16384, 128, 1), (69620, 13924, 118, 1)]
Inputs values: ['not shown', 'not shown']

HINT: Re-running with most Theano optimization disabled could give you a back-trace of when this node was created. This can be done with by setting the Theano flag 'optimizer=fast_compile'. If that does not work, Theano optimizations can be disabled with 'optimizer=None'.
HINT: Use the Theano flag 'exception_verbosity=high' for a debugprint and storage map footprint of
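For reference, both hints can be applied from the shell via the THEANO_FLAGS environment variable when launching the script (train.py below is just a placeholder for the actual entry point):

```shell
# Placeholder script name; substitute your own entry point.
# Get a back-trace of where the failing node was created:
THEANO_FLAGS='optimizer=fast_compile,exception_verbosity=high' python train.py

# If that is not enough, disable graph optimizations entirely (much slower):
THEANO_FLAGS='optimizer=None,exception_verbosity=high' python train.py
```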

I'd really appreciate any help or pointers about where to look.

With best regards

Have you looked at the documentation for GpuCorrMM(), as suggested by the error message above?

CUBLAS internal operation failures are usually indicative of a kernel failing to execute on the GPU. Your CUDA installation may not be operational. As a first step, make sure that you can run some of the simple CUDA and CUBLAS sample applications that ship with CUDA.
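To make that concrete, assuming the samples were installed in the default location (paths may differ on your system), something like the following is a quick sanity check:

```shell
# Typical CUDA sample locations under the toolkit prefix; adjust as needed.
cd /usr/local/cuda/samples/1_Utilities/deviceQuery
make && ./deviceQuery      # should enumerate all GPUs and finish with "Result = PASS"

cd ../../7_CUDALibraries/simpleCUBLAS
make && ./simpleCUBLAS     # minimal CUBLAS test; exercises the same library GpuCorrMM uses
```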

Hey Njuffa,

thanks a lot for your answer. I thought about the same thing. Here is a test run of all the NVIDIA samples that ship with the CUDA SDK:

As you can see, all but four of them succeed. The four that fail are:

  • "Error: Condition (allocation_cb == 1) failed at cuHook.cpp:155"
    "cuHook sample failed (Didn't receive the allocation callback)"

  • no idea what goes wrong here; it says: value of TestResult 0

  • (inter-GPU transfer) "Data check error at index 0 in process 1!"

  • times out

Best I can tell from the Theano sources, the issue with GpuCorrMM() mentioned in the message pertains to CUDA 5.0 through CUDA 6.0. What CUDA version are you running?
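(For reference, the toolkit and driver versions can be read off with:)

```shell
nvcc --version   # CUDA toolkit release the code was compiled with
nvidia-smi       # driver version plus the GPUs it sees
```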

I cannot readily think of a scenario where code runs fine on a Grid K2 (high-end sm_30 device) and fails to run on a K80 (high-end sm_37 device). I am not familiar with Theano. How are you building the code? From what I can tell, a user can build various configurations, e.g. with or without CUDNN support.
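As an illustration of the build-configuration question: in reasonably recent Theano versions the cuDNN path can be toggled through ~/.theanorc (whether the dnn section is honored depends on the installed version; this is a sketch, not a verified config for this setup):

```ini
[global]
device = gpu
floatX = float32

[dnn]
enabled = False   ; fall back to GpuCorrMM instead of cuDNN convolutions
```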

I'm having the same problem with a GTX 1080 and CUDA 8.0.
It works fine with models that have fewer layers, but with VGG19 or VGG16 it fails.

Solved by reinstalling Theano from source.
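For anyone landing here later, "from source" here presumably means the development version from the GitHub repository rather than the packaged release, e.g.:

```shell
pip uninstall theano
# Development version straight from the Theano repository:
pip install --upgrade --no-deps git+https://github.com/Theano/Theano.git
```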