Hello NVIDIA forum,
I am running into some problems with my Tesla K80s and I have no idea how to start debugging this. It seems as if there is something wrong with memory access.
Please see my gist for all outputs (versions, cuda-memcheck, nvidia-smi, etc.). Note that the same code runs perfectly fine on GRID K2s.
https://gist.github.com/MarkusPfundstein/4f6d6103b713ee9f9e72
The exception thrown by Theano is:
Traceback (most recent call last):
  File "/home/net/nn/main.py", line 281, in
    args.func(args)
  File "/home/net/nn/main.py", line 72, in train
    nn.train(data_provider)
  File "/home/net/nn/NN/neural_net.py", line 399, in train
    cost = self._train(batch_offset[0], batch_offset[1])
  File "/home/net/anaconda/lib/python2.7/site-packages/theano/compile/function_module.py", line 606, in __call__
    storage_map=self.fn.storage_map)
  File "/home/net/anaconda/lib/python2.7/site-packages/theano/compile/function_module.py", line 595, in __call__
    outputs = self.fn()
RuntimeError: GpuCorrMM encountered a CUBLAS error: an internal operation failed
This could be a known bug in CUDA, please see the GpuCorrMM() documentation.
Apply node that caused the error: GpuCorrMM{valid, (1, 1)}(GpuContiguous.0, GpuContiguous.0)
Inputs types: [CudaNdarrayType(float32, 4D), CudaNdarrayType(float32, 4D)]
Inputs shapes: [(3, 5, 128, 128), (3, 5, 118, 118)]
Inputs strides: [(81920, 16384, 128, 1), (69620, 13924, 118, 1)]
Inputs values: ['not shown', 'not shown']
HINT: Re-running with most Theano optimization disabled could give you a back-trace of when this node was created. This can be done with by setting the Theano flag 'optimizer=fast_compile'. If that does not work, Theano optimizations can be disabled with 'optimizer=None'.
HINT: Use the Theano flag 'exception_verbosity=high' for a debugprint and storage map footprint of this apply node.
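Following those hints, I will re-run with the optimizer disabled and verbose exceptions. If I understand the Theano docs correctly, this can be passed via the THEANO_FLAGS environment variable, roughly like this (my actual command-line arguments elided):

THEANO_FLAGS='optimizer=fast_compile,exception_verbosity=high' python main.py ...

If that still does not give a usable back-trace, 'optimizer=None' disables the optimizations entirely, as the hint says.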
I'd really appreciate any help or pointers on where to look.
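In the meantime, here is a minimal sanity check I am going to try, to see whether a plain float32 GEMM (which also goes through CUBLAS on the GPU) already fails on the K80 or whether the problem is specific to GpuCorrMM. Just a sketch, shapes are arbitrary:

import numpy as np
import theano
import theano.tensor as T

# run with THEANO_FLAGS='device=gpu,floatX=float32' so the dot is
# compiled for the old GPU backend, which calls CUBLAS sgemm
a = T.matrix('a')
b = T.matrix('b')
f = theano.function([a, b], T.dot(a, b))

x = np.random.rand(512, 512).astype(np.float32)
y = np.random.rand(512, 512).astype(np.float32)
print(f(x, y).sum())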
With best regards
Markus