So I ran into issues with several Tesla K20c GPUs running on Linux machines like this:
$ uname -a
Linux cluster-cn-211 3.2.0-61-generic #93-Ubuntu SMP Fri May 2 21:31:50 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
This is my GPU (there are two GPUs per computer):
$ nvidia-smi -a
Driver Version                  : 331.67

GPU 0000:03:00.0
    Product Name                : Tesla K20c
    ...
    FB Memory Usage
        Total                   : 4799 MiB
        Used                    : 12 MiB
        Free                    : 4787 MiB
    Ecc Mode
        Current                 : Enabled
        Pending                 : Enabled
When I run one of the CUDA sample applications, I get a huge number of result errors:
$ /usr/local/cuda/samples/0_Simple/matrixMul/matrixMul -wA=100 -hA=100 -wB=100 -hB=100 | head
[Matrix Multiply Using CUDA] - Starting...
GPU Device 1: "Tesla K20c" with compute capability 3.5

MatrixA(100,100), MatrixB(100,100)
Computing result using CUDA Kernel...
done
Performance= 91.03 GFlop/s, Time= 0.022 msec, Size= 2000000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness:
Error! Matrix=2150740335083746752462848.00000000, ref=1.00000000 error term is > 1.000000E-06
Error! Matrix=2150740335083746752462848.00000000, ref=1.00000000 error term is > 1.000000E-06
Error! Matrix=2150740335083746752462848.00000000, ref=1.00000000 error term is > 1.000000E-06
...
If I run the same binary under cuda-memcheck, I get errors like this:
$ /usr/local/cuda/bin/cuda-memcheck /usr/local/cuda/samples/0_Simple/matrixMul/matrixMul -wA=100 -hA=100 -wB=100 -hB=100 | head
========= CUDA-MEMCHECK
========= Invalid __global__ read of size 4
=========     at 0x00000158 in void matrixMulCUDA<int=32>(float*, float*, float*, int, int)
=========     by thread (11,5,0) in block (0,0,0)
=========     Address 0xb00213bfc is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib/libcuda.so (cuLaunchKernel + 0x331) [0x138291]
=========     Host Frame:/usr/local/cuda/samples/0_Simple/matrixMul/matrixMul [0x1b5b8]
Please note: the same binary, the same Linux installation, the same NVIDIA driver, and the same CUDA installation work flawlessly with a different GPU (a GTX 680), even with much larger matrices. Only the Tesla K20c and the GTX Titan show this problem on my system.
Also, the log file /var/log/messages contains a huge number of lines like these:
Jun 11 21:43:40 myhost kernel: [16942.564565] init: Handling drivers-device-added event
Jun 11 21:43:41 myhost kernel: [16942.641190] init: Handling drivers-device-removed event
And when I run some CUDA code of my own (a kind of hello-world example), cudaMemcpy returns this error when copying from device to host:
77: an illegal memory access was encountered
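For reference, my test program is roughly along these lines (a minimal sketch, not my exact code; the kernel and variable names are placeholders). The error-checking pattern uses only standard CUDA runtime API calls:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel: increments each element of a small array.
__global__ void incrementKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1.0f;
}

int main()
{
    const int n = 256;
    float host[n];
    for (int i = 0; i < n; ++i)
        host[i] = (float)i;

    float *dev = NULL;
    cudaError_t err;

    err = cudaMalloc(&dev, n * sizeof(float));
    if (err != cudaSuccess) {
        printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    err = cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);
    if (err != cudaSuccess) {
        printf("host-to-device copy failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    incrementKernel<<<(n + 255) / 256, 256>>>(dev, n);

    // This is where the failure shows up on the K20c: the device-to-host
    // copy returns error 77 (cudaErrorIllegalAddress). Note that the copy
    // is the first API call after the kernel launch, so it may be
    // reporting an error from the kernel itself.
    err = cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        printf("device-to-host copy failed: %d: %s\n",
               (int)err, cudaGetErrorString(err));
        return 1;
    }

    cudaFree(dev);
    printf("done\n");
    return 0;
}
```

On the GTX 680 this prints "done"; on the K20c the final cudaMemcpy fails with error 77.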
There are no ECC errors in nvidia-smi’s output.
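To double-check, I queried the ECC counters explicitly with nvidia-smi's ECC display filter:

```shell
# Show aggregate and volatile ECC error counters for all GPUs
nvidia-smi -q -d ECC
```

All single-bit and double-bit counters read 0 for both GPUs.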
What does all this mean? Do I have a hardware defect? Is there some important configuration step for dual-GPU nodes that I might have missed? Could this be a driver bug?