So I ran into issues with several Tesla K20c GPUs running on Linux machines like this:
$ uname -a
Linux cluster-cn-211 3.2.0-61-generic #93-Ubuntu SMP Fri May 2 21:31:50 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
This is my GPU (there are two GPUs per computer):
$ nvidia-smi -a
Driver Version                  : 331.67

GPU 0000:03:00.0
    Product Name                : Tesla K20c
    ...
    FB Memory Usage
        Total                   : 4799 MiB
        Used                    : 12 MiB
        Free                    : 4787 MiB
    Ecc Mode
        Current                 : Enabled
        Pending                 : Enabled
When I run one of the CUDA sample applications, I get a huge number of result errors:
$ /usr/local/cuda/samples/0_Simple/matrixMul/matrixMul -wA=100 -hA=100 -wB=100 -hB=100 | head
[Matrix Multiply Using CUDA] - Starting...
GPU Device 1: "Tesla K20c" with compute capability 3.5

MatrixA(100,100), MatrixB(100,100)
Computing result using CUDA Kernel...
done
Performance= 91.03 GFlop/s, Time= 0.022 msec, Size= 2000000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness:
Error! Matrix=2150740335083746752462848.00000000, ref=1.00000000 error term is > 1.000000E-06
Error! Matrix=2150740335083746752462848.00000000, ref=1.00000000 error term is > 1.000000E-06
Error! Matrix=2150740335083746752462848.00000000, ref=1.00000000 error term is > 1.000000E-06
...
If I run the same binary under cuda-memcheck, I get errors like this:
$ /usr/local/cuda/bin/cuda-memcheck /usr/local/cuda/samples/0_Simple/matrixMul/matrixMul -wA=100 -hA=100 -wB=100 -hB=100 | head
========= CUDA-MEMCHECK
========= Invalid __global__ read of size 4
=========     at 0x00000158 in void matrixMulCUDA<int=32>(float*, float*, float*, int, int)
=========     by thread (11,5,0) in block (0,0,0)
=========     Address 0xb00213bfc is out of bounds
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame:/usr/lib/libcuda.so (cuLaunchKernel + 0x331) [0x138291]
=========     Host Frame:/usr/local/cuda/samples/0_Simple/matrixMul/matrixMul [0x1b5b8]
Please note: the same binary, the same Linux installation, the same NVIDIA driver, and the same CUDA installation work flawlessly with a different GPU (a GTX 680), even with much larger matrices. Only the Tesla K20c and the GTX Titan show this problem on my system.
Also, the log file /var/log/messages contains a huge number of lines like these:
Jun 11 21:43:40 myhost kernel: [16942.564565] init: Handling drivers-device-added event
Jun 11 21:43:41 myhost kernel: [16942.641190] init: Handling drivers-device-removed event
And when I run some CUDA code of my own (a kind of hello-world example), cudaMemcpy returns this error when copying from device to host:
77: an illegal memory access was encountered
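For reference, my test program is roughly along these lines (a minimal sketch, not my exact code; the kernel and variable names are placeholders). The error-checking pattern uses only standard CUDA runtime API calls:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel: increments each element of a small array.
__global__ void incrementKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] += 1.0f;
}

int main()
{
    const int n = 256;
    float host[n];
    for (int i = 0; i < n; ++i)
        host[i] = (float)i;

    float *dev = NULL;
    cudaError_t err;

    err = cudaMalloc(&dev, n * sizeof(float));
    if (err != cudaSuccess) {
        printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    err = cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);
    if (err != cudaSuccess) {
        printf("host-to-device copy failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    incrementKernel<<<(n + 255) / 256, 256>>>(dev, n);

    // This is where the failure shows up on the K20c: the device-to-host
    // copy returns error 77 (cudaErrorIllegalAddress). Note that the copy
    // is the first API call after the kernel launch, so it may be
    // reporting an error from the kernel itself.
    err = cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        printf("device-to-host copy failed: %d: %s\n",
               (int)err, cudaGetErrorString(err));
        return 1;
    }

    cudaFree(dev);
    printf("done\n");
    return 0;
}
```

On the GTX 680 this prints "done"; on the K20c the final cudaMemcpy fails with error 77.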
There are no ECC errors in nvidia-smi’s output.
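To double-check, I queried the ECC counters explicitly with nvidia-smi's ECC display filter:

```shell
# Show aggregate and volatile ECC error counters for all GPUs
nvidia-smi -q -d ECC
```

All single-bit and double-bit counters read 0 for both GPUs.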
What does all this mean? Do I have a hardware defect? Is there some important configuration step for dual-GPU nodes that I might have missed? Could this be a driver bug?