Driver-dependent kernel execution speed: kernel slow-down with newer drivers

I was having stability issues with driver version 177.13 on a GTX 260 card, so I upgraded through 177.67, 177.73, and 177.80. With 177.13, execution time per block of computation was 8.5 seconds; with the newer drivers it increased to 13 seconds. With cudaprof, I traced the culprit to a single kernel that now runs about 3.5x slower. All of the profiler counters for this kernel (number of loads, stores, etc.) are unchanged; only the execution time differs. Has anyone seen this type of behavior? For interested parties, the kernel adds the outer product of two vectors to a matrix in global memory. It does this for many matrices in parallel, hence the arrays of pointers.

template<typename T> __global__ void
update_inverse_cuda2 (T **A_g, T **Ainv_g, T **u_g,
                      T **Ainv_delta_g, T **Ainv_colk_g,
                      int N, int rowstride, int k)
{
  // Each block in the y-dimension works on its own matrix.
  __shared__ T *A, *u, *Ainv, *Ainv_delta, *Ainv_colk;
  if (threadIdx.x == 0) {
    A          = A_g[blockIdx.y];
    u          = u_g[blockIdx.y];
    Ainv       = Ainv_g[blockIdx.y];
    Ainv_delta = Ainv_delta_g[blockIdx.y];
    Ainv_colk  = Ainv_colk_g[blockIdx.y];
  }
  __syncthreads();

  __shared__ T Ainv_delta_shared[DET_BLOCK_SIZE];
  __shared__ T Ainv_colk_shared[DET_BLOCK_SIZE];
  int col = blockIdx.x*DET_BLOCK_SIZE + threadIdx.x;
  // Read the data back from global memory
  Ainv_delta_shared[threadIdx.x] = Ainv_delta[col];
  Ainv_colk_shared[threadIdx.x]  = Ainv_colk[col];

  A[k*rowstride + col] = u[col];

  __shared__ T prefact;
  if (threadIdx.x == 0)
    prefact = -1.0f/(1.0f + Ainv_delta[k]);
  __syncthreads();

  int numblocks = N / DET_BLOCK_SIZE;
  for (int block=0; block<numblocks; block++) {
    // Stage the prescaled slice of the column vector for this row block.
    Ainv_colk_shared[threadIdx.x] =
      prefact*Ainv_colk[block*DET_BLOCK_SIZE + threadIdx.x];
    __syncthreads();
    for (int i=0; i<DET_BLOCK_SIZE; i++) {
      int row = block*DET_BLOCK_SIZE + i;
      Ainv[row*rowstride + col] +=
        Ainv_delta_shared[threadIdx.x]*Ainv_colk_shared[i];
    }
    __syncthreads();
  }
}