Hi

I need to invert a double-precision matrix of size 2144 x 2144 via LU decomposition. I do this with the cuBLAS batched routines (batch size 1):

```
void invertMatrix(cublasHandle_t cublasHandle, double* src, double* dst, int n)
{
    int* permutation;
    int* info;
    int h_info1[1];
    int h_info2[1];
    cudaMalloc(&permutation, n * sizeof(int));
    cudaMalloc(&info, sizeof(int));

    // The batched routines expect a device array of device pointers
    // (here with batch size 1).
    double** A_h = (double**)malloc(sizeof(double*));
    double** A_d;
    cudaMalloc(&A_d, sizeof(double*));
    A_h[0] = src;
    cudaMemcpy(A_d, A_h, sizeof(double*), cudaMemcpyHostToDevice);

    // In-place LU factorization, then copy the status flag back.
    cublasDgetrfBatched(cublasHandle, n, A_d, n, permutation, info, 1);
    cudaMemcpyAsync(h_info1, info, sizeof(int), cudaMemcpyDeviceToHost);

    double** C_h = (double**)malloc(sizeof(double*));
    double** C_d;
    cudaMalloc(&C_d, sizeof(double*));
    C_h[0] = dst;
    cudaMemcpy(C_d, C_h, sizeof(double*), cudaMemcpyHostToDevice);

    // Compute the inverse from the LU factors into dst.
    cublasDgetriBatched(cublasHandle, n, (const double**)A_d, n, permutation, C_d, n, info, 1);
    cudaMemcpyAsync(h_info2, info, sizeof(int), cudaMemcpyDeviceToHost);

    cudaDeviceSynchronize();  // cudaThreadSynchronize() is deprecated
    if (h_info1[0] != 0 || h_info2[0] != 0)
    {
        fprintf(stderr, "Factorization of matrix failed: matrix may be singular\n");
        cudaDeviceReset();
        exit(EXIT_FAILURE);
    }

    cudaFree(A_d);
    free(A_h);
    cudaFree(C_d);
    free(C_h);
    cudaFree(permutation);
    cudaFree(info);
}
```

The code computes the correct result, but the performance is very bad: cuBLAS needs **10 seconds** to compute the inverse (MATLAB needs only **360 milliseconds**). I am using a GeForce Titan X.
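For reference, this is roughly how the call can be timed on the GPU side (a minimal sketch using CUDA events; `cublasHandle`, `d_src`, and `d_dst` are assumed to be an initialized handle and device buffers of size 2144 x 2144, set up as in the function above):

```cuda
// Minimal timing sketch with CUDA events. Assumes invertMatrix() from
// above; cublasHandle, d_src, and d_dst are placeholders for an
// initialized cuBLAS handle and allocated device matrices.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
invertMatrix(cublasHandle, d_src, d_dst, 2144);
cudaEventRecord(stop);
cudaEventSynchronize(stop);

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("invertMatrix took %.1f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);
```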

When I looked at the timeline in Nsight, I noticed that the kernels getri_panel and getrf_panel achieve only 6.2% occupancy (56.2% theoretical) and are launched with block size [128,1,1].

My program also uses a few kernels that I wrote myself; those achieve 75% occupancy. Why is cuBLAS performing so badly? Is there a way to tell cuBLAS what block size it should use?