Solve one dense linear system in one thread block

Could anybody suggest how to solve a dense linear system in one thread block? I have a lot of them and want to solve them simultaneously. I know I can use CUBLAS for one system at a time, but that is not what I want. Any suggestion would be much appreciated.

CUBLAS also has batched routines for exactly this purpose. Look for the *Batched versions in the manual or check e.g. this Stack Overflow post.
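
For illustration, here is a minimal sketch of what a batched solve could look like with cublasSgetrfBatched (LU factorization) followed by cublasSgetrsBatched (triangular solves). The helper name solve_batched, its argument layout, and the error-code convention are assumptions made for this example and are not taken from the linked post. It assumes single precision, one right-hand side per system, and matrices stored back to back in column-major order.

```cuda
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Error paths return early without freeing memory; acceptable for a sketch only.
#define CHECK_CUDA(call)   do { if ((call) != cudaSuccess) return 1; } while (0)
#define CHECK_CUBLAS(call) do { if ((call) != CUBLAS_STATUS_SUCCESS) return 2; } while (0)

// Hypothetical helper: solves `batch` independent n x n systems A_i x_i = b_i.
// h_A: matrices back to back, column-major. h_b: right-hand sides, overwritten
// with the solutions. Returns 0 on success, nonzero on failure.
int solve_batched(int n, int batch, const float *h_A, float *h_b)
{
    const size_t matBytes = (size_t)batch * n * n * sizeof(float);
    const size_t vecBytes = (size_t)batch * n * sizeof(float);

    float *d_A = nullptr, *d_b = nullptr;
    CHECK_CUDA(cudaMalloc(&d_A, matBytes));
    CHECK_CUDA(cudaMalloc(&d_b, vecBytes));
    CHECK_CUDA(cudaMemcpy(d_A, h_A, matBytes, cudaMemcpyHostToDevice));
    CHECK_CUDA(cudaMemcpy(d_b, h_b, vecBytes, cudaMemcpyHostToDevice));

    // The batched routines expect a device array of device pointers.
    // The addresses are computed on the host, then the table is copied over.
    std::vector<float *> h_Aptr(batch), h_bptr(batch);
    for (int i = 0; i < batch; ++i) {
        h_Aptr[i] = d_A + (size_t)i * n * n;
        h_bptr[i] = d_b + (size_t)i * n;
    }
    float **d_Aptr = nullptr, **d_bptr = nullptr;
    CHECK_CUDA(cudaMalloc(&d_Aptr, batch * sizeof(float *)));
    CHECK_CUDA(cudaMalloc(&d_bptr, batch * sizeof(float *)));
    CHECK_CUDA(cudaMemcpy(d_Aptr, h_Aptr.data(), batch * sizeof(float *),
                          cudaMemcpyHostToDevice));
    CHECK_CUDA(cudaMemcpy(d_bptr, h_bptr.data(), batch * sizeof(float *),
                          cudaMemcpyHostToDevice));

    int *d_pivot = nullptr, *d_info = nullptr;
    CHECK_CUDA(cudaMalloc(&d_pivot, (size_t)batch * n * sizeof(int)));
    CHECK_CUDA(cudaMalloc(&d_info, batch * sizeof(int)));

    cublasHandle_t handle;
    CHECK_CUBLAS(cublasCreate(&handle));

    // In-place LU factorization with partial pivoting of every matrix.
    CHECK_CUBLAS(cublasSgetrfBatched(handle, n, d_Aptr, n, d_pivot, d_info, batch));

    // A nonzero entry in the info array flags a singular matrix.
    std::vector<int> h_luInfo(batch);
    CHECK_CUDA(cudaMemcpy(h_luInfo.data(), d_info, batch * sizeof(int),
                          cudaMemcpyDeviceToHost));
    for (int i = 0; i < batch; ++i)
        if (h_luInfo[i] != 0) return 3;

    // Solve with the LU factors; note that this info argument lives on the host.
    int h_info = 0;
    CHECK_CUBLAS(cublasSgetrsBatched(handle, CUBLAS_OP_N, n, 1, d_Aptr, n,
                                     d_pivot, d_bptr, n, &h_info, batch));

    CHECK_CUDA(cudaMemcpy(h_b, d_b, vecBytes, cudaMemcpyDeviceToHost));

    cublasDestroy(handle);
    cudaFree(d_info); cudaFree(d_pivot);
    cudaFree(d_bptr); cudaFree(d_Aptr);
    cudaFree(d_b);    cudaFree(d_A);
    return h_info == 0 ? 0 : 4;
}
```

Calling solve_batched(2, 2, A, b) with two 2x2 matrices stored one after the other in A would overwrite b with both solution vectors. Build with nvcc and link against cuBLAS (-lcublas).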

Thank you very much, @tera.

I have a small doubt about the code in the link you gave. In the LU decomposition part, I think the programmer tries to initialize a host array using a device array.

float **h_inout_pointers = (float **)malloc(Nmatrices*sizeof(float *));
for (int i=0; i<Nmatrices; i++)
    h_inout_pointers[i]=(float *)((char*)d_A+i*((size_t)N*N)*sizeof(float));

where d_A is a device array. Please correct me if I am wrong. Thanks in advance.

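The entries of h_inout_pointers are device pointers, because d_A, the base pointer they are derived from, is a device pointer. They are merely calculated on the host and then copied to the device. Computing an address on the host does not touch device memory; only dereferencing such a pointer on the host would be invalid.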

Admittedly, the pointer calculation is unnecessarily complicated. It would be much cleaner to just write

float **h_inout_pointers = (float **)malloc(Nmatrices*sizeof(float *));
for (int i=0; i<Nmatrices; i++)
    h_inout_pointers[i] = d_A + (size_t)i*N*N;
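
To make the host/device split concrete, here is a small sketch under stated assumptions: the kernel first_elements, the wrapper demo_pointer_table, and the names d_inout_pointers and d_out are made up for this illustration. The host only does arithmetic on d_A to fill the table; the addresses are dereferenced only inside the kernel, where each thread block reads from "its" matrix.

```cuda
#include <cstdlib>
#include <cuda_runtime.h>

// Each thread block reads the first element of its own matrix via the table.
__global__ void first_elements(float *const *matrices, float *out)
{
    if (threadIdx.x == 0)
        out[blockIdx.x] = matrices[blockIdx.x][0];  // dereferenced on the device
}

// Hypothetical demo: copies Nmatrices N x N matrices to the device, builds the
// pointer table on the host, and fetches each matrix's first element on the GPU.
// Returns 0 on success, 1 on any CUDA error (no cleanup on error, sketch only).
int demo_pointer_table(int Nmatrices, int N, const float *h_A, float *h_out)
{
    const size_t bytes = (size_t)Nmatrices * N * N * sizeof(float);
    float *d_A = nullptr;
    if (cudaMalloc(&d_A, bytes) != cudaSuccess) return 1;
    if (cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice) != cudaSuccess) return 1;

    // Host array whose VALUES are device addresses. Only pointer arithmetic
    // happens here; d_A is never dereferenced on the host.
    float **h_inout_pointers = (float **)malloc(Nmatrices * sizeof(float *));
    if (h_inout_pointers == nullptr) return 1;
    for (int i = 0; i < Nmatrices; i++)
        h_inout_pointers[i] = d_A + (size_t)i * N * N;

    // The table itself must also live in device memory before a kernel
    // (or a cuBLAS batched routine) can read it.
    float **d_inout_pointers = nullptr;
    if (cudaMalloc(&d_inout_pointers, Nmatrices * sizeof(float *)) != cudaSuccess) return 1;
    if (cudaMemcpy(d_inout_pointers, h_inout_pointers, Nmatrices * sizeof(float *),
                   cudaMemcpyHostToDevice) != cudaSuccess) return 1;

    float *d_out = nullptr;
    if (cudaMalloc(&d_out, Nmatrices * sizeof(float)) != cudaSuccess) return 1;

    first_elements<<<Nmatrices, 32>>>(d_inout_pointers, d_out);
    if (cudaGetLastError() != cudaSuccess) return 1;

    // cudaMemcpy waits for the kernel to finish before copying the results.
    if (cudaMemcpy(h_out, d_out, Nmatrices * sizeof(float),
                   cudaMemcpyDeviceToHost) != cudaSuccess) return 1;

    cudaFree(d_out);
    cudaFree(d_inout_pointers);
    cudaFree(d_A);
    free(h_inout_pointers);
    return 0;
}
```

The malloc'ed array is ordinary host memory; what makes its entries device pointers is the addresses stored in them, not where the array itself is allocated.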

How can h_inout_pointers hold device pointers when it is allocated only with malloc, not cudaMalloc? d_A is a device pointer, and it is accessed from the host in line 3. Can I do that?

What would be the problem if I replaced d_A with h_A in line 3?