memory distribution question

I have about 10000 8*8 matrices to take the inverses of. I have implemented the function first and am now working on optimizations. The first optimization I am looking at is changing from generic mallocs to cudaMallocPitch. I have a question about cudaMallocPitch, which I seem to have gotten wrong, because the way I tried it doesn't output the correct results. First question: how do you lay out host memory so that it is in the correct form for a cudaMemcpy2D operation to the device? Second: if I am going to have to copy the array from global mem to shared mem, what is the best way to structure the shared mem to optimize performance? Any help would be greatly appreciated.
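
For reference, here is a minimal sketch of the pattern being asked about. It is not the inversion code from this post. The layout (one 8*8 matrix per pitched row, 64 floats wide), the names transposeViaShared, runPitchedRoundTrip and MATS_PER_BLOCK, and the use of a transpose as a stand-in for the inverse are all assumptions made for illustration. The host buffer needs no special layout for cudaMemcpy2D: it can be an ordinary contiguous array, and the source pitch passed to the call is simply the host row width in bytes, while the device pitch is whatever cudaMallocPitch returned. The shared tile pads each row by one float, which is a common way to reduce bank conflicts for column-wise access; whether it helps for a particular inversion kernel is something to measure.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int DIM = 8;               // each matrix is DIM x DIM
constexpr int ELEMS = DIM * DIM;     // 64 floats per matrix
constexpr int MATS_PER_BLOCK = 4;    // hypothetical tuning choice

// Stand-in for the real work: stage each matrix in shared memory, then
// write its transpose. One thread per element, MATS_PER_BLOCK matrices per block.
__global__ void transposeViaShared(const float* in, size_t inPitch,
                                   float* out, size_t outPitch, int n)
{
    // Rows are padded to DIM + 1 floats to reduce bank conflicts on column reads.
    __shared__ float tile[MATS_PER_BLOCK][DIM][DIM + 1];

    const int m = blockIdx.x * MATS_PER_BLOCK + threadIdx.y;  // matrix index
    const int r = threadIdx.x / DIM;
    const int c = threadIdx.x % DIM;
    const bool active = (m < n);

    if (active) {
        // The pitch is in bytes, so step through the buffer as char*.
        const float* src = (const float*)((const char*)in + (size_t)m * inPitch);
        tile[threadIdx.y][r][c] = src[threadIdx.x];
    }
    __syncthreads();  // reached by every thread, active or not
    if (active) {
        float* dst = (float*)((char*)out + (size_t)m * outPitch);
        dst[threadIdx.x] = tile[threadIdx.y][c][r];
    }
}

// Copies n matrices to the device with cudaMemcpy2D, runs the kernel,
// copies back, and returns true if every output is the transposed input.
bool runPitchedRoundTrip(int n)
{
    // Plain contiguous host storage: matrix m starts at offset m * ELEMS.
    std::vector<float> hIn((size_t)n * ELEMS), hOut(hIn.size(), 0.0f);
    for (size_t i = 0; i < hIn.size(); ++i) hIn[i] = (float)(i % 97);

    const size_t rowBytes = ELEMS * sizeof(float);  // host pitch == row width
    float *dIn = nullptr, *dOut = nullptr;
    size_t inPitch = 0, outPitch = 0;

    bool ok =
        cudaMallocPitch((void**)&dIn, &inPitch, rowBytes, n) == cudaSuccess &&
        cudaMallocPitch((void**)&dOut, &outPitch, rowBytes, n) == cudaSuccess &&
        // Arguments: dst, dst pitch, src, src pitch, width in bytes, height in rows.
        cudaMemcpy2D(dIn, inPitch, hIn.data(), rowBytes,
                     rowBytes, n, cudaMemcpyHostToDevice) == cudaSuccess;

    if (ok) {
        dim3 block(ELEMS, MATS_PER_BLOCK);
        dim3 grid((n + MATS_PER_BLOCK - 1) / MATS_PER_BLOCK);
        transposeViaShared<<<grid, block>>>(dIn, inPitch, dOut, outPitch, n);
        ok = cudaMemcpy2D(hOut.data(), rowBytes, dOut, outPitch,
                          rowBytes, n, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dIn);
    cudaFree(dOut);

    for (int m = 0; ok && m < n; ++m)
        for (int r = 0; r < DIM; ++r)
            for (int c = 0; c < DIM; ++c)
                if (hOut[(size_t)m * ELEMS + r * DIM + c] !=
                    hIn[(size_t)m * ELEMS + c * DIM + r])
                    ok = false;
    return ok;
}
```

Calling runPitchedRoundTrip(10000) from main exercises the full round trip. A common cause of wrong results with pitched memory is indexing the device buffer by the row width instead of the returned pitch, or passing a width in elements where cudaMemcpy2D expects bytes.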