matrix multiplication for large matrices

Can anyone tell me how to multiply a 200000 by 200000 matrix with another 200000 by 200000 matrix using shared memory and tiling? The examples given in the programming guide and in CUDA by Example do not support matrices whose size is more than 1024. Or is it not possible to use shared memory with tiling for such large matrices? Is it necessary to launch a grid of the same order as the resultant matrix? Thanks in advance :)

The matrix product you are asking about requires about 480 GB of memory in single precision and 960 GB in double precision (three 200000 by 200000 matrices: the two operands and the result). I would be much more worried about how to do this on a device with a maximum of 6 GB of RAM than about any of the intricacies of the CUDA implementation.
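For reference, the arithmetic behind those figures can be checked with a few lines of C++. This is only a sketch: the function names are made up for illustration, 1 GB is taken as 10^9 bytes, and the tile estimate assumes the full 6 GB is available for three square tiles, which a real device will not quite allow.

```cpp
#include <cmath>
#include <cstdint>

// Bytes needed to hold A, B and C for an n-by-n product C = A * B.
// elem_size is sizeof(float) (4) or sizeof(double) (8).
constexpr std::uint64_t gemm_footprint_bytes(std::uint64_t n, std::uint64_t elem_size) {
    return 3 * n * n * elem_size;  // three n-by-n matrices
}

// Largest t such that three t-by-t tiles fit in device_bytes.
// Assumption: all of device_bytes is usable, which overestimates real hardware.
std::uint64_t max_square_tile(std::uint64_t device_bytes, std::uint64_t elem_size) {
    std::uint64_t t = static_cast<std::uint64_t>(
        std::sqrt(static_cast<double>(device_bytes / (3 * elem_size))));
    // Correct any floating-point rounding in either direction.
    while (3 * (t + 1) * (t + 1) * elem_size <= device_bytes) ++t;
    while (t > 0 && 3 * t * t * elem_size > device_bytes) --t;
    return t;
}
```

Calling `gemm_footprint_bytes(200000, 4)` gives 480000000000 bytes, and `max_square_tile(6000000000ULL, 4)` gives 22360, so under these assumptions each dimension of the problem would have to be cut into at least 9 tiles.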

Matrix multiplication is a blocked algorithm, is it not? So you can use streaming, although you have to stream chunks from the hard drive to main memory and then to the card.
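The blocked structure might look roughly like the following host-side C++ sketch. It is an illustration under stated assumptions, not a working out-of-core implementation: the matrices here sit in ordinary host vectors in row-major order, and `tile_gemm` is a hypothetical plain-CPU stand-in for the step where a real version would load the two tiles from disk, copy them to the card, and run the GPU gemm there.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Stand-in for the per-tile GPU gemm: C_tile += A_tile * B_tile.
// All matrices are row-major with leading dimension n.
// (i0, j0, k0) are the tile origins; (mi, mj, mk) are the tile extents.
static void tile_gemm(const float* A, const float* B, float* C, std::size_t n,
                      std::size_t i0, std::size_t j0, std::size_t k0,
                      std::size_t mi, std::size_t mj, std::size_t mk) {
    for (std::size_t i = i0; i < i0 + mi; ++i) {
        for (std::size_t k = k0; k < k0 + mk; ++k) {
            const float a = A[i * n + k];
            for (std::size_t j = j0; j < j0 + mj; ++j) {
                C[i * n + j] += a * B[k * n + j];
            }
        }
    }
}

// Blocked product C = A * B for n-by-n matrices, one tile triple at a time.
// Each (i0, j0) pair owns one output tile; the k0 loop accumulates into it,
// so only three tiles need to be resident at any moment.
std::vector<float> blocked_gemm(const std::vector<float>& A,
                                const std::vector<float>& B,
                                std::size_t n, std::size_t tile) {
    std::vector<float> C(n * n, 0.0f);
    for (std::size_t i0 = 0; i0 < n; i0 += tile) {
        const std::size_t mi = std::min(tile, n - i0);  // edge tiles may be smaller
        for (std::size_t j0 = 0; j0 < n; j0 += tile) {
            const std::size_t mj = std::min(tile, n - j0);
            for (std::size_t k0 = 0; k0 < n; k0 += tile) {
                const std::size_t mk = std::min(tile, n - k0);
                // A real version would stream A(i0,k0) and B(k0,j0) in here.
                tile_gemm(A.data(), B.data(), C.data(), n, i0, j0, k0, mi, mj, mk);
            }
        }
    }
    return C;
}
```

Keeping the k0 loop innermost means each output tile is finished before moving on, so it would only have to be written back once; the price is that every tile of A and B gets read several times.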

Of course, but the mechanics of that sort of out-of-core gemm implementation completely dwarf the minutiae of what goes on in the GPU. At that size, it would be folly to use anything other than CUBLAS or MagmaBLAS for the GPU gemm kernel.