cublasXt auto chunking

One of the listed features of cublasXt is that matrix sizes are limited only by host memory, not GPU memory. I’m curious whether this also applies to the free version running on a single GPU. Say I have a single 6 GB Tesla and I try to multiply two 10 GB matrices (and the host has 64 GB of RAM): can I just throw them in, or do I need to pre-chunk because it’s the free version and/or a single GPU?

(I’d try it myself, but I don’t have access to that configuration with CUDA 6 at the moment; I’m trying to plan ahead and convince the cluster admins to upgrade from 5 sooner rather than later.)

Yes, you can just throw them in.
To get better perf, you should pin the matrices in host memory (using malloc + cudaHostRegister, or cudaHostAlloc).
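
A minimal sketch of that, assuming a single device 0, square n-by-n single-precision matrices, and with error checking omitted for brevity (the filename and sizes are made up; cublasXt tiles the operands itself, so n is bounded by host RAM, not GPU memory):

```c
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cublasXt.h>

int main(void)
{
    // Modest size for the demo; grow n until the HOST runs out of memory.
    size_t n = 4096;
    size_t bytes = n * n * sizeof(float);

    float *A = (float*)malloc(bytes);
    float *B = (float*)malloc(bytes);
    float *C = (float*)malloc(bytes);
    for (size_t i = 0; i < n * n; i++) { A[i] = 1.0f; B[i] = 2.0f; C[i] = 0.0f; }

    // Pin the malloc'd buffers so PCIe transfers can run asynchronously.
    cudaHostRegister(A, bytes, cudaHostRegisterDefault);
    cudaHostRegister(B, bytes, cudaHostRegisterDefault);
    cudaHostRegister(C, bytes, cudaHostRegisterDefault);

    cublasXtHandle_t handle;
    cublasXtCreate(&handle);

    int devices[1] = { 0 };                    // the one GPU in the box
    cublasXtDeviceSelect(handle, 1, devices);

    // Blocking call; cublasXt streams tiles of A and B through the GPU
    // and writes C back to host memory before returning.
    float alpha = 1.0f, beta = 0.0f;
    cublasXtSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                  n, n, n, &alpha, A, n, B, n, &beta, C, n);

    printf("C[0] = %f (expected %f)\n", C[0], 2.0f * n);

    cublasXtDestroy(handle);
    cudaHostUnregister(A); cudaHostUnregister(B); cudaHostUnregister(C);
    free(A); free(B); free(C);
    return 0;
}
```

Compile with something like nvcc -o xt_gemm xt_gemm.cu -lcublas (cublasXt ships inside the cuBLAS library).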

If you have a PCIe Gen3 system and a Kepler K20 or K40, a block dimension of 2K is enough to overlap computation with PCIe transfers. If you have PCIe Gen2, make the tile bigger.
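
The tile size is set through cublasXtSetBlockDim, after cublasXtCreate and before the gemm call. Continuing the sketch above (the 2K value is the Gen3 + K20/K40 suggestion; the larger Gen2 value is a starting point you’d want to tune):

```c
// Tile edge length in elements. Larger tiles do more work per PCIe
// transfer, which helps hide a slower Gen2 link, at the cost of more
// GPU memory held per in-flight tile.
cublasXtSetBlockDim(handle, 2048);      // "2K" tiles for PCIe Gen3 + K20/K40
// cublasXtSetBlockDim(handle, 4096);   // try larger tiles on PCIe Gen2
```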