Need help using streams to operate on a 2D array

Hi, I am currently transferring a 5000*5000 2D array to the device and running a kernel on it, but I found that the memory transfer takes at least twice as long as the kernel execution. I’ve tried both pageable and page-locked host memory for the transfer, but it doesn’t make much difference.
I’d like to use streams to overlap portions of the operation on the 2D array, but so far I haven’t had much luck with it. The sample code in the programming manual and the SDK project doesn’t seem to explain it well.
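
To make the question concrete, here is a minimal sketch of the kind of row-chunked overlap I have in mind. It rests on several assumptions: the array is stored as one contiguous row-major float buffer, the work on each element is independent of the others, and the device has a copy engine that can run alongside kernels. The kernel `processChunk`, the stream count of 4, and all variable names are placeholders for illustration, not my real code.

```cuda
#include <algorithm>
#include <cstdio>
#include <cuda_runtime.h>

#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t e = (call);                                      \
        if (e != cudaSuccess) {                                      \
            fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(e)); \
            return 1;                                                \
        }                                                            \
    } while (0)

// Placeholder kernel (hypothetical): adds 1 to every element of a chunk.
__global__ void processChunk(float *data, size_t n) {
    size_t i = blockIdx.x * static_cast<size_t>(blockDim.x) + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main() {
    const int rows = 5000, cols = 5000;
    const int nStreams = 4;  // assumption: a value to tune, not a recommendation
    const int rowsPerChunk = (rows + nStreams - 1) / nStreams;
    const size_t total = static_cast<size_t>(rows) * cols;

    float *h = nullptr, *d = nullptr;
    // Page-locked host memory: async copies need it to overlap with kernels.
    CHECK(cudaMallocHost(reinterpret_cast<void **>(&h), total * sizeof(float)));
    CHECK(cudaMalloc(reinterpret_cast<void **>(&d), total * sizeof(float)));
    for (size_t i = 0; i < total; ++i) h[i] = 0.0f;

    cudaStream_t streams[nStreams];
    for (int s = 0; s < nStreams; ++s) CHECK(cudaStreamCreate(&streams[s]));

    // Each stream handles a contiguous block of rows: copy in, compute, copy out.
    for (int s = 0; s < nStreams; ++s) {
        const int firstRow = s * rowsPerChunk;
        const int nRows = std::min(rowsPerChunk, rows - firstRow);
        if (nRows <= 0) continue;
        const size_t offset = static_cast<size_t>(firstRow) * cols;
        const size_t n = static_cast<size_t>(nRows) * cols;
        const unsigned threads = 256;
        const unsigned blocks = static_cast<unsigned>((n + threads - 1) / threads);

        CHECK(cudaMemcpyAsync(d + offset, h + offset, n * sizeof(float),
                              cudaMemcpyHostToDevice, streams[s]));
        processChunk<<<blocks, threads, 0, streams[s]>>>(d + offset, n);
        CHECK(cudaGetLastError());
        CHECK(cudaMemcpyAsync(h + offset, d + offset, n * sizeof(float),
                              cudaMemcpyDeviceToHost, streams[s]));
    }

    CHECK(cudaDeviceSynchronize());  // wait for all streams to finish

    // Spot-check the first and last elements (each should now be 1.0f).
    const bool ok = (h[0] == 1.0f) && (h[total - 1] == 1.0f);
    printf("%s\n", ok ? "results look right" : "results mismatch");

    for (int s = 0; s < nStreams; ++s) CHECK(cudaStreamDestroy(streams[s]));
    CHECK(cudaFree(d));
    CHECK(cudaFreeHost(h));
    return ok ? 0 : 1;
}
```

Within one stream the copy, kernel, and copy-back run in order, so the hope is that the copy for one chunk overlaps with the kernel for another. If the kernel needs neighbouring rows (a stencil, for example), the chunks would need overlapping halo rows, which this sketch does not handle.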
I’d really appreciate it if someone would give me some pointers on this. Thanks.