Limit of memory per thread? Can't find a solution in the Programming Guide

I’m writing a merge sorting network in CUDA. It starts by merging chunks of size 16 (number of threads = Tab_size/16), then 32, and so on: each iteration halves the number of threads and doubles the number of elements processed by each thread. Everything runs smoothly until each thread gets 2^17 integers. Then I get a video driver crash and an exception in the VS2008 console. All threads operate on a single table stored in global memory. cudaMalloc allocates it without complaint. So, is there any way for threads to operate on large arrays?
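
To make the schedule described above concrete, here is a minimal host-side sketch in C++ that lists, for each pass, how many integers one thread merges and how many threads are launched. The names `MergePass` and `merge_schedule` are hypothetical, and the code assumes the table size is a power of two no smaller than the first chunk size, which the post does not state.

```cpp
#include <cstddef>
#include <vector>

// One pass of the merge schedule: chunk size per thread and thread count.
struct MergePass {
    std::size_t elems_per_thread;  // integers merged by a single thread
    std::size_t num_threads;       // threads launched for this pass
};

// Builds the list of passes: chunks start at first_chunk and double each
// pass, so the thread count (table_size / chunk) halves each pass.
// Assumption: table_size is a power of two and >= first_chunk.
std::vector<MergePass> merge_schedule(std::size_t table_size,
                                      std::size_t first_chunk = 16) {
    std::vector<MergePass> passes;
    for (std::size_t chunk = first_chunk; chunk <= table_size; chunk *= 2) {
        passes.push_back({chunk, table_size / chunk});
    }
    return passes;
}
```

For an assumed table of 2^20 integers, `merge_schedule(1u << 20)` gives 17 passes, and the pass where each thread handles 2^17 integers runs only 8 threads, each doing a long serial merge.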

Have you run your kernel under cuda-memcheck (the CUDA memory checker)?