Actually, what I did was reduce the number of variables in the device functions (which are called two levels down). That seems to do it, but there is no room left. The funny thing is that if you split the for loop into two for loops (each half the size), the machine hangs too, and if the n in the for loop is greater than 1, which just adds one more call to the code, it hangs as well. I used shared memory for speed, but that stops me from splitting the work into multiple kernels. I suspect caching the results in global memory is just going to kill my speed. The compute times are very small (milliseconds), so I don't think I came anywhere close to the 5 second limit. Adding a second GeForce card for display does not seem easy either: I think I would have to remove the 8800, install the new card, and then start from scratch.
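To make the multiple-kernel trade-off concrete, here is a minimal sketch of what the split would roughly look like. It is not my actual code: the kernel names `stage1` and `stage2`, the buffer `d_tmp`, the sizes, and the arithmetic are all made-up placeholders. The point is only that shared memory does not survive past the end of a kernel launch, so anything the second kernel needs has to be written to global memory by the first.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical first half of the work. The intermediate result goes to
// global memory (tmp) because shared memory is gone once the kernel exits.
__global__ void stage1(const float *in, float *tmp, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        tmp[i] = in[i] * 2.0f;   // placeholder computation
}

// Hypothetical second half. It reads the intermediate result back from
// global memory, which is the extra traffic I am worried about.
__global__ void stage2(const float *tmp, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tmp[i] + 1.0f;  // placeholder computation
}

int main()
{
    const int n = 1024;                    // assumed problem size
    const size_t bytes = n * sizeof(float);
    float h_in[n], h_out[n];
    for (int i = 0; i < n; ++i)
        h_in[i] = (float)i;

    float *d_in = 0, *d_tmp = 0, *d_out = 0;
    cudaMalloc((void **)&d_in, bytes);
    cudaMalloc((void **)&d_tmp, bytes);    // global buffer for the intermediate
    cudaMalloc((void **)&d_out, bytes);
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    stage1<<<blocks, threads>>>(d_in, d_tmp, n);
    stage2<<<blocks, threads>>>(d_tmp, d_out, n);  // same stream, runs after stage1

    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    printf("out[3] = %f\n", h_out[3]);

    cudaFree(d_in);
    cudaFree(d_tmp);
    cudaFree(d_out);
    return 0;
}
```

Each launch is shorter this way, but every intermediate value makes a round trip through global memory, so whether it is a net win would have to be measured.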
Thanks for the help.