I have a Quadro FX 3800 card and am working with a very large data set.
In my application, I use a block size of (4,4,32), i.e. 512 threads, which is the maximum number of threads per block on this card. The resulting grid size is (16,16,1).
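To make the configuration concrete, here is a minimal sketch of how the block shape can be checked against the limits the device reports. The helper names `threadsPerBlock` and `blockFits` are made up for this post, and using device 0 is an assumption; it is not my actual application code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Number of threads in one block for a given block shape.
unsigned int threadsPerBlock(dim3 b) { return b.x * b.y * b.z; }

// True if the block shape respects both the per-dimension limits and the
// total threads-per-block limit reported by the device.
bool blockFits(dim3 b, const cudaDeviceProp& p) {
    return b.x <= (unsigned int)p.maxThreadsDim[0] &&
           b.y <= (unsigned int)p.maxThreadsDim[1] &&
           b.z <= (unsigned int)p.maxThreadsDim[2] &&
           threadsPerBlock(b) <= (unsigned int)p.maxThreadsPerBlock;
}

int main() {
    cudaDeviceProp prop;
    cudaError_t err = cudaGetDeviceProperties(&prop, 0);  // device 0 is an assumption
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties: %s\n", cudaGetErrorString(err));
        return 1;
    }

    dim3 block(4, 4, 32);  // block size from this post
    dim3 grid(16, 16, 1);  // grid size from this post

    printf("threads per block: %u (device limit %d)\n",
           threadsPerBlock(block), prop.maxThreadsPerBlock);
    printf("total threads: %u\n",
           threadsPerBlock(block) * grid.x * grid.y * grid.z);
    printf("block fits: %s\n", blockFits(block, prop) ? "yes" : "no");
    return 0;
}
```

With this configuration the launch covers 512 x 256 = 131072 threads in total, so any array indexed by a flattened thread index needs at least that many elements or a bounds check in the kernel.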
I allocated two arrays on the device, which occupy 3 MB of global memory.
In the caller function, I pass 7 structures to the kernel; each structure is 12 bytes. I also pass the two big arrays as parameters.
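Roughly, the call looks like the sketch below. Everything named here (`Params`, `process`, the `CHECK` macro, the element count `n`, and the arithmetic in the kernel) is a hypothetical stand-in rather than my real code. The sketch also checks for errors after the launch and guards the array index, since a silently failed launch or an out-of-range index could both produce partly correct, partly garbage output. The seven structures add up to 84 bytes, which together with the pointers stays below the 256-byte kernel argument limit of compute capability 1.x devices.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical 12-byte parameter structure (three 4-byte floats).
struct Params { float a, b, c; };

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t e = (call);                                       \
        if (e != cudaSuccess) {                                       \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,        \
                    cudaGetErrorString(e));                           \
            exit(1);                                                  \
        }                                                             \
    } while (0)

// Hypothetical kernel: seven structures passed by value, two arrays passed
// as device pointers, plus the element count for the bounds check.
__global__ void process(Params p0, Params p1, Params p2, Params p3,
                        Params p4, Params p5, Params p6,
                        const float* in, float* out, int n) {
    // Flatten the 3D thread index and the 2D block index into one index.
    int threadInBlock = threadIdx.x +
                        blockDim.x * (threadIdx.y + blockDim.y * threadIdx.z);
    int blockInGrid = blockIdx.x + gridDim.x * blockIdx.y;
    int i = blockInGrid * (blockDim.x * blockDim.y * blockDim.z) + threadInBlock;
    if (i < n)  // threads past the end of the arrays do nothing
        out[i] = in[i] * p0.a + p6.c;
}

int main() {
    const int n = 100000;  // hypothetical size, smaller than the 131072 threads launched
    const size_t bytes = n * sizeof(float);

    float* hIn = (float*)malloc(bytes);
    float* hOut = (float*)malloc(bytes);
    for (int i = 0; i < n; ++i) hIn[i] = (float)i;

    float *dIn = NULL, *dOut = NULL;
    CHECK(cudaMalloc((void**)&dIn, bytes));
    CHECK(cudaMalloc((void**)&dOut, bytes));
    CHECK(cudaMemcpy(dIn, hIn, bytes, cudaMemcpyHostToDevice));

    Params p = {2.0f, 0.0f, 1.0f};
    dim3 block(4, 4, 32), grid(16, 16, 1);
    process<<<grid, block>>>(p, p, p, p, p, p, p, dIn, dOut, n);
    CHECK(cudaGetLastError());       // reports launch failures
    CHECK(cudaDeviceSynchronize());  // reports errors raised while the kernel ran

    CHECK(cudaMemcpy(hOut, dOut, bytes, cudaMemcpyDeviceToHost));
    int mismatches = 0;
    for (int i = 0; i < n; ++i)
        if (hOut[i] != hIn[i] * 2.0f + 1.0f) ++mismatches;
    printf("mismatches: %d\n", mismatches);

    CHECK(cudaFree(dIn));
    CHECK(cudaFree(dOut));
    free(hIn);
    free(hOut);
    return 0;
}
```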
When I run the application, I get a small portion of correct results and a lot of garbage values.
I am guessing that I may be using too much memory on the device, but I haven’t been able to confirm it.
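One way to test that guess is to ask the runtime how much global memory is free before allocating. This is only a sketch: `fitsInMemory` is a made-up helper, and the 3 MB figure is the array size mentioned above.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// True if a request of the given size is no larger than the free memory.
bool fitsInMemory(size_t requested, size_t freeBytes) {
    return requested <= freeBytes;
}

int main() {
    size_t freeBytes = 0, totalBytes = 0;
    cudaError_t err = cudaMemGetInfo(&freeBytes, &totalBytes);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemGetInfo: %s\n", cudaGetErrorString(err));
        return 1;
    }

    const size_t requested = 3u * 1024u * 1024u;  // 3 MB, as in this post
    printf("free: %zu bytes, total: %zu bytes, requested: %zu bytes\n",
           freeBytes, totalBytes, requested);
    printf("request fits: %s\n",
           fitsInMemory(requested, freeBytes) ? "yes" : "no");
    return 0;
}
```

If the request fits with room to spare and every `cudaMalloc` call returns `cudaSuccess`, running out of global memory is probably not the cause, and the indexing inside the kernel would be the next thing to check.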
Does anyone have any idea?