Check the error returned from the kernel; it should give you more information about why the kernel fails.
Probably the configuration is wrong.
What are the values for w, h and the threads/blocks you use?
The grid size limit of 65535 is the number of blocks (in each of the x and y dimensions), not the number of threads. Total threads could conceivably be as large as 65535 * 65535 * 512, provided nothing else prevents it.
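To make that arithmetic concrete, here is a small host-side sketch. The limits are just the figures quoted in this thread, hard-coded rather than queried from the device, and `max_total_threads` is a hypothetical helper name, not part of any CUDA API.

```cpp
#include <cassert>
#include <cstdint>

// Limits as quoted in this thread (assumed, not queried from the device).
const std::uint64_t kMaxBlocksPerGridDim = 65535;  // per grid dimension, x and y
const std::uint64_t kMaxThreadsPerBlock  = 512;

// Hypothetical helper: total threads launched by a 2D grid of blocks.
// 64-bit arithmetic is used because the product overflows 32 bits.
std::uint64_t max_total_threads(std::uint64_t blocks_x,
                                std::uint64_t blocks_y,
                                std::uint64_t threads_per_block) {
    assert(blocks_x <= kMaxBlocksPerGridDim);
    assert(blocks_y <= kMaxBlocksPerGridDim);
    assert(threads_per_block <= kMaxThreadsPerBlock);
    return blocks_x * blocks_y * threads_per_block;
}
```

With the maximum values, max_total_threads(65535, 65535, 512) comes to 2,198,956,147,200 threads, so the block-count limit is rarely what stops a launch.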
Manjunath, you are very likely using too many registers. To fix this, run fewer threads per block.
Hi,
Instead of guessing, you should compile with the ptxas option (as specified above), e.g. --ptxas-options=-v, and you'll get exactly how many registers and how much shared memory you use in each kernel.
On a compute capability 1.1 device, there are a total of 8192 registers available per multiprocessor. With 512 threads per block, each thread may use no more than 8192 / 512 = 16 registers. You are using 28.
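The register budget can be worked out in both directions with a rough host-side sketch. The helper names are hypothetical, and the 8192-register and 28-register figures are simply the ones from this thread. The result is only an upper bound: the hardware allocates registers in coarser units than this simple division assumes, so the block size that actually launches may be lower.

```cpp
#include <cassert>

// Hypothetical helper: register budget per thread for a given block size.
int max_registers_per_thread(int registers_per_sm, int threads_per_block) {
    assert(registers_per_sm > 0 && threads_per_block > 0);
    return registers_per_sm / threads_per_block;
}

// Hypothetical helper: upper bound on threads per block for a kernel that
// uses regs_per_thread registers, rounded down to a whole number of warps.
// Ignores the hardware's register allocation granularity (an assumption).
int max_threads_per_block(int registers_per_sm, int regs_per_thread) {
    assert(registers_per_sm > 0 && regs_per_thread > 0);
    const int warp_size = 32;
    const int raw = registers_per_sm / regs_per_thread;  // 8192 / 28 = 292
    return (raw / warp_size) * warp_size;                // 292 -> 288
}
```

For this case, max_registers_per_thread(8192, 512) gives 16 and max_threads_per_block(8192, 28) gives 288. Because of the allocation granularity, a smaller block size such as 256 threads is probably a safer first try than 288.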