Why "an illegal memory access was encountered" when the number of blocks increases

The situation is:
When I run 256 tasks in parallel with main<<<256,1>>>(), the program runs normally.
But when I run 512 tasks with main<<<512,1>>>(), the program fails with the following error:

GPU assert: an illegal memory access was encountered main_array.cu 994

There is a bug in your code. Very likely some sort of incorrect addressing arithmetic, causing an access to memory outside the data objects allocated. I am confident you can find it.
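As a purely hypothetical illustration of that kind of addressing bug (the original code was not posted, so `process`, `data`, `num_elems`, and `run_guarded` are all assumed names): if a buffer is sized for 256 tasks and indexed by `blockIdx.x`, a 512-block launch touches elements 256 to 511, which were never allocated. Such a bug stays hidden at smaller launch sizes simply because every index still falls inside the allocation. Passing the element count into the kernel and guarding on it is one common way to rule this out:

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: one block per task, each task writes one element.
// Without the guard, block b would write data[b] even when b >= num_elems.
__global__ void process(int *data, int num_elems)
{
    int task = blockIdx.x;
    if (task >= num_elems) {
        return;  // this task has no element in the allocation
    }
    data[task] = task;
}

// Launches num_blocks tasks over a buffer of num_elems ints and returns
// the CUDA status observed once the kernel has finished.
cudaError_t run_guarded(int num_blocks, int num_elems)
{
    int *data = nullptr;
    cudaError_t err = cudaMalloc(&data, num_elems * sizeof(int));
    if (err != cudaSuccess) {
        return err;
    }

    process<<<num_blocks, 1>>>(data, num_elems);
    err = cudaGetLastError();           // errors detected at launch time
    if (err == cudaSuccess) {
        err = cudaDeviceSynchronize();  // errors raised while the kernel ran
    }
    cudaFree(data);
    return err;
}
```

With the guard in place, `run_guarded(512, 256)` is expected to return `cudaSuccess`, because the extra 256 blocks exit before touching memory.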

This may be of interest.

But I cannot find it! And why does the code work when the number of blocks is less than 512? Shouldn't it have failed from the start?

You have a giant advantage compared to us random strangers on the internet: You have the code to look at and experiment with. So I would suggest you spend some quality time debugging your code. Spending time debugging is an excellent way of becoming more proficient at it.

Run the code under control of compute-sanitizer and fix all items it complains about. Simplify the code in steps until you have a minimal code that reproduces the issue. Instrument the code to check that pointers and array indices do not exceed memory allocations. This might also be a good opportunity to become acquainted with the CUDA debugger, and its code stepping and data inspection capabilities.
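The instrumentation step could look something like the sketch below. It is not the poster's code: `OobReport`, `checked_index`, `work`, and `run_instrumented` are hypothetical names, and `blockIdx.x` stands in for whatever index arithmetic the real kernel performs. Every array access goes through a helper that records the first out-of-range index and skips the access instead of faulting. For the sanitizer step, building with `nvcc -lineinfo` and running `compute-sanitizer ./your_app` (the binary name is a placeholder) should let the default memcheck tool attribute each invalid access to a source line.

```cuda
#include <cuda_runtime.h>

// Hypothetical diagnostics record filled in by the device code.
struct OobReport {
    int count;        // number of out-of-range accesses seen
    int first_index;  // index used by the first offender (-1 if none)
    int first_block;  // blockIdx.x of the first offender (-1 if none)
};

// Returns true if idx lies in [0, n); otherwise records it in the report.
__device__ bool checked_index(int idx, int n, OobReport *report)
{
    if (idx >= 0 && idx < n) {
        return true;
    }
    // Only the first thread to increment the counter stores the details.
    if (atomicAdd(&report->count, 1) == 0) {
        report->first_index = idx;
        report->first_block = blockIdx.x;
    }
    return false;
}

__global__ void work(float *buf, int n, OobReport *report)
{
    int idx = blockIdx.x;  // stand-in for the real index arithmetic
    if (checked_index(idx, n, report)) {
        buf[idx] = 1.0f;
    }
}

// Runs num_blocks tasks over n elements and returns the collected report.
// CUDA return codes are not checked here, to keep the sketch short.
OobReport run_instrumented(int num_blocks, int n)
{
    float *buf = nullptr;
    OobReport *d_report = nullptr;
    OobReport h_report = {0, -1, -1};
    cudaMalloc(&buf, n * sizeof(float));
    cudaMalloc(&d_report, sizeof(OobReport));
    cudaMemcpy(d_report, &h_report, sizeof(OobReport), cudaMemcpyHostToDevice);

    work<<<num_blocks, 1>>>(buf, n, d_report);
    cudaDeviceSynchronize();

    cudaMemcpy(&h_report, d_report, sizeof(OobReport), cudaMemcpyDeviceToHost);
    cudaFree(buf);
    cudaFree(d_report);
    return h_report;
}
```

Calling `run_instrumented(512, 256)` and inspecting `first_index` and `first_block` points at the block whose index arithmetic leaves the allocation, which is usually a quicker lead than the line number in the error message, since that line only marks where the error was noticed.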

Were you able to solve this?
I have run into the same problem.