The case is:
When I run 256 tasks in parallel with main<<<256,1>>>(), the program runs normally.
But when I run 512 tasks with main<<<512,1>>>(), the program fails with the following error:
GPU assert: an illegal memory access was encountered main_array.cu 994
There is a bug in your code. Very likely some sort of incorrect addressing arithmetic, causing an access to memory outside the data objects allocated. I am confident you can find it.
You have a giant advantage compared to us random strangers on the internet: You have the code to look at and experiment with. So I would suggest you spend some quality time debugging your code. Spending time debugging is an excellent way of becoming more proficient at it.
Run the code under control of compute-sanitizer and fix all items it complains about. Simplify the code in steps until you have a minimal code that reproduces the issue. Instrument the code to check that pointers and array indices do not exceed memory allocations. This might also be a good opportunity to become acquainted with the CUDA debugger, and its code stepping and data inspection capabilities.
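As a rough illustration of the instrumentation idea, here is a minimal sketch. It is not your code: the kernel `guarded_write`, the helper `count_out_of_bounds`, the macro `CHECK_CUDA`, and the buffer capacity of 256 elements are all made up for the example. It assumes the failure comes from a buffer sized for 256 tasks being indexed by 512 blocks, which may or may not be what happens in main_array.cu. Instead of writing out of bounds, each thread checks its index against the allocation size and counts the violations.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with file and line if a CUDA runtime call fails.
#define CHECK_CUDA(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Hypothetical kernel: one thread per block, as in a <<<N,1>>> launch.
// The index is checked against the allocation size before any access.
__global__ void guarded_write(int *data, int capacity, int *oob_count)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= capacity) {
        atomicAdd(oob_count, 1);  // record the violation instead of writing
        return;
    }
    data[idx] = idx;
}

// Launches num_blocks blocks of one thread each against a buffer of
// `capacity` ints and returns how many threads had an out-of-range index.
int count_out_of_bounds(int num_blocks, int capacity)
{
    int *d_data = nullptr;
    int *d_oob = nullptr;
    int h_oob = 0;

    CHECK_CUDA(cudaMalloc(&d_data, capacity * sizeof(int)));
    CHECK_CUDA(cudaMalloc(&d_oob, sizeof(int)));
    CHECK_CUDA(cudaMemset(d_oob, 0, sizeof(int)));

    guarded_write<<<num_blocks, 1>>>(d_data, capacity, d_oob);
    CHECK_CUDA(cudaGetLastError());        // launch configuration errors
    CHECK_CUDA(cudaDeviceSynchronize());   // errors raised during execution

    CHECK_CUDA(cudaMemcpy(&h_oob, d_oob, sizeof(int), cudaMemcpyDeviceToHost));
    CHECK_CUDA(cudaFree(d_data));
    CHECK_CUDA(cudaFree(d_oob));
    return h_oob;
}

int main()
{
    // Assumed scenario: the buffer holds 256 elements in both runs.
    printf("256 blocks: %d out-of-bounds indices\n", count_out_of_bounds(256, 256));
    printf("512 blocks: %d out-of-bounds indices\n", count_out_of_bounds(512, 256));
    return 0;
}
```

If a check like this reports a nonzero count for the 512-block launch only, the allocation size or the index arithmetic is the place to look. To have compute-sanitizer attribute an invalid access to a source line, build with nvcc -lineinfo and run the executable as compute-sanitizer ./your_app (the executable name here is a placeholder).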