Program crashing when kernel blocksize is increased

Hi,
I was trying to improve the performance of my program using Nsight Compute. For one particular kernel, it recommends increasing the block size in multiples of 32, to between 128 and 256. When I run the program with a block size of 32, it runs without issues, but when I increase the block size to 256, the program crashes after a few iterations. How can I resolve this issue?

Thanks,

This is the program I am running.

I removed the break at the end of the program.

The kernel I was trying to improve is find_ba_max_pd in do_abstract_all in

When I launch it with a block size of 32:
find_ba_max_pd[math.ceil(len(nz_ba_pre_hor)/32),32](nz_ba_pre_hor_d,ba_size_pre_hor_d,bound_data_ordered_d,ba_max_pd_pre_d,shape_d)
it doesn’t crash.

But when I launch it with a block size of 256:
find_ba_max_pd[math.ceil(len(nz_ba_pre_hor)/256),256](nz_ba_pre_hor_d,ba_size_pre_hor_d,bound_data_ordered_d,ba_max_pd_pre_d,shape_d)
it crashes.
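
To make the launch arithmetic in the two calls above concrete, here is a minimal sketch. `launch_config` is a hypothetical helper written only for illustration; it is not part of my program, and the element count of 100 is made up. It shows that rounding the grid up with `math.ceil` can launch more threads than there are elements, and that the surplus can be larger with a block size of 256. Whether those extra threads are actually what causes the crash is an assumption I have not verified.

```python
import math


def launch_config(n_elements, threads_per_block):
    """Return (blocks, surplus_threads) for a 1D launch that covers n_elements.

    Hypothetical helper: mirrors the math.ceil(len(x) / blocksize) pattern
    used in the launches above.
    """
    blocks = math.ceil(n_elements / threads_per_block)
    total_threads = blocks * threads_per_block
    # Threads whose global index is >= n_elements have no element to work on.
    surplus_threads = total_threads - n_elements
    return blocks, surplus_threads


# Made-up element count, only to show the difference between the two block sizes.
n = 100
for block_size in (32, 256):
    blocks, surplus = launch_config(n, block_size)
    print(f"blocksize={block_size}: blocks={blocks}, surplus threads={surplus}")

# If the kernel has no index guard, the surplus threads would read or write
# past the end of the arrays. In a Numba kernel the guard usually looks like:
#     i = cuda.grid(1)
#     if i < n:
#         ...  # work on element i
```

With these made-up numbers, a block size of 32 gives 4 blocks and 28 surplus threads, while a block size of 256 gives 1 block and 156 surplus threads.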

Thanks,

Also, the crash happens on an Orin Nano but not on a Jetson Nano.

Thanks,