Sparse matrix manipulation

The kernel is taking long enough to hit the kernel timeout. That doesn't mean the code is broken. With a bit of Google searching you can find some suggestions. The Jetson Nano has its own special kernel timeout mechanism, and I'm not sure how to turn it off, but you may want to check the linked forum thread or ask about it on the Jetson Nano forum.

I have also verified that the kernel takes at least several seconds for a large enough array (10,000,000 elements or more) on a relatively slow GPU. In fact, on a Quadro K1000, the previous code I posted (10,000,000 elements) takes about 3 minutes to run. nvprof gives the following timings:

==6124== Profiling application: python t70.py
==6124== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   99.75%  172.220s         1  172.220s  172.220s  172.220s  cudapy::__main__::prefix_sum_merge_blocks$243(Array<int, int=1, A, mutable, aligned>, Array<int, int=1, A, mutable, aligned>, int)
                    0.12%  202.44ms         1  202.44ms  202.44ms  202.44ms  cudapy::__main__::prefix_sum_nzmask_block$241(Array<int, int=1, A, mutable, aligned>, Array<int, int=1, A, mutable, aligned>, Array<int, int=1, A, mutable, aligned>, int)
                    0.06%  110.47ms         3  36.822ms  1.8560us  110.46ms  [CUDA memcpy DtoH]
                    0.06%  106.88ms         2  53.439ms  47.109ms  59.769ms  [CUDA memcpy HtoD]
                    0.01%  12.712ms         1  12.712ms  12.712ms  12.712ms  cudapy::__main__::map_non_zeros$244(Array<int, int=1, A, mutable, aligned>, Array<int, int=1, A, mutable, aligned>, Array<int, int=1, A, mutable, aligned>, int)
      API calls:   99.90%  172.546s         3  57.5154s  30.074us  172.422s  cuMemcpyDtoH
                    0.06%  107.02ms         2  53.508ms  47.174ms  59.842ms  cuMemcpyHtoD
    ...

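So the heavy hitter is the prefix_sum_merge_blocks kernel, at 99.75% of GPU time (about 172 seconds). That time shows up under cuMemcpyDtoH in the API calls because the kernel launch is asynchronous and the copy has to wait for the kernel to finish. The kernel is probably ripe for optimization. Part of it is a second-stage prefix sum, but that stage is implemented in a naive fashion.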
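To illustrate what a less naive second stage usually looks like, here is a small CPU-only reference sketch of a three-phase blocked prefix sum. It is not the kernel from my earlier post; the function name blocked_inclusive_scan and the block_size parameter are made up for this example. The point is phase 2: the per-block totals get scanned once, so each block needs only a single offset, rather than each block re-adding the totals of all the blocks before it.

```python
from itertools import accumulate

def blocked_inclusive_scan(values, block_size):
    """CPU reference for a three-phase blocked inclusive prefix sum.

    Hypothetical helper for illustration only; block_size must be > 0.
    """
    # Phase 1: independent inclusive scan inside each block
    # (on a GPU, one thread block would handle each chunk).
    blocks = [list(accumulate(values[i:i + block_size]))
              for i in range(0, len(values), block_size)]

    # Phase 2: exclusive scan over the per-block totals. This is the
    # "second stage": a single pass over num_blocks values.
    totals = [b[-1] for b in blocks]
    offsets = [0] + list(accumulate(totals))[:-1]

    # Phase 3: add each block's offset to every element of that block.
    return [x + off for b, off in zip(blocks, offsets) for x in b]

# Usage: the blocked result should match a plain sequential scan.
data = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
assert blocked_inclusive_scan(data, 4) == list(accumulate(data))
```

On the GPU, phase 2 could be done with another scan kernel over the block totals (or even on the host, since there are far fewer block totals than elements), and phase 3 is a simple elementwise add. Whether this maps cleanly onto the existing kernels is something you would have to check against the actual code.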