Sparse matrix manipulation

Hi,
How can I reduce a 2D sparse matrix into a 1D array of just the non-zero items, in serial order, using numba CUDA?

From
[[0 0 1 0 0]
[0 2 0 0 0]
[0 0 0 0 3]]
to
[1 2 3]

Thanks,

hint: it is a reduction problem
hint2: it is a stream-compaction problem

Parallel stream compaction is described in various forum posts. It can be implemented with a parallel prefix sum over a non-zero mask, followed by an indexed copy.
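To illustrate the two steps on the host (just a sketch of the idea using NumPy, not the CUDA implementation; the kernels discussed below parallelize each step):

import numpy as np

a = np.array([0, 0, 1, 0, 0,
              0, 2, 0, 0, 0,
              0, 0, 0, 0, 3], dtype=np.int32)  # flattened input

mask = (a != 0).astype(np.int32)           # 1 where the input is non-zero
prefix = np.cumsum(mask)                   # inclusive prefix sum of the mask
out = np.empty(prefix[-1], dtype=np.int32)
out[prefix[a != 0] - 1] = a[a != 0]        # indexed copy: each non-zero lands at (prefix - 1)

print(out)   # [1 2 3]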

Thanks,

Is the code given in the linked forum post actually a working example, or does it require modification? I am getting errors. Is there a working example of the algorithm in numba CUDA elsewhere? I couldn't find one.

Thanks,

That was the closest I could find for numba cuda.

I ran a few small, simple tests on that code and it seemed to work correctly.

FWIW, CuPy should be interoperable with numba CUDA.
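For example (a sketch, assuming a reasonably recent CuPy install), CuPy can do the whole compaction on the device, and its arrays can also be passed directly to numba CUDA kernels via the CUDA array interface:

import numpy as np
import cupy as cp

arr = np.array([0, 0, 1, 0, 2, 0, 0, 3], dtype=np.int32)

a_d = cp.asarray(arr)      # host -> device, managed by CuPy
nz_d = a_d[a_d != 0]       # stream compaction done by CuPy on the GPU
print(cp.asnumpy(nz_d))    # [1 2 3]

# a_d could also be passed to a @cuda.jit kernel, since CuPy arrays
# implement __cuda_array_interface__.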

The following code should remove the hazards I mentioned in the comments to that numba cuda example:

import numpy as np
from numba import cuda, int32

BSP2 = 3
BLOCK_SIZE = 2**BSP2

#CUDA kernel to calculate prefix sum of each block of input array
@cuda.jit('void(int32[:], int32[:], int32[:], int32)')
def prefix_sum_nzmask_block(a, b, s, length):
    ab = cuda.shared.array(shape=(BLOCK_SIZE), dtype=int32)

    tid = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x;

    if tid < length:
        ab[cuda.threadIdx.x] = int32(a[tid] != 0); #Load mask of input data into shared memory

    for j in range(0,BSP2):
        i = 2**j
        cuda.syncthreads()
        if i <= cuda.threadIdx.x:
            temp = ab[cuda.threadIdx.x]
            temp += ab[cuda.threadIdx.x - i] #Perform scan on shared memory
        cuda.syncthreads()
        if i <= cuda.threadIdx.x:
            ab[cuda.threadIdx.x] = temp

    b[tid] = ab[cuda.threadIdx.x]; #Write scanned blocks to global memory

    if(cuda.threadIdx.x == cuda.blockDim.x-1):  #Last thread of block
        s[cuda.blockIdx.x] = ab[cuda.threadIdx.x]; #Write last element of shared memory into global memory

(the remainder of the code there is unchanged)

This somewhat larger test case seems to work correctly for me:

$ cat t70.py
import numpy as np
from numba import cuda, int32

BSP2 = 9
BLOCK_SIZE = 2**BSP2

#CUDA kernel to calculate prefix sum of each block of input array
@cuda.jit('void(int32[:], int32[:], int32[:], int32)')
def prefix_sum_nzmask_block(a, b, s, length):
    ab = cuda.shared.array(shape=(BLOCK_SIZE), dtype=int32)

    tid = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x;

    if tid < length:
        ab[cuda.threadIdx.x] = int32(a[tid] != 0); #Load mask of input data into shared memory

    for j in range(0,BSP2):
        i = 2**j
        cuda.syncthreads()
        if i <= cuda.threadIdx.x:
            temp = ab[cuda.threadIdx.x]
            temp += ab[cuda.threadIdx.x - i] #Perform scan on shared memory
        cuda.syncthreads()
        if i <= cuda.threadIdx.x:
            ab[cuda.threadIdx.x] = temp

    b[tid] = ab[cuda.threadIdx.x]; #Write scanned blocks to global memory

    if(cuda.threadIdx.x == cuda.blockDim.x-1):  #Last thread of block
        s[cuda.blockIdx.x] = ab[cuda.threadIdx.x]; #Write last element of shared memory into global memory



#CUDA kernel to merge the prefix sums of individual blocks
@cuda.jit('void(int32[:], int32[:], int32)')
def prefix_sum_merge_blocks(b, s, length):
    tid = (cuda.blockIdx.x + 1) * cuda.blockDim.x + cuda.threadIdx.x; #Skip first block

    if tid<length:
        i = 0
        while i<=cuda.blockIdx.x:
            b[tid] += s[i] #Accumulate last elements of all previous blocks
            i += 1


#CUDA kernel to copy non-zero entries to the correct index of the output array
@cuda.jit('void(int32[:], int32[:], int32[:], int32)')
def map_non_zeros(a, prefix_sum, nz, length):
    tid = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x;

    if tid < length:
        input_value = a[tid]
        if input_value != 0:
            index = prefix_sum[tid] #The correct output index is the value at current index of prefix sum array
            nz[index-1] = input_value



#Apply stream compaction algorithm to get only the non-zero entries from the input array
def get_non_zeros(a):
    length = a.shape[0]
    block = BLOCK_SIZE
    grid = int((length + block - 1)/block)

    #Create auxiliary array to hold the sum of each block
    block_sum = cuda.device_array(shape=(grid), dtype=np.int32)

    #Copy input array from host to device
    ad = cuda.to_device(a)

    #Create prefix sum output array
    bd = cuda.device_array_like(ad)

    #Perform partial scan of each block. Store block sum in auxiliary array named block_sum.
    prefix_sum_nzmask_block[grid, block](ad, bd, block_sum, length)

    #Add block sum to the output
    prefix_sum_merge_blocks[grid, block](bd,block_sum,length);

    #The last element of prefix sum contains the total number of non-zero elements
    non_zero_count = int(bd[bd.shape[0]-1])

    #Create device output array to hold ONLY the non-zero entries
    non_zeros = cuda.device_array(shape=(non_zero_count), dtype=np.int32)

    #Copy ONLY the non-zero entries
    map_non_zeros[grid, block](a, bd, non_zeros, length)

    #Return to host
    return non_zeros.copy_to_host()


if __name__ == '__main__':

    arr = np.zeros(10000000, dtype=np.int32)
    for i in range(32,65000,1024):
        arr[i] = i

    nz = get_non_zeros(arr)

    print(nz)
$ python t70.py
[   32  1056  2080  3104  4128  5152  6176  7200  8224  9248 10272 11296
 12320 13344 14368 15392 16416 17440 18464 19488 20512 21536 22560 23584
 24608 25632 26656 27680 28704 29728 30752 31776 32800 33824 34848 35872
 36896 37920 38944 39968 40992 42016 43040 44064 45088 46112 47136 48160
 49184 50208 51232 52256 53280 54304 55328 56352 57376 58400 59424 60448
 61472 62496 63520 64544]
$
Traceback (most recent call last):
  File "pynvrebel.py", line 2712, in <module>
    main()
  File "pynvrebel.py", line 122, in main
    nz=get_non_zeros(arr)
  File "pynvrebel.py", line 169, in get_non_zeros
    non_zero_count=int(bd[bd.shape[0]-1])
  File "/usr/lib/python3/dist-packages/numba/cuda/cudadrv/devicearray.py", line 332, in __getitem__
    return self._do_getitem(item)
  File "/usr/lib/python3/dist-packages/numba/cuda/cudadrv/devicearray.py", line 353, in _do_getitem
    stream=stream)
  File "/usr/lib/python3/dist-packages/numba/cuda/cudadrv/driver.py", line 1597, in device_to_host
    fn(host_pointer(dst), device_pointer(src), size, *varargs)
  File "/usr/lib/python3/dist-packages/numba/cuda/cudadrv/driver.py", line 288, in safe_cuda_api_call
    self._check_error(fname, retcode)
  File "/usr/lib/python3/dist-packages/numba/cuda/cudadrv/driver.py", line 323, in _check_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [702] Call to cuMemcpyDtoH results in CUDA_ERROR_LAUNCH_TIMEOUT

This is the error I get now.

The kernel is taking long enough to hit the kernel timeout; it doesn’t mean the code is broken. With a bit of Google searching you can find some suggestions. The Jetson Nano has a kernel timeout mechanism that is special (see the linked thread), and I’m not sure how to turn it off, but you may want to check that forum thread, or ask about it on the Jetson Nano forum.

I have also verified that the kernel takes at least several seconds for a large enough array (10,000,000 elements or more) on a relatively slow GPU. On a Quadro K1000, the previous code I posted (10,000,000 elements) takes about 3 minutes to run. nvprof gives the following timings:

==6124== Profiling application: python t70.py
==6124== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   99.75%  172.220s         1  172.220s  172.220s  172.220s  cudapy::__main__::prefix_sum_merge_blocks$243(Array<int, int=1, A, mutable, aligned>, Array<int, int=1, A, mutable, aligned>, int)
                    0.12%  202.44ms         1  202.44ms  202.44ms  202.44ms  cudapy::__main__::prefix_sum_nzmask_block$241(Array<int, int=1, A, mutable, aligned>, Array<int, int=1, A, mutable, aligned>, Array<int, int=1, A, mutable, aligned>, int)
                    0.06%  110.47ms         3  36.822ms  1.8560us  110.46ms  [CUDA memcpy DtoH]
                    0.06%  106.88ms         2  53.439ms  47.109ms  59.769ms  [CUDA memcpy HtoD]
                    0.01%  12.712ms         1  12.712ms  12.712ms  12.712ms  cudapy::__main__::map_non_zeros$244(Array<int, int=1, A, mutable, aligned>, Array<int, int=1, A, mutable, aligned>, Array<int, int=1, A, mutable, aligned>, int)
      API calls:   99.90%  172.546s         3  57.5154s  30.074us  172.422s  cuMemcpyDtoH
                    0.06%  107.02ms         2  53.508ms  47.174ms  59.842ms  cuMemcpyHtoD
    ...

So the heavy hitter is the prefix_sum_merge_blocks kernel; it is probably ripe for optimization. Part of that kernel is effectively a second-stage prefix sum, but it is implemented in a naive fashion.

I have seen pseudocode for a work-efficient prefix sum algorithm on the internet, but I am not sure how to port it to numba CUDA. In my case the input is a 1D sparse array of length around 3 million.

Here is an improved version.

  1. There was still a bug in the posted kernel (the write of the scanned block to global memory was not guarded by the length check, and the shared array was not initialized for out-of-range threads).
  2. I’ve modified the algorithm to call the prefix sum recursively, rather than doing the naive “fixup” kernel.

On my Quadro K2000, for 50,000,000 elements, the overall program executes in a few seconds, and the longest-duration kernel call is less than 300 ms:

$ cat t71.py
import numpy as np
from numba import cuda, int32

BSP2 = 9
BLOCK_SIZE = 2**BSP2

#CUDA kernel to calculate prefix sum of each block of input array
@cuda.jit('void(int32[:], int32[:], int32[:], int32, int32)')
def prefix_sum_nzmask_block(a, b, s, nzm, length):
    ab = cuda.shared.array(shape=(BLOCK_SIZE), dtype=int32)

    tid = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x;
    ab[cuda.threadIdx.x] = 0
    if tid < length:
        if nzm == 1:
            ab[cuda.threadIdx.x] = int32(a[tid] != 0); #Load mask of input data into shared memory
        else:
            ab[cuda.threadIdx.x] = int32(a[tid]); #Load input data into shared memory


    for j in range(0,BSP2):
        i = 2**j
        cuda.syncthreads()
        if i <= cuda.threadIdx.x:
            temp = ab[cuda.threadIdx.x]
            temp += ab[cuda.threadIdx.x - i] #Perform scan on shared memory
        cuda.syncthreads()
        if i <= cuda.threadIdx.x:
            ab[cuda.threadIdx.x] = temp
    if tid < length:
        b[tid] = ab[cuda.threadIdx.x]; #Write scanned blocks to global memory

    if(cuda.threadIdx.x == cuda.blockDim.x-1):  #Last thread of block
        s[cuda.blockIdx.x] = ab[cuda.threadIdx.x]; #Write last element of shared memory into global memory



#CUDA kernel to merge the prefix sums of individual blocks
@cuda.jit('void(int32[:], int32[:], int32)')
def pref_sum_update(b, s, length):
    tid = (cuda.blockIdx.x + 1) * cuda.blockDim.x + cuda.threadIdx.x; #Skip first block

    if tid<length:
        b[tid] += s[cuda.blockIdx.x] #Accumulate last elements of all previous blocks


#CUDA kernel to copy non-zero entries to the correct index of the output array
@cuda.jit('void(int32[:], int32[:], int32[:], int32)')
def map_non_zeros(a, prefix_sum, nz, length):
    tid = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x;

    if tid < length:
        input_value = a[tid]
        if input_value != 0:
            index = prefix_sum[tid] #The correct output index is the value at current index of prefix sum array
            nz[index-1] = input_value



#Apply stream compaction algorithm to get only the non-zero entries from the input array
def pref_sum(a, asum, nzm):
    block = BLOCK_SIZE
    length = a.shape[0]
    grid = int((length + block -1)/block)
    #Create auxiliary array to hold the sum of each block
    bs = cuda.device_array(shape=(grid), dtype=np.int32)

    #Perform partial scan of each block. Store block sum in auxiliary array named bs.
    prefix_sum_nzmask_block[grid, block](a, asum, bs, nzm, length)
    if grid > 1:
        bssum = cuda.device_array(shape=(grid), dtype=np.int32)
        pref_sum(bs, bssum, 0)
        pref_sum_update[grid-1, block](asum, bssum, length)

def get_non_zeros(a):
    #Copy input array from host to device
    ad = cuda.to_device(a)

    #Create prefix sum output array
    bd = cuda.device_array_like(ad)

    #Compute the full prefix sum of the non-zero mask of the input
    pref_sum(ad, bd, int(1))

    #The last element of prefix sum contains the total number of non-zero elements
    non_zero_count = int(bd[bd.shape[0]-1])
    #Create device output array to hold ONLY the non-zero entries
    non_zeros = cuda.device_array(shape=(non_zero_count), dtype=np.int32)

    #Copy ONLY the non-zero entries
    block = BLOCK_SIZE
    length = a.shape[0]
    grid = int((length + block -1)/block)
    map_non_zeros[grid, block](ad, bd, non_zeros, length)

    #Return to host
    return non_zeros.copy_to_host()


if __name__ == '__main__':

    arr = np.zeros(50000000, dtype=np.int32)
    for i in range(32,65000, 1024):
        arr[i] = i

    nz = get_non_zeros(arr)

    print(nz)
$ cuda-memcheck python t71.py
========= CUDA-MEMCHECK
[   32  1056  2080  3104  4128  5152  6176  7200  8224  9248 10272 11296
 12320 13344 14368 15392 16416 17440 18464 19488 20512 21536 22560 23584
 24608 25632 26656 27680 28704 29728 30752 31776 32800 33824 34848 35872
 36896 37920 38944 39968 40992 42016 43040 44064 45088 46112 47136 48160
 49184 50208 51232 52256 53280 54304 55328 56352 57376 58400 59424 60448
 61472 62496 63520 64544]
========= ERROR SUMMARY: 0 errors
$ nvprof python t71.py
==7940== NVPROF is profiling process 7940, command: python t71.py
[   32  1056  2080  3104  4128  5152  6176  7200  8224  9248 10272 11296
 12320 13344 14368 15392 16416 17440 18464 19488 20512 21536 22560 23584
 24608 25632 26656 27680 28704 29728 30752 31776 32800 33824 34848 35872
 36896 37920 38944 39968 40992 42016 43040 44064 45088 46112 47136 48160
 49184 50208 51232 52256 53280 54304 55328 56352 57376 58400 59424 60448
 61472 62496 63520 64544]
==7940== Profiling application: python t71.py
==7940== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   47.58%  225.56ms         3  75.187ms  10.304us  225.11ms  cudapy::__main__::prefix_sum_nzmask_block$241(Array<int, int=1, A, mutable, aligned>, Array<int, int=1, A, mutable, aligned>, Array<int, int=1, A, mutable, aligned>, int, int)
                   23.40%  110.92ms         3  36.975ms  1.8880us  110.92ms  [CUDA memcpy DtoH]
                   22.84%  108.26ms         2  54.131ms  48.212ms  60.049ms  [CUDA memcpy HtoD]
                    3.49%  16.543ms         2  8.2716ms  34.432us  16.509ms  cudapy::__main__::pref_sum_update$243(Array<int, int=1, A, mutable, aligned>, Array<int, int=1, A, mutable, aligned>, int)
                    2.69%  12.731ms         1  12.731ms  12.731ms  12.731ms  cudapy::__main__::map_non_zeros$244(Array<int, int=1, A, mutable, aligned>, Array<int, int=1, A, mutable, aligned>, Array<int, int=1, A, mutable, aligned>, int)
      API calls:   68.68%  364.85ms         3  121.62ms  30.500us  240.35ms  cuMemcpyDtoH
                   20.39%  108.35ms         2  54.173ms  48.223ms  60.123ms  cuMemcpyHtoD
                    9.83%  52.205ms         1  52.205ms  52.205ms  52.205ms  cuDevicePrimaryCtxRetain
                    0.34%  1.8180ms         3  605.99us  532.71us  745.07us  cuModuleLoadDataEx
                    0.27%  1.4492ms         9  161.03us  12.127us  486.82us  cuMemAlloc
                    0.17%  900.11us         8  112.51us  6.9130us  329.48us  cuMemFree
                    0.11%  601.80us         3  200.60us  60.972us  477.62us  cuLinkCreate
                    0.09%  467.57us         3  155.86us  127.86us  205.38us  cuLinkAddData
                    0.07%  394.45us         3  131.48us  120.79us  151.99us  cuLinkComplete
                    0.02%  120.57us         6  20.094us  14.765us  40.249us  cuLaunchKernel
                    0.01%  36.027us         1  36.027us  36.027us  36.027us  cuMemGetInfo
                    0.01%  28.882us         1  28.882us  28.882us  28.882us  cuDeviceGetName
                    0.00%  8.5660us        15     571ns     369ns  1.3050us  cuFuncGetAttribute
                    0.00%  5.4590us         3  1.8190us  1.6790us  1.9680us  cuModuleGetFunction
                    0.00%  3.7140us         3  1.2380us  1.1560us  1.3180us  cuLinkDestroy
                    0.00%  3.2680us         1  3.2680us  3.2680us  3.2680us  cuDeviceGetPCIBusId
                    0.00%  2.0330us         3     677ns     315ns     910ns  cuDeviceGetCount
                    0.00%  1.9580us         1  1.9580us  1.9580us  1.9580us  cuCtxPushCurrent
                    0.00%  1.1630us         2     581ns     571ns     592ns  cuDeviceGet
                    0.00%     964ns         3     321ns     262ns     426ns  cuDeviceGetAttribute
                    0.00%     868ns         1     868ns     868ns     868ns  cuDeviceComputeCapability
$

Thanks! Now it works quickly.

In my program, after processing a few image frames, it crashes with the following error. How can I fix it? I am using a Jetson Nano.

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/numba/cuda/cudadrv/driver.py", line 656, in _attempt_allocation
    allocator()
  File "/usr/lib/python3/dist-packages/numba/cuda/cudadrv/driver.py", line 671, in allocator
    driver.cuMemAlloc(byref(ptr), bytesize)
  File "/usr/lib/python3/dist-packages/numba/cuda/cudadrv/driver.py", line 288, in safe_cuda_api_call
    self._check_error(fname, retcode)
  File "/usr/lib/python3/dist-packages/numba/cuda/cudadrv/driver.py", line 323, in _check_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [2] Call to cuMemAlloc results in CUDA_ERROR_OUT_OF_MEMORY

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "pynvrebel.py", line 3042, in <module>
    main()
  File "pynvrebel.py", line 141, in main
    nz_ba_h=get_non_zeros(bound_abstract_h)
  File "pynvrebel.py", line 467, in get_non_zeros
    non_zeros=cuda.device_array(shape=(non_zero_count),dtype=np.int32)
  File "/usr/lib/python3/dist-packages/numba/cuda/cudadrv/devices.py", line 212, in _require_cuda_context
    return fn(*args, **kws)
  File "/usr/lib/python3/dist-packages/numba/cuda/api.py", line 74, in device_array
    stream=stream)
  File "/usr/lib/python3/dist-packages/numba/cuda/cudadrv/devicearray.py", line 96, in __init__
    gpu_data = devices.get_context().memalloc(self.alloc_size)
  File "/usr/lib/python3/dist-packages/numba/cuda/cudadrv/driver.py", line 673, in memalloc
    self._attempt_allocation(allocator)
  File "/usr/lib/python3/dist-packages/numba/cuda/cudadrv/driver.py", line 663, in _attempt_allocation
    allocator()
  File "/usr/lib/python3/dist-packages/numba/cuda/cudadrv/driver.py", line 671, in allocator
    driver.cuMemAlloc(byref(ptr), bytesize)
  File "/usr/lib/python3/dist-packages/numba/cuda/cudadrv/driver.py", line 288, in safe_cuda_api_call
    self._check_error(fname, retcode)
  File "/usr/lib/python3/dist-packages/numba/cuda/cudadrv/driver.py", line 323, in _check_error
    raise CudaAPIError(retcode, msg)
numba.cuda.cudadrv.driver.CudaAPIError: [2] Call to cuMemAlloc results in CUDA_ERROR_OUT_OF_MEMORY

You’re running out of memory. If/since this is happening during a processing loop, it’s possible that you are creating allocations without releasing them. See here.

Yes, the processing takes place inside a while loop. The out-of-memory error occurs every time on the 15th iteration. Also, at the time of the crash there is around 1.5 GB of free RAM according to the ‘top’ utility. Is there any recommended value for NUMBA_CUDA_MAX_PENDING_DEALLOCS_COUNT or NUMBA_CUDA_MAX_PENDING_DEALLOCS_RATIO?

Thanks,

If I understand this parameter correctly, it seems like setting this to zero should be the first thing to try, as this should force all deallocations to occur immediately. Allowing no deallocations to be queued and thus pending should maximize the amount of free memory available at all times. The trade-off is likely lower performance, but before worrying about performance one would want to worry about functionality. You may also want to try increasing values for this configuration parameter step-by-step to see whether any interesting observations can be made.
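For reference, a minimal sketch of setting it to zero (this assumes the environment variable is picked up before numba is first imported, either by exporting it in the shell or at the top of the script):

import os
os.environ["NUMBA_CUDA_MAX_PENDING_DEALLOCS_COUNT"] = "0"  # must be set before the first numba import
                                                           # (or export it in the shell before running the script)
from numba import cuda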

What confuses me at the moment is that the documentation claims:

The deallocation queue is flushed automatically as soon as the following events occur:

An allocation failed due to out-of-memory error. Allocation is retried after flushing all deallocations.

Which to me means that the lazy flushing of the deallocation queue should be harmless from a functionality perspective, but could cause jittery performance, similar to garbage collection kicking in every now and then in other environments.

As this really seems to be a NUMBA design issue, that is, how NUMBA is using CUDA, I would suggest asking for advice on the NUMBA support platform.

Thanks! I set NUMBA_CUDA_MAX_PENDING_DEALLOCS_COUNT to zero. Now the program doesn’t crash.

Depending on the sparsity of your matrix, different approaches could run faster. E.g. if only 3 out of 50 million elements are non-zero, calculating a prefix sum would be inefficient. If only 3 out of 50 million elements are zero, there would also be faster ways. If you want a solution with predictable execution speed, which is currently fast enough (and would be for any other data), then you do not need to bother any further.
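For the very-sparse case, one common alternative is a single kernel with an atomic counter instead of a prefix sum. A minimal sketch in numba CUDA (note that, unlike the prefix-sum approach, the output order is not guaranteed to match the input order, so a sort would be needed afterwards if serial order matters):

import numpy as np
from numba import cuda

@cuda.jit
def compact_atomic(a, out, counter):
    tid = cuda.grid(1)
    if tid < a.shape[0] and a[tid] != 0:
        idx = cuda.atomic.add(counter, 0, 1)  # reserve one slot in the output
        out[idx] = a[tid]

arr = np.zeros(50000000, dtype=np.int32)
arr[[10, 1000, 4000000]] = [1, 2, 3]

a_d = cuda.to_device(arr)
out_d = cuda.device_array(16, dtype=np.int32)            # upper bound on the non-zero count
counter_d = cuda.to_device(np.zeros(1, dtype=np.int32))

threads = 256
blocks = (arr.size + threads - 1) // threads
compact_atomic[blocks, threads](a_d, out_d, counter_d)

n = int(counter_d.copy_to_host()[0])
print(np.sort(out_d.copy_to_host()[:n]))                 # sort to restore serial order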

Hi,
Still facing issues related to the CUDA_ERROR_OUT_OF_MEMORY error. I have set NUMBA_CUDA_MAX_PENDING_DEALLOCS_COUNT to zero, but the program still crashes after some iterations.
For example:

a = np.array([1000000], dtype=np.int)
a_d = cuda.to_device(a)

Is the device memory cleared as soon as I run

a_h = a_d.copy_to_host()

According to the docs, "When the last reference to a device memory is dropped, the underlying memory is scheduled to be deallocated." What does that mean?

Or have I reached the limits of my hardware?

Perhaps reassign the variable (a_d = None) or del a_d. But it would probably be dropped at the next iteration anyway, when a_d is rebound to a new array …
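A minimal sketch of what that looks like in a processing loop (frames and preprocess are hypothetical placeholders for your per-frame pipeline):

from numba import cuda

for frame in frames:                 # hypothetical loop over image frames
    a = preprocess(frame)            # hypothetical step producing an int32 numpy array
    a_d = cuda.to_device(a)
    b_d = cuda.device_array_like(a_d)
    # ... launch kernels on a_d / b_d ...
    result = b_d.copy_to_host()      # copying back does NOT free the device buffers
    del a_d, b_d                     # dropping the last references schedules the memory for deallocation;
                                     # with NUMBA_CUDA_MAX_PENDING_DEALLOCS_COUNT=0 it is freed immediately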