Originally published at: https://developer.nvidia.com/blog/accelerating-python-applications-with-cunumeric-and-legate/
cuNumeric is an aspiring drop-in replacement for NumPy based on the Legion programming system.
Does cuNumeric export native BLAS API calls that can be picked up in Cython with cimport, the same way NumPy and SciPy do (e.g. cython_blas.pxd)?
There are no plans to provide an interface at the level of the BLAS API. The BLAS API operates at the level of raw pointers and strides, and this is too low-level for cuNumeric to target effectively. cuNumeric aims to provide an implementation that can scale to multiple nodes, and thus may need to split the data across multiple address spaces, meaning that a single pointer would not be a good representation for an array. The NumPy level, which uses abstract array objects, is easier to transparently re-implement as a distributed library.
We do provide an “entry point” into cuNumeric, where a cuNumeric array can be initialized from an existing memory buffer. Then you can use the NumPy API to operate on that array, and cuNumeric will take care of sharding the data, parallelizing the work etc. The final result at the end of this process can be “inline mapped”, which pulls all data to one place and provides it as a local buffer. This happens automatically if you try to do anything with a cuNumeric array that requires all the data to be in one place, e.g. printing it out.
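A minimal sketch of that workflow is below. It uses plain NumPy as a stand-in so it runs anywhere; with cuNumeric installed, you would swap the import for `import cunumeric as np` and the same array calls would be sharded and parallelized by Legate. The specific operations chosen here are illustrative, not from the original discussion.

```python
import numpy  # host-side NumPy, used for the source buffer and the final pull-back

# Stand-in import so this sketch runs without cuNumeric installed.
# With cuNumeric available, use instead:
#   import cunumeric as np
import numpy as np

# "Entry point": initialize an array from an existing memory buffer.
# cuNumeric's array() accepts NumPy arrays and buffers, mirroring numpy.array().
host_buffer = numpy.arange(12, dtype=numpy.float64)
arr = np.array(host_buffer)  # with cuNumeric, the data may be split across nodes

# Operate through the NumPy API; cuNumeric handles sharding and parallelism.
result = np.sqrt(arr).reshape(3, 4).sum(axis=0)

# "Inline mapping": any operation that needs all the data in one place
# (printing, converting back to host NumPy) pulls it into a local buffer.
local = numpy.asarray(result)
print(local)
```

Note that the conversion back to a host array at the end is exactly the kind of switch the answer below warns against doing repeatedly.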
Note that it’s likely not optimal to be switching in and out of cuNumeric repeatedly, as every switch causes blocking and other overheads.