Speeding up frame conversion

I am working on a Jetson Xavier AGX board running Ubuntu 18.04. OpenCV 4.5.1 and CUDA 10.2 are built on the board.

I read frames from a video, and each frame is converted to float32 using:

numpy.float32(frame)

A NumPy array is then used to store these float32 frames.
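The two steps above can be sketched roughly as follows. This is only an illustration under assumptions: the frames are synthetic uint8 arrays of shape 720x1280x3 (the real frame shape is not stated here), and `convert_frames` is a hypothetical helper name, not part of the actual pipeline.

```python
import numpy as np

def convert_frames(frames):
    """Convert a sequence of uint8 frames into one float32 array (hypothetical helper)."""
    frames = list(frames)
    # Preallocate the destination once; shape is (n_frames, H, W, C).
    out = np.empty((len(frames),) + frames[0].shape, dtype=np.float32)
    for i, frame in enumerate(frames):
        # Convert as in the post, then copy the result into the storage array.
        out[i] = np.float32(frame)
    return out

# Synthetic stand-ins for decoded video frames (shape is an assumption).
frames = [np.random.randint(0, 256, size=(720, 1280, 3), dtype=np.uint8)
          for _ in range(4)]
batch = convert_frames(frames)
print(batch.shape, batch.dtype)
```
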

We want to speed up the processing of each frame. The instruction that takes the most time, about 50% of the total, is the conversion to float32.
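For anyone who wants to reproduce the measurement, a minimal timing sketch could look like this. The frame shape and the `time_conversion` helper are assumptions made for illustration; the numbers it prints depend entirely on the machine and will not match the figures above.

```python
import time
import numpy as np

def time_conversion(frame, repeats=20):
    """Return the mean wall-clock seconds for one numpy.float32(frame) call."""
    start = time.perf_counter()
    for _ in range(repeats):
        np.float32(frame)
    return (time.perf_counter() - start) / repeats

# Synthetic uint8 frame standing in for a decoded video frame (shape is an assumption).
frame = np.random.randint(0, 256, size=(720, 1280, 3), dtype=np.uint8)
mean_s = time_conversion(frame)
print(f"float32 conversion: {mean_s * 1e3:.3f} ms per frame")
```

Comparing this value with the total time spent per frame gives the share taken by the conversion.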

To speed up this instruction I tried CuPy, a package that uses GPU acceleration for NumPy-style operations, but cupy.float32(frame) takes as long as the NumPy version. The purpose of this conversion is to turn the frames into TensorFlow tensors.

I also tried Numba, which is supposed to run loops asynchronously if I'm not mistaken, but the NumPy conversion was not compatible with it.

So, how can I speed up this conversion? And why does the CuPy call take as long as the NumPy one?

I was also thinking of doing this in C++, but I'm not sure it would speed up processing, because Python seems fast at matrix operations. Does anyone know whether it would be worthwhile?

Thanks for your time,
I hope I have included all the information needed,
Cheers.