Hello! Hope you’re having a great day. It’s 1 AM here :)
I’m currently building a high-performance library of gradient aggregation rules for very large tensors/vectors in our ML application, and it is already speeding up our work considerably. However, some parts of the code are still in PyTorch, because running CUDA through C++ code integrated into Python often doesn’t give the best results: it requires a lot of data transfer between the CPU and GPU on almost every call.
Even so, I’d like to make more use of CUDA. With the new Toolkit, is there a way to run CUDA efficiently from Python? If so, how?
Thanks in advance for any answers!