I have complete C code for a machine learning model that is well optimized for the CPU. Now I want to make it multichannel. Suppose, let's say, my GPU contains 3054 cores; can I run my process on each core, so that 3054 processes run in parallel with different inputs?
Is there any method/framework to do this directly with the C code, or do we need to convert it to CUDA code first and then run multiple channels at a time?
Our goal is to maximize the number of channels we can run simultaneously on the GPU.
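For context, this is the kind of thing I am imagining: a minimal CUDA sketch where each GPU thread handles one independent input "channel". All names here are hypothetical, and `run_model_step` is just a stand-in for the real model's computation:

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for the real model's per-input computation.
__device__ float run_model_step(float x) {
    return x * 0.5f + 1.0f;  // placeholder math, not the actual model
}

// One GPU thread per channel: thread i processes inputs[i] independently.
__global__ void run_channels(const float *inputs, float *outputs, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)  // guard: n need not be a multiple of the block size
        outputs[i] = run_model_step(inputs[i]);
}

int main(void) {
    const int n = 3054;  // one "channel" per input
    float *in, *out;
    cudaMallocManaged(&in,  n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = (float)i;

    int block = 256;
    int grid  = (n + block - 1) / block;  // enough blocks to cover all channels
    run_channels<<<grid, block>>>(in, out, n);
    cudaDeviceSynchronize();

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

Is this thread-per-input pattern the right mental model, or is there a way to get the same parallelism without rewriting the C code as a kernel?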