I want to implement kernel-level scheduling: when a user submits a kernel, the scheduler decides which GPU it executes on, in order to improve the overall performance of the GPU cluster. For example, if the application calls cudaSetDevice(0) but device 0 is already at 100% utilization, the scheduler should redirect the kernel to another idle GPU in the node.
So far I am able to hook the CUDA driver API (cuda.h) successfully, since that library is dynamically linked with the executable (we use LD_PRELOAD and dlsym). I can also hook the CUDA runtime API successfully, except for cudaSetDevice: when I hook cudaSetDevice, I fall into an infinite loop, and I can't figure out the cause of the problem. Can you help me, or give me some advice about kernel-level scheduling?
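For context, here is a simplified sketch of the LD_PRELOAD/dlsym interposition pattern I am describing. The actual scheduling logic is omitted; pick_idle_gpu is just a placeholder name for it, not real code:

```c
// hook.c -- minimal sketch of a cudaSetDevice interposer (not the full scheduler)
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <cuda_runtime.h>

// Pointer to the real cudaSetDevice, resolved from the next object in the
// link-map order so the lookup does not resolve back to this wrapper.
static cudaError_t (*real_cudaSetDevice)(int) = NULL;

cudaError_t cudaSetDevice(int device)
{
    if (real_cudaSetDevice == NULL) {
        real_cudaSetDevice =
            (cudaError_t (*)(int))dlsym(RTLD_NEXT, "cudaSetDevice");
        if (real_cudaSetDevice == NULL) {
            fprintf(stderr, "failed to resolve cudaSetDevice: %s\n", dlerror());
            return cudaErrorUnknown;
        }
    }

    // Placeholder: the scheduler would map the requested device to an idle
    // one here, e.g. int target = pick_idle_gpu(device);
    int target = device;

    return real_cudaSetDevice(target);
}
```

Built and loaded along the lines of:

```
gcc -shared -fPIC hook.c -o libhook.so -I/usr/local/cuda/include -ldl
LD_PRELOAD=./libhook.so ./my_cuda_app
```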