Without referring to user space, I’m curious whether the GPU will signal the CPU proactively, in a manner that the Linux kernel can handle, once it completes a kernel computation.
I don’t know if this is what you’re really asking, but the GPU should generate a hardware IRQ at certain points. Most likely a hardware IRQ is generated at the point you are asking about. That IRQ is unlikely to be specific to the completion of a single computation, though; many computations might complete before the IRQ fires.
Thank you for your response. I would like to confirm whether the GPU signals the completion of a kernel computation using an IRQ. Additionally, does the GPU send the IRQ upon the completion of the kernel computation itself, or after the kernel computation’s output has been transferred from the GPU task memory space to the CPU process memory space?
Someone from NVIDIA would have to confirm that. Any transfer of memory from one piece of hardware (GPU) to another (CPU) would result in at least one hardware IRQ, though. The details would require going into the source code of (A) the driver that is entered or exited because of the IRQ, and (B) the scheduler policies. That’s such a general thing that I don’t think I can give you a useful answer.
Hi,
Could you share more about your use case?
In general, the CPU waits for a GPU task to finish by making a synchronization call.
Thanks.
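For readers following along, here is a minimal user-space sketch of what such a synchronization call can look like in CUDA C++. It records an event right after a kernel launch and polls it with `cudaEventQuery` until the event completes or a timeout expires. The kernel `busy_kernel`, the helper `wait_for_event`, and the 5-second timeout are illustrative assumptions and do not come from the reply above.

```cuda
#include <chrono>
#include <cstdio>
#include <thread>
#include <cuda_runtime.h>

// Hypothetical workload; any kernel would do here.
__global__ void busy_kernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

// Polls `done` until it completes or `timeout_ms` elapses.
// Returns true if the event completed, false on timeout or on a CUDA error.
bool wait_for_event(cudaEvent_t done, int timeout_ms) {
    auto deadline = std::chrono::steady_clock::now() +
                    std::chrono::milliseconds(timeout_ms);
    while (std::chrono::steady_clock::now() < deadline) {
        cudaError_t status = cudaEventQuery(done);
        if (status == cudaSuccess) return true;         // all prior work finished
        if (status != cudaErrorNotReady) return false;  // launch or runtime error
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
    return false;  // still running when the deadline passed
}

int main() {
    const int n = 1 << 20;
    float *d_data = nullptr;
    if (cudaMalloc(&d_data, n * sizeof(float)) != cudaSuccess) {
        std::fprintf(stderr, "cudaMalloc failed\n");
        return 1;
    }
    cudaMemset(d_data, 0, n * sizeof(float));

    cudaEvent_t done;
    cudaEventCreateWithFlags(&done, cudaEventDisableTiming);

    busy_kernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaEventRecord(done, 0);  // completes once the kernel above has finished

    bool finished = wait_for_event(done, 5000);
    std::printf("kernel %s\n", finished ? "finished" : "did not finish in time");

    cudaEventDestroy(done);
    cudaFree(d_data);
    return finished ? 0 : 1;
}
```

If no timeout is needed, a blocking call such as `cudaEventSynchronize(done)` or `cudaDeviceSynchronize()` serves the same purpose. Note that this only observes completion from the launching process; it does not say anything about what the kernel-mode driver sees.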
Thank you for your response. I aim to implement a function within the Linux kernel that monitors the completion status of a specified GPU kernel. Therefore, I’m currently exploring which task completion signals could be used to reliably determine the end of a task.
Hi,
We need to check with our internal team.
We will share more information with you later.
Thanks.
Hi,
Here is the info from our internal team.
It’s possible to force recover a TSG, but that is not terminating a specific task.
Since we do not track launches in kernel space, we cannot terminate a particular launch.
There is no API to do such a thing, nor is it even conceptually possible, because one does not kill tasks per se. Rather, you can reset an engine that is running a particular task (defined here as a discrete batch of work submitted to the GPU, not a TSG).
If you can provide more info about the intended use case, it may be possible to help.
Thanks.
Thank you for your response, but it does not seem to answer my question. My question is which task completion signals could be used to reliably determine the end of a task in the Linux kernel.
Hi,
We are still waiting for the answer to this question.
In the meantime, could you provide more info about the intended use case?
Thanks.
Thank you for your comprehensive response. My specific requirement involves a Linux kernel mechanism designed to identify and handle a malicious GPU kernel that is unjustly monopolizing GPU resources. My objective is to send a software signal to the GPU kernel and then determine whether it completes promptly (at this stage, I require a dependable signal to confirm the kernel’s termination). If it does not complete after receiving the software signal, I need to forcefully terminate it to prevent potential misuse of resources (here I need a way to force-terminate the malicious GPU kernel).
If it were user space, this would be “kill -9”. There are zombie processes and other places in the kernel where some intervention is needed to get to what is there, but there is a distinction between a missing process which is still scheduled, versus a process which is not responding, and is still scheduled. I’m kind of mumbling here because it is an interesting problem. In most cases it is the scheduler which determines when a process shifts out of context or not, and cleanup through the scheduler. I guess I’m an odd person, because now you have me wondering if the GPU itself has some form of hardware-based scheduler or internal method of doing the equivalent of a scheduler and accepting kill commands to force a process (or thread) out of context.
Hi,
Here is the suggestion after our internal discussion:
Please build monitoring and closing of the channels via CUDA APIs.
Is there any limitation with that approach, and why do you need to talk to the kernel?
Thanks.
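One possible user-space reading of this suggestion, sketched under stated assumptions: run the untrusted GPU work in its own process and let a supervisor kill that process if it overruns a deadline. The sketch assumes that when the process dies, the driver releases its GPU context and with it the runaway kernel; that behavior is an assumption here and was not confirmed in this thread. The names `spin_kernel`, `run_gpu_worker`, and `supervise`, and the 2-second budget, are hypothetical.

```cuda
#include <chrono>
#include <csignal>
#include <cstdio>
#include <thread>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cuda_runtime.h>

// Hypothetical misbehaving kernel: spins until *stop becomes non-zero.
__global__ void spin_kernel(volatile int *stop) {
    while (*stop == 0) { }
}

// Runs in the child process only. In this demo the kernel never finishes.
static int run_gpu_worker() {
    int *d_stop = nullptr;
    if (cudaMalloc(&d_stop, sizeof(int)) != cudaSuccess) return 2;
    cudaMemset(d_stop, 0, sizeof(int));
    spin_kernel<<<1, 1>>>(d_stop);
    return cudaDeviceSynchronize() == cudaSuccess ? 0 : 3;
}

// Waits up to `timeout_ms` for `child` to exit on its own.
// Returns true if it did; otherwise sends SIGKILL, reaps it, and returns false.
// Also returns false (without killing) if waitpid itself fails.
bool supervise(pid_t child, int timeout_ms) {
    auto deadline = std::chrono::steady_clock::now() +
                    std::chrono::milliseconds(timeout_ms);
    int status = 0;
    while (std::chrono::steady_clock::now() < deadline) {
        pid_t r = waitpid(child, &status, WNOHANG);
        if (r == child) return true;
        if (r < 0) return false;
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
    kill(child, SIGKILL);         // the user-space "kill -9"
    waitpid(child, &status, 0);   // reap the child to avoid a zombie
    return false;
}

int main() {
    // Fork before the parent makes any CUDA call, so that only the child
    // owns a GPU context.
    pid_t child = fork();
    if (child < 0) { std::perror("fork"); return 1; }
    if (child == 0) _exit(run_gpu_worker());

    bool exited = supervise(child, 2000);
    std::printf("worker %s\n", exited ? "exited on its own" : "was killed after timeout");
    return 0;
}
```

A gentler variant could send `SIGTERM` first and escalate to `SIGKILL` only after a second deadline, which mirrors the "signal first, then force" flow described earlier in the thread.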