Cannot connect to process and stuck in "Searching for attachable processes ..."

Hi,
I am trying to use the latest Nsight Compute GUI (2022.2.0) on macOS to connect to a remote Linux server. I can connect to this A100 machine directly with ssh -p xxxx username@xxx.xxx.xx.xx. The problem is that I cannot connect to the process, as shown below.

I used the Interactive Profile activity, and I also made some attempts in the console of the A100 machine. The binary ./a.out works well. If I simply run ncu ./a.out, everything is fine. But when I try ncu --mode=launch ./a.out, it gets stuck at ==PROF== Waiting for profiler to attach on ports 49152-49215. While it was stuck there, I checked port 49152 with lsof, and it showed this:

At other times the lsof command returns nothing; it only shows output while ncu is stuck waiting. I also tried other ports, as suggested, but they behaved the same way. Note that I do not have sudo privileges. I don’t know whether this is a port conflict, a connection failure, or just a consequence of not having administrator permissions. Or did I miss something?

Could anyone please give me some advice? Thanks a lot.

It seems you are doing the right steps, but there may be some problems with the selected (default) ports. Can you please confirm that you tried

ncu --mode=launch ./a.out

on the remote target machine, and that it stops at “Waiting for profiler to attach”? That would be the expected behavior, as the application is launched on the target system and then suspended in the first CUDA API call, waiting for the host (in this case the local UI) to attach.

Can you please try

  • Launching the application on the remote target system using ncu --mode launch app, followed by
  • Using the local host UI’s Interactive Profile activity in “Attach” mode/tab, with your remote system selected in the connection dialog? Does this show the remote process available to attach?

Also, does the remote file selection work, i.e. while having the remote system selected in the connection dialog, click the “…” button next to “Application Executable”?

Thanks. I have tried the following steps:

  1. Run ncu --mode=launch ./a.out, which waits at the “Waiting for profiler to attach” message as before.

  2. Then I try the “Attach” mode, but nothing appears, no matter how often I refresh.

  3. I checked the remote file selection, and it works fine.

  4. I press Ctrl+C to stop ncu on the remote system, and in most cases nothing is printed. But in a few tries, it exits like this (I don’t know why)

  5. The connection preferences are like this

Can you check with a command-line utility whether a TCP connection can be established on any of these ports, e.g. with netcat or iperf3:

Target machine: nc -l 49152
Host machine: nc <ip> 49152

or

Target machine: iperf3 -s -p 49152
Host machine: iperf3 -c <ip> -p 49152
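
To check the whole default range in one go rather than port by port, the nc test above can be scripted. The following is a minimal sketch, assuming an nc build that supports -z (connect without sending data) and -w (timeout in seconds), as the OpenBSD and macOS variants do. probe_port and probe_range are hypothetical helper names, not part of Nsight Compute. Note that a port only reads as open while something is listening on it on the target, for example nc -l or an ncu --mode=launch process waiting for attach.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Nsight Compute) to probe TCP ports.
# Assumes an nc that supports -z (scan only) and -w (timeout, seconds).

# Print "open" if a TCP connection to host:port succeeds, else "closed".
probe_port() {
    if nc -z -w 2 "$1" "$2" >/dev/null 2>&1; then
        echo "open"
    else
        echo "closed"
    fi
}

# Probe every port from first to last (inclusive), one result per line.
probe_range() {
    host="$1"
    port="$2"
    last="$3"
    while [ "$port" -le "$last" ]; do
        echo "$port $(probe_port "$host" "$port")"
        port=$((port + 1))
    done
}

# Example (run on the host machine; <ip> is the target's address):
#   probe_range <ip> 49152 49215
```

If every port stays closed even while a listener is running on the target, that points toward filtering between the two machines rather than a problem with the profiler itself.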

If these don’t work either, this may simply be a firewall problem between these machines. Depending on your local permissions and policies, you could either

  • ask the admin to open these ports for TCP
  • use an ssh config to tunnel these ports over ssh, since the ssh port itself is open
  • collect the report using ncu -o remotely and copy it over to your local system to open it in the UI
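
The last option needs no open profiler ports, only the SSH port that already works. A rough sketch, assuming the report is written as a_out_report (a made-up name; ncu -o appends the .ncu-rep extension) and using a hypothetical fetch_report helper that wraps scp:

```shell
#!/bin/sh
# Step 1, on the remote target: profile non-interactively and save a report.
#   ncu -o a_out_report ./a.out        # writes a_out_report.ncu-rep

# Step 2, on the local machine: copy the report over SSH.
# Hypothetical helper; arguments: SSH port, user@host, remote report path.
# Set DRY_RUN=1 to print the scp command instead of running it.
fetch_report() {
    set -- scp -P "$1" "$2:$3" .
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Example, using the placeholders from this thread:
#   fetch_report xxxx username@xxx.xxx.xx.xx a_out_report.ncu-rep
```

The copied .ncu-rep file can then be opened in the local Nsight Compute UI.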

Many thanks! It seems that it is a firewall problem indeed. I would really appreciate it if you could give me another hand and teach me more about the ssh tunnel.

Suppose the server IP is IP1 and the target machine's SSH port is PortA, so I can connect to the target machine with ssh -p PortA username@IP1. I tried the command ssh -L 50152:IP1:49152 username@IP1 -p PortA to build the tunnel, but it did not work. What is wrong with this port forwarding? Thanks!

As stated in the previous answer, the easiest way to work around this if you
don’t need interactive profiling would be to collect the report on the remote
machine with the command line profiler.

If you need interactive profiling, it may be possible to work around your
firewall issue by using NVIDIA Nsight Compute’s support for the SSH
ProxyJump/ProxyCommand option.

When Nsight Compute detects that the remote connection uses a proxy command to
create the socket connected to the remote SSH server, it sets up a local SOCKS
proxy from the local machine to the target and transparently forwards all
connections through that tunnel.

To use this functionality, you will have to set up your local SSH configuration
to use a proxy command to connect to the target and set that proxy command to
jump through your local machine before connecting to the target.

Assuming the hostname or IP address of the target machine is <target_host>, you
could add the following lines to ~/.ssh/config:

Host <target_host>
	ProxyJump localhost

In order for this to work, you will need to start an OpenSSH server on your
local machine and authenticate to the local machine through SSH keys.
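
One way to satisfy both requirements is sketched below. The assumptions are loud ones: the key path ~/.ssh/id_ed25519 is just an example, authorize_local_key is a hypothetical helper, and enabling the SSH server on macOS is shown only as a comment because it needs administrator rights on the local machine (enabling Remote Login in the macOS Sharing settings does the same thing).

```shell
#!/bin/sh
# 1. Start the local SSH server. On macOS this is "Remote Login"
#    (assumption: you have admin rights on your own machine):
#      sudo systemsetup -setremotelogin on
#
# 2. Create a key pair if you do not have one yet (example path):
#      ssh-keygen -t ed25519 -f ~/.ssh/id_ed25519

# 3. Authorize the public key for logins to the local machine.
# Hypothetical helper; arguments: public key file, optional SSH directory.
authorize_local_key() {
    pubkey="$1"
    ssh_dir="${2:-$HOME/.ssh}"
    auth="$ssh_dir/authorized_keys"
    mkdir -p "$ssh_dir" && chmod 700 "$ssh_dir"
    touch "$auth" && chmod 600 "$auth"
    # Append the key only if the exact line is not already present.
    grep -qxF -- "$(cat "$pubkey")" "$auth" || cat "$pubkey" >> "$auth"
}

# Example:
#   authorize_local_key ~/.ssh/id_ed25519.pub
#   ssh localhost true   # should not ask for the account password
```

The helper is idempotent, so running it twice does not duplicate the key in authorized_keys.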

To check everything is set up correctly, from the local machine with the modified
SSH configuration, you can try:

$ ssh <target_host>

If everything is set up correctly, you should not be prompted for authentication
to the local machine and should successfully connect to the target host.
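
If the login works but you want to confirm that the proxy setting is really being picked up for that host, ssh -G (available in OpenSSH 6.8 and later) prints the effective client configuration without connecting. A small sketch; filter_proxy_settings and show_proxy_settings are hypothetical helper names:

```shell
#!/bin/sh
# Keep only the settings relevant to the proxy setup from "ssh -G" output.
filter_proxy_settings() {
    grep -i -E '^(hostname|port|user|proxyjump|proxycommand) '
}

# Print the effective proxy-related settings for one host alias.
show_proxy_settings() {
    ssh -G "$1" | filter_proxy_settings
}

# Example:
#   show_proxy_settings <target_host>
# A "proxyjump localhost" (or "proxycommand ...") line should be listed.
```

If no proxy line shows up, the Host pattern in ~/.ssh/config probably does not match the name being used to connect.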

On some macOS machines, the installed version of the OpenSSH client does not
support the ProxyJump option but does support the ProxyCommand one. If this is
the case, you can replace the ProxyJump localhost line in the configuration
snippet above with ProxyCommand ssh -W %h:%p localhost.

Once this is set up, you should be able to do interactive profiling without
changing the connection target in the connection settings dialog.
When successful, you should see Started SSH SOCKS proxy on port: <port> in the log messages.

You may also refer to the Remote Connections documentation for further information.

Thank you very much! I didn’t expect your reply to be so detailed. Thanks for your nice reply.

I have tried what you suggested:

  1. Modify the SSH configuration like this:
Host a100
    ProxyJump localhost
    HostName xxx.xxx.xx.xx
    Port xxxx
    User xxxx
  2. Now if I ssh a100, I am asked for the password of my local machine.

  3. Add SSH keys on both my local machine and the target host, so now I can ssh a100 directly without any authentication prompts.

  4. However, nothing changes, and I am still stuck in the “Searching for attachable processes ...” loop. The log messages are exactly the same as before. Sad.

Have you adjusted the connection settings within Nsight Compute to now also connect to “a100”, rather than the original IP address? Otherwise, Nsight Compute would not take advantage of your ProxyJump configuration.

Yes. Sure. I always connect to the “a100”… The difference is that now I don’t need to enter the password in the Connection Dialog.

I also had this problem when connecting to the remote server.

But when I ssh in with X forwarding and run the UI on the server:
local:~$ssh -X linux.server
linux:~$ncu-ui

I can launch it normally, and the ID of the attached process is 484131.

Could it be that the limited process range in the Nsight Compute settings causes the problems?