Failed Watchdog timeout in thread server_main in CloudXRRemoteHMD after 10.440015 seconds. Aborting

I’m currently seeing this error at the end of the vrserver log when I start SteamVR. It causes SteamVR to croak, and the CloudXR add-on is then blocked on the next start. Is this something that I can configure around? Can I turn off the watchdog for the CloudXR threads?

Hi @marc24 - Can you please send over your server logs?

Thanks!

Veronica

Hi Veronica,

Attached are the CloudXR server logs, and the Steam logs. The machine has just booted, and we’re starting SteamVR automatically on user login with a Windows scheduled task. It comes up and shows the “Unexpected SteamVR Error” dialog that says “SteamVR has encountered a critical error.” and has buttons to restart or quit SteamVR. These log snapshots are taken at this point. If I then restart SteamVR, it will come up in safe mode with the Gamepad and CloudXR add-ons blocked. Please let me know if there is more info I can gather for you. Thanks.

/marc

cloudxrlogs.zip (1.1 KB)
steamlogs.zip.zip (850.9 KB)

Hi @Veronica_NVIDIA,

We have also tested this with CloudXR 3.0 and are seeing the same issue. The “Known issues” part of the docs talks about SteamVR coming up in safe mode and recommends using SteamVR Beta. Is that what you would suggest for this issue? Thanks.

/marc

Also adding a Steam report for one of our instances, in case that provides information that is useful…

/marc

SteamVR-2021-09-03-AM_09_15_09.txt (463.0 KB)

We are seeing the exact same issue on our Windows VMs running CloudXR Server. SteamVR runs for about 10 seconds before reporting the Critical Error (and offering to Quit or Restart SteamVR).

Failed Watchdog timeout in thread server_main in CloudXRRemoteHMD after 11.886055 seconds. Aborting.

The error definitely originates from the CloudXRRemoteHMD, but the CloudXR Server / Streamer Server logs do not provide any indication of the error. Is there any way to obtain more information / logfiles about what happens in the CloudXRRemoteHMD?

It is worth noting that this behaviour seems to happen only right after booting a machine, so it is probably some kind of race condition at system startup, and maybe specific to virtual machines (with virtualized hardware). After a manual restart of SteamVR, everything usually works as expected. A manual restart is not really an option for us, though, so we would love to automate this procedure and wait until the system is in a state where the watchdog timeout does not occur. The problem is that we don’t know what that target state is.

Did you find out any solution / workaround @marc24?

Hi @vrdev,

Thanks for prodding me to reply. We did figure out a workaround last week. It turns out this issue is related specifically to AWS machines with EBS. When you start an EBS-backed machine in AWS, the EBS volume needs to be hydrated from the snapshot that it is based on. The snapshot data is backed by S3, so the first read of any data from the volume is very slow. We now run a small PowerShell script that does the equivalent of cat {files} > /dev/null for every file in the CloudXR and SteamVR directories before we start SteamVR. This hydrates those files into the volume so that later reads are fast, and we haven’t had crashes since then. More information on this is here: Initialize Amazon EBS volumes - Amazon Elastic Compute Cloud. Hope this helps.
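
For anyone wanting a starting point, here is a minimal sketch of that kind of warm-up step. It is not the exact script described above. The function name Read-FilesToWarmVolume, the 1 MB buffer size, and the CloudXR install path in the usage comment are assumptions; the SteamVR path is the one that contains vrstartup.exe in a default Steam install. The idea is simply to read every byte of every file once and throw the data away, which forces the volume to fetch those blocks.

```powershell
# Sketch only: reads each file under the given folders once and discards the data.
# Read-FilesToWarmVolume is a hypothetical helper name, not part of CloudXR or SteamVR.
function Read-FilesToWarmVolume {
    param(
        [Parameter(Mandatory)][string[]]$Path,
        [int]$BufferSize = 1MB   # assumed chunk size, tune as needed
    )

    $buffer = New-Object byte[] $BufferSize
    $totalBytes = 0L

    # Missing folders and unreadable entries are skipped silently here.
    $files = Get-ChildItem -Path $Path -Recurse -File -ErrorAction SilentlyContinue

    foreach ($file in $files) {
        try {
            # Open read-only and allow other processes to keep the file open.
            $stream = [System.IO.File]::Open($file.FullName, 'Open', 'Read', 'ReadWrite')
            try {
                while (($read = $stream.Read($buffer, 0, $buffer.Length)) -gt 0) {
                    $totalBytes += $read
                }
            }
            finally {
                $stream.Dispose()
            }
        }
        catch {
            Write-Warning "Skipped $($file.FullName): $($_.Exception.Message)"
        }
    }

    # Total number of bytes read, useful for logging how much was warmed.
    return $totalBytes
}

# Example call (paths are assumptions, adjust to your install):
# Read-FilesToWarmVolume -Path @(
#     "C:\Program Files (x86)\Steam\steamapps\common\SteamVR",
#     "C:\Program Files\NVIDIA Corporation\CloudXR"
# )
```

Run it before launching vrstartup.exe. It reads files one at a time, so on a large install the first pass can still take a while.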

/marc

Hello @marc24 ,
I am running into a similar issue (when I launch an instance based on an AMI whose EBS volume has to be hydrated from a snapshot): either the CloudXR add-on is blocked in SteamVR when it is opened automatically on login, or vrstartup.exe itself runs into a problem. I looked into how you solved it and the link you put in your response, but I am having difficulty understanding exactly how you did it. If you could provide more detail on how you wrote the PowerShell script to cat {files} > /dev/null for all the files in the CloudXR and SteamVR directories before starting SteamVR, that would be greatly appreciated!

Hello
It’s been a while since this issue was last discussed.
The issue is still happening today, with CloudXR 3.2 and the latest version of SteamVR, with the same root cause (a timeout in SteamVR related to files stored on an EBS volume).
We wrote a script that preloads all the files in the CloudXR and SteamVR folders, and we still see the same issue. It can take up to 15 minutes to access all the files.
Maybe we have the wrong list of critical files?
Can someone who has had success with this method share a script that works?
Best regards,
Alexis

I tried to do a similar thing with a PowerShell script, and yes, it does take a long time to access all the files. Another approach I took was to have a PowerShell script running in the background that detects when SteamVR crashes due to a critical error, restarts it, and unblocks CloudXR in the steamvr.vrsettings file. However, because of the lazy loading, this restart and unblocking might happen multiple times before SteamVR finally works. Once the connection to the CloudXR client is made, SteamVR can crash again (maybe a couple of times). With enough restarts, it works fine.

I know this is not the most effective method, but in my experience, even with all the restarts, it takes less time than accessing all the SteamVR files to get the connection up and running.

Here are the code snippets from my script that are relevant to what I wrote above.

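function unblockCloudXR{
    $file = "C:\Program Files (x86)\Steam\config\steamvr.vrsettings"
    $content = Get-Content $file -Raw
    $repstring = '"blocked_by_safe_mode" : false'
    #-replace takes a regex; \s* tolerates different spacing around the colon
    $newContent = $content -replace '"blocked_by_safe_mode"\s*:\s*true', $repstring
    #-NoNewline (PowerShell 5+) stops Set-Content from appending an extra line break on every call
    Set-Content $file $newContent -NoNewline
}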

function steamVRCrashed{
    #this typically happens on the first launch from an AMI: the WerFault process asks to stop vrserver
    if (Get-Process -Name "WerFault" -ErrorAction SilentlyContinue) {
        Write-Host "WerFault is running."
        #report a crash; startup then closes both vrserver and vrmonitor (vrmonitor does not close vrserver)
        return $true
    }
    
    #this typically also happens after SteamVR is restarted following the issue above. It can happen multiple times.
    $file = "C:\Program Files (x86)\Steam\config\steamvr.vrsettings"

    #Select-String treats the pattern as a regex; \s* tolerates different spacing around the colon
    if(Select-String -Path $file -Pattern '"blocked_by_safe_mode"\s*:\s*true'){
        Write-Host "CloudXR is blocked."
        return $true
    }

    #This occurs when steamvr runs into a critical error.
    if (Get-Process -Name "vrmonitor" -ErrorAction SilentlyContinue){
        if ( -not (Get-Process -Name "vrserver" -ErrorAction SilentlyContinue)){
            Write-Host "there has been a crash"
            return $true            
        }
    }
    return $false

}

function startup{
    param(
        [string]$instance_id
    )

    #stop all relevant processes
    if (Get-Process -Name "WerFault" -ErrorAction SilentlyContinue) {
        Write-Host "WerFault will now be closed."
        Stop-Process -Name "WerFault"
    }
    if (Get-Process -Name "vrserver" -ErrorAction SilentlyContinue){
        Write-Host "vrserver will now be closed"
        Stop-Process -Name "vrserver"
    }
    if (Get-Process -Name "vrmonitor" -ErrorAction SilentlyContinue){
        Write-Host "vrmonitor will now be closed"
        Stop-Process -Name "vrmonitor"
    }

    unblockCloudXR

    #get start time of script
    $scriptStartTime = Get-Date
    Start-Sleep -Seconds 1

    #start steamVR
    & "C:\Program Files (x86)\Steam\steamapps\common\SteamVR\bin\win64\vrstartup.exe"

    $process1 = "vrmonitor"
    $process2 = "vrserver"

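    while ($true) {
        #if (steamVRCrashed){return}
        if ((Get-Process -Name $process1 -ErrorAction SilentlyContinue) -and (Get-Process -Name $process2 -ErrorAction SilentlyContinue)) {
            Write-Host "$process1 and $process2 are both running."
            break
        }
        #stop waiting after 120 seconds (an arbitrary limit) so the outer loop can run its crash check
        if (((Get-Date) - $scriptStartTime).TotalSeconds -gt 120) {
            Write-Host "Timed out waiting for $process1 and $process2."
            break
        }
        Start-Sleep -Seconds 1
    }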
}


while ($true){
    startup -instance_id $instance_id
    #check if client has been disconnected
    while ($true){
        if (steamVRCrashed){
            break
        }
        
        Start-Sleep -Seconds 1

    }
}

Hello
Thank you so much for your suggestion. This was really helpful.
The script you shared does exactly the job and the error is finally gone.
Best regards,
Alexis