Troubleshooting Large File Loading Issues in Omniverse Nucleus on AWS Cloud

Hi all,
I am from the NTT DATA team. We run our cloud infrastructure on AWS with a dedicated machine for Omniverse Nucleus, and we are experiencing issues when loading USD scenes composed of many (and large) files. For the upload we use the omni.client library, calling copy_async from inside a REST server. Partway through the upload, when a large file (1-2 GB) is reached, the call fails, and the only response we receive is “Result.Error”.
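
For reference, this is roughly what each upload call looks like (a simplified sketch: upload_one and the paths are placeholders, client initialization and the surrounding REST handler are omitted):

import asyncio

import omni.client

async def upload_one(local_path: str, nucleus_url: str) -> None:
    # copy_async returns an omni.client.Result; anything other than
    # Result.OK is treated as a failure here.
    result = await omni.client.copy_async(local_path, nucleus_url)
    if result != omni.client.Result.OK:
        raise RuntimeError(f"copy of {local_path} failed: {result}")

# asyncio.run(upload_one("/tmp/extracted/scene.usd",
#                        "omniverse://<nucleus-host>/Projects/scene.usd"))

Our questions: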

  1. How can we obtain more detailed error logs from the omni.client library? “Result.Error” doesn’t provide much information. (A sketch of the log hooks we have been trying is below, after the traceback.)

  2. After the first error, all Nucleus services stop working. Restarting the instance doesn’t change anything; we have to recreate the server from scratch to get it working again.

  3. We cannot determine which container logs we should be checking, or whether there is a log level we can raise. The only error we found in the logs is in the nucleus_thumbnails container, and it is as follows:

File "/omni/create_thumbnails.py", line 428, in handle_task
  await create_thumbnails_cached(connection, file_transfer, file_path=path, file_hash=lm.hash_value,
File "/omni/prometheus_utils.py", line 97, in func_wrapper
  return await func(*args, **kwargs)
File "/omni/create_thumbnails.py", line 341, in create_thumbnails_cached
  system_thumb_hash = await upload_thumb(conn, file_transfer, system_thumb_path, thumbnail_data)
File "/omni/create_thumbnails.py", line 319, in upload_thumb
  async with await file_transfer.create(path=system_thumb_path,
File "/omni/_deps/omniverse_connection/omni/lft.py", line 69, in __aexit__
  await self.end()
File "/omni/_deps/omniverse_connection/omni/lft.py", line 165, in end
  raise FileTransferException(str(result.status))
omni.lft.FileTransferException: (INTERNAL_ERROR)
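
Regarding question 1, what we have been experimenting with so far is routing the client log through a callback, roughly as below. We are not sure that set_log_callback / set_log_level are the right hooks, that VERBOSE is the most detailed level, or that the callback signature is exactly this, so corrections are welcome:

import sys

import omni.client

def _on_log(thread_name, component, level, message):
    # Mirror the client library's internal log to stderr so that a failing
    # copy_async reports more than just the Result value.
    print(f"[{level}] {component}: {message}", file=sys.stderr)

omni.client.set_log_callback(_on_log)
omni.client.set_log_level(omni.client.LogLevel.VERBOSE)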

To provide more context: we are extracting files from a zip archive (for testing purposes this archive weighs 4 GB and extracts to approximately 20 GB), and the extracted files are then sent from the REST server to Nucleus. As explained above, after a certain number of files have been uploaded, Nucleus stops working. The upload always gets stuck on the same file (size: 1.8 GB). Uploading that file individually works without any problems; the issue only occurs when we upload all the files together. Some of the tests we have conducted are listed below, none of which yielded improvements (a simplified sketch of the bulk-upload loop follows the list):

  1. We scaled the containers responsible for transfers (nucleus-lft) from 1 to 3.
  2. We used an instance with more RAM and CPU.
  3. We allocated 5 to 8 GB of RAM to the nucleus-lft containers.
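
In case it helps to reproduce, this is roughly the loop that sends the extracted files. The asyncio.Semaphore throttle is something we added while testing (the limit of 2 is arbitrary), and upload_tree / MAX_PARALLEL are just names from this simplified sketch:

import asyncio
from pathlib import Path

import omni.client

MAX_PARALLEL = 2  # arbitrary throttle; we also tested without it

async def upload_tree(local_root: str, nucleus_root: str) -> None:
    sem = asyncio.Semaphore(MAX_PARALLEL)

    async def upload_one(path: Path) -> None:
        rel = path.relative_to(local_root).as_posix()
        async with sem:  # cap concurrent transfers hitting nucleus-lft
            result = await omni.client.copy_async(str(path), f"{nucleus_root}/{rel}")
            if result != omni.client.Result.OK:
                raise RuntimeError(f"{rel}: {result}")

    files = [p for p in Path(local_root).rglob("*") if p.is_file()]
    await asyncio.gather(*(upload_one(p) for p in files))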

Thanks for your support!