Trouble uploading a trained model from a local machine

Hi,

I'm trying the Clara Train SDK. It's been working great so far! However, I'm having trouble uploading my trained model from my local machine. The API Visualization Tool gives Error 500: Internal Server Error. Do you have any suggestions? Thanks!

Hi

Thanks for your interest in AIAA.

Could you provide more information on the steps you did? You should do the following:

  1. Load the models to the server.
  2. If you are using MITK, you need to go to the NVIDIA AIAA preference page and point it to the server.
  3. Check that the models are loaded successfully by listing them, either from MITK MultiLabel Segmentation ("click here to see details of available models") or just by using the browser link http://<server_ip>:5000/v1/models (see the example below).
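
For example, from the command line it would be something like this (assuming the server runs locally on port 5000; adjust the host/port to your setup):

curl -X GET "http://127.0.0.1:5000/v1/models"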

Hope that helps

Hi,

Thanks for your reply. I'm having trouble with the first step you mentioned. I tried to upload the files as required from the AIAA server API page: I zipped a TensorFlow model in checkpoint format and modified the model config. But after I hit the "execute" button, it just showed "Internal server error" and said that "Either the server is overloaded or there is an error in the application". I think there's something wrong with the model format. I also checked this page: http://<server_ip>:5000/v1/models, and my model did not show up.

Could you please give me some more detailed instructions about uploading the model? Thanks!

Hi there,

Let's first try uploading our model;
once you get the idea of how that works, we can look at how to upload your model.

So first download model.zip and config-aiaa.json from here: https://drive.google.com/drive/folders/1xLq4hsuCxf80HN2FKbUoWlY001Me3Nmo

Then create a directory and put these two files inside it.

After that, go to that directory and unzip model.zip.
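
For example, something along these lines (the directory name byom_spleen is just a placeholder):

mkdir byom_spleen
mv model.zip config-aiaa.json byom_spleen/
cd byom_spleen
unzip model.zip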

Then run:

curl -X PUT "http://127.0.0.1:5000/admin/model/byom_segmentation_spleen" \
     -F "config=@config-aiaa.json;type=application/json" \
     -F "data=@model.zip"

Then, once you see the response in your terminal,
check http://<server_ip>:5000/v1/models again; the new model should be listed.

Let’s try this first and see if this works.
I just tried it on my machine and it worked, so let's see if everything is right on your end.

Thank you so much, I’ll definitely try it!

Also, could you please explain the difference between config_aiaa.json and config_inference.json? Can I pass the config_inference.json to AIAA server?

Thanks!

Hi,

I am assuming you are referring to the files you see in the MMAR.

config_aiaa.json is for the Clara Train AIAA server to read and work with your model.
config_inference.json is for Clara Deploy to run inference on your model.

If you check them more closely, you will see
that config_inference.json has pre-transforms such as LoadNifti/Resample/Crop,

while config_aiaa.json does not have those, because it assumes the client side has already done that work
[for example: https://github.com/NVIDIA/ai-assisted-annotation-client/blob/master/cpp-client/src/client.cpp#L242]

You definitely don't want to pass config_inference.json to the AIAA server.
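
If you want to see the difference yourself, a quick way is to diff the two files (assuming both sit in the MMAR's configs directory; adjust the paths to your layout):

diff configs/config_aiaa.json configs/config_inference.json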

Thank you so much, that really answered my question!

I have encountered the same problem; what should I do?

I downloaded the model.zip and config-aiaa.json and tried your solution:

curl -X PUT "http://127.0.0.1:5000/admin/model/byom_segmentation_spleen" \
     -F "config=@config-aiaa.json;type=application/json" \
     -F "data=@model.zip"

But the terminal displays:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<title>500 Internal Server Error</title>
<h1>Internal Server Error</h1>
<p>The server encountered an internal error and was unable to complete your request.  Either the server is overloaded or there is an error in the application.</p>

When I check the annotation server terminal, it displays:

(Env) crh_edu@server:~/nvidiaModels$ docker run --runtime=nvidia -it --rm -p 5000:5000 rayhhxxx/clara-train-sdk:v1.0-py3 start_aas.sh
                                                                                           
================
== TensorFlow ==
================

NVIDIA Release 19.02 (build 5618942)
TensorFlow Version v1.13.0-rc0

Container image Copyright (c) 2018, NVIDIA CORPORATION.  All rights reserved.
Copyright 2017-2018 The TensorFlow Authors.  All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION.  All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying project or file.

NOTE: MOFED driver for multi-node communication was not detected.
      Multi-node communication performance may be reduced.

NOTE: The SHMEM allocation limit is set to the default of 64MB.  This may be
   insufficient for TensorFlow.  NVIDIA recommends the use of the following flags:
   nvidia-docker run --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 ...

Scheduler started
 * Serving Flask app "AIAAServer" (lazy loading)
 * Environment: production
   WARNING: This is a development server. Do not use it in a production deployment.
   Use a production WSGI server instead.
 * Debug mode: off
Creating a regular frozen graph from Checkpoint at '/var/nvidia/aiaa/downloads/host-5000/byom_segmentation_spleen' ...
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/tools/freeze_graph.py:264: FastGFile.__init__ (from tensorflow.python.platform.gfile) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.gfile.GFile.
Loaded meta graph file '/var/nvidia/aiaa/downloads/host-5000/byom_segmentation_spleen/model.ckpt.meta
WARNING:tensorflow:From /usr/local/lib/python3.5/dist-packages/tensorflow/python/tools/freeze_graph.py:127: checkpoint_exists (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to check for files with this prefix.
2019-09-09 04:03:44.396233: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2100000000 Hz
2019-09-09 04:03:44.399762: I tensorflow/compiler/xla/service/service.cc:161] XLA service 0x7f216bb4e8a0 executing computations on platform Host. Devices:
2019-09-09 04:03:44.399837: I tensorflow/compiler/xla/service/service.cc:168]   StreamExecutor device (0): <undefined>, <undefined>
2019-09-09 04:03:45.654295: I tensorflow/compiler/xla/service/service.cc:161] XLA service 0x7f216bc04170 executing computations on platform CUDA. Devices:
2019-09-09 04:03:45.654361: I tensorflow/compiler/xla/service/service.cc:168]   StreamExecutor device (0): GeForce RTX 2080 Ti, Compute Capability 7.5
2019-09-09 04:03:45.654382: I tensorflow/compiler/xla/service/service.cc:168]   StreamExecutor device (1): GeForce RTX 2080 Ti, Compute Capability 7.5
2019-09-09 04:03:45.654401: I tensorflow/compiler/xla/service/service.cc:168]   StreamExecutor device (2): GeForce RTX 2080 Ti, Compute Capability 7.5
2019-09-09 04:03:45.654419: I tensorflow/compiler/xla/service/service.cc:168]   StreamExecutor device (3): GeForce RTX 2080 Ti, Compute Capability 7.5
2019-09-09 04:03:45.656659: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:18:00.0
totalMemory: 10.76GiB freeMemory: 6.38GiB
2019-09-09 04:03:45.656752: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 1 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:3b:00.0
totalMemory: 10.76GiB freeMemory: 10.60GiB
2019-09-09 04:03:45.656815: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 2 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:86:00.0
totalMemory: 10.76GiB freeMemory: 38.56MiB
2019-09-09 04:03:45.656874: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 3 with properties:
name: GeForce RTX 2080 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.545
pciBusID: 0000:af:00.0
totalMemory: 10.76GiB freeMemory: 4.94GiB
2019-09-09 04:03:45.657455: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0, 1, 2, 3

Update:

To address the problem, I checked the Flask server log at http://127.0.0.1:5000/logs?lines=-1
and found: tensorflow.python.framework.errors_impl.InternalError: CUDA runtime implicit initialization on GPU:2 failed. Status: out of memory
So I solved the issue simply by killing the process that was occupying the GPU.
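
Roughly, the steps were (the PID below is just a placeholder; double-check which process is holding the GPU before killing anything):

# check the AIAA/Flask log for the actual error
curl "http://127.0.0.1:5000/logs?lines=-1"
# list the processes currently using the GPUs
nvidia-smi
# stop the offending process to free the GPU memory
kill -9 <PID>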

Hi Ray,

Thanks for replying, and thanks for updating with your solution.
Let us know if you have any more questions.

Thanks

Hello all! I have downloaded a ready model from NGC and want to speed up the startup of the Clara Train SDK container so that models are loaded automatically after startup. Which command should I execute to load a local NGC model into the annotation server? Do I need to change any config within the model?
Could you provide the documentation page?
Thanks a lot!

If you have already downloaded the NGC models to your local path, you can avoid downloading them again by using the following commands.

If you have the MMAR archive:

curl -X PUT "http://127.0.0.1:5000/admin/model/segmentation_ct_spleen" \
     -F "data=@segmentation_ct_spleen_mmar.tgz"

Another way (if the MMAR is unpacked):

curl -X PUT "http://127.0.0.1:5000/admin/model/segmentation_ct_spleen" \
     -F "config=@configs/config-aiaa.json;type=application/json" \
     -F "data=@models/model.trt.pb"

Otherwise, the following command is still the easiest option (less to worry about); just let AIAA download the model:

curl -X PUT "http://0.0.0.0:5000/admin/model/segmentation_ct_spleen" \
     -H "accept: application/json" \
     -H "Content-Type: application/json" \
     -d '{"path":"nvidia/med/segmentation_ct_spleen","version":"1"}'

But when you run the AIAA server, provide a workspace path (mounted from the docker host) so that every time you run the AIAA docker (kill/start) you don't have to install the models again.

Run AIAA Server with Advanced options (e.g. mount workspace from host machine to persist models/logs/configs)

export AIAA_SERVER_PORT=5000
export LOCAL_WORKSPACE=/var/nvidia/aiaa
export REMOTE_WORKSPACE=/workspace

docker run $NVIDIA_RUNTIME \
  -it --rm -p $AIAA_SERVER_PORT:$AIAA_SERVER_PORT \
  -v $LOCAL_WORKSPACE:$REMOTE_WORKSPACE \
  $DOCKER_IMAGE \
  /bin/bash

start_aas.sh --workspace $REMOTE_WORKSPACE --port $AIAA_SERVER_PORT
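
With the workspace mounted like this, models you install should survive container restarts; a quick sanity check after bringing the server back up (host/port as configured above) would be:

curl "http://127.0.0.1:$AIAA_SERVER_PORT/v1/models"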

I have tested this approach. It works perfectly, thanks a lot!
Could you provide links to the documentation where I can read more about it?

Hi dmpopof,

here is the link: https://docs.nvidia.com/clara/aiaa/tlt-mi-ai-an-sdk-getting-started/index.html#aiaa_byom