TAO API dataset upload: overflow error (string longer than 2147483647 bytes)

Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc)

A100

• How to reproduce the issue ?
run TAO/getting_started_v4.0.0/notebooks/tao_api_starter_kit/api/end2end/ssd.ipynb

I get to the point where I upload the data (approx 5 GB):

# Upload
files = [("file",open(train_dataset_path,"rb"))]

endpoint = f"{base_url}/dataset/{dataset_id}/upload"

response = requests.post(endpoint, files=files, headers=headers, verify=rootca)

print(response)
print(response.json())

I get this overflow error

OverflowError                             Traceback (most recent call last)
/tmp/ipykernel_147669/3804223585.py in 
      4 endpoint = f"{base_url}/dataset/{dataset_id}/upload"
      5 
----> 6 response = requests.post(endpoint, files=files, headers=headers, verify=rootca)
      7 
      8 print(response)

~/.pyenv/versions/3.7.16/envs/PY37/lib/python3.7/site-packages/requests/api.py in post(url, data, json, **kwargs)
    113     """
    114 
--> 115     return request("post", url, data=data, json=json, **kwargs)
    116 
    117 

~/.pyenv/versions/3.7.16/envs/PY37/lib/python3.7/site-packages/requests/api.py in request(method, url, **kwargs)
     57     # cases, and look like a memory leak in others.
     58     with sessions.Session() as session:
---> 59         return session.request(method=method, url=url, **kwargs)
     60 
     61 

~/.pyenv/versions/3.7.16/envs/PY37/lib/python3.7/site-packages/requests/sessions.py in request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
    585         }
...
-> 1003             return self._sslobj.write(data)
   1004         else:
   1005             return super().send(data, flags)

OverflowError: string longer than 2147483647 bytes

I assumed (even though I don't understand why this is happening) that this was because of the size of the training dataset, and modified the code to:

with open(train_dataset_path, "rb") as f:
    response = requests.post(endpoint, headers=headers, data=f, stream=True, verify=rootca) # stream with 1mb chunks
    response.raise_for_status()

With this I get a 500 (Internal Server Error), possibly because the raw streamed body is no longer the multipart/form-data "file" field the upload endpoint expects:

HTTPError                                 Traceback (most recent call last)
/tmp/ipykernel_189197/1635968353.py in 
      1 with open(train_dataset_path, "rb") as f:
      2     response = requests.post(endpoint, headers=headers, data=f, stream=True, verify=rootca) # stream with 1mb chunks
----> 3     response.raise_for_status()

~/.pyenv/versions/3.7.16/envs/PY37/lib/python3.7/site-packages/requests/models.py in raise_for_status(self)
   1019 
   1020         if http_error_msg:
-> 1021             raise HTTPError(http_error_msg, response=self)
   1022 
   1023     def close(self):

HTTPError: 500 Server Error: INTERNAL SERVER ERROR for url: https://aisrv.gnet.lan:31549/api/v1/user/f2d3c55a-f3dd-5dff-badc-851e27460122/dataset/32a03a22-8a6f-4a4c-a6c0-93373e9638b0/upload
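
One thing I have not tried: streaming the multipart body instead of building it in memory, e.g. with requests-toolbelt's MultipartEncoder, which keeps each SSL write small while still sending the "file" form field. A rough, untested sketch, assuming requests-toolbelt is installed and reusing the notebook's variables:

# Untested sketch: stream the multipart body in small reads instead of
# building one >2 GB string in memory (assumes the endpoint accepts a
# normal multipart "file" field, as in the notebook's original request).
from requests_toolbelt.multipart.encoder import MultipartEncoder

encoder = MultipartEncoder(
    fields={"file": ("train.tar.gz", open(train_dataset_path, "rb"), "application/gzip")}
)
upload_headers = dict(headers)               # keep the auth headers from the notebook
upload_headers["Content-Type"] = encoder.content_type

response = requests.post(endpoint, data=encoder, headers=upload_headers, verify=rootca)
print(response)

With data= pointing at a file-like encoder, requests/urllib3 read and send it in small blocks rather than as one giant string, which is what overflows the pre-3.10 SSL write.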

The setup is a k8s cluster with a master (24-core CPU server with 16 GB RAM, Ubuntu Server 22.04) and a GPU node (NVIDIA DGX A100 Station).

The jupyter notebook server is run within the DGX (virtualenv python 3.7.16).

Cheers,
Ganindu.

P.S.

However, the smaller evaluation dataset (approx 800 MB) uploaded successfully (Response 201):

files = [("file",open(eval_dataset_path,"rb"))]

endpoint = f"{base_url}/dataset/{eval_dataset_id}/upload"

response = requests.post(endpoint, files=files, headers=headers, verify=rootca)

print(response)
print(response.json())

How about uploading a smaller dataset, for example, 500 MB?

That is what I actually did!

I split the dataset into smaller chunks (the rule, for KITTI, is that each chunk has its own images and labels subdirectories) and do a POST request for each chunk (tar.gz). That requirement is why the earlier chunking, whether via streaming or by naively splitting the archive, did not work; see the sketch below.

I think as long as the image/label pairs are unique, the data will append.
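
A Python sketch of the same idea (my real version was a shell script; the 1 GB threshold and the paths are just illustrative, and I assume a flat images/ + labels/ KITTI layout with matching basenames):

# Python sketch of the splitting idea (my real version was a shell script).
# Assumes a KITTI-style layout: every image in images/ has a matching
# label file in labels/ with the same basename.
import os
import tarfile

def split_kitti_dataset(src_dir, out_dir, max_bytes=1024 * 1024 * 1024):
    """Pack (image, label) pairs into chunk_NNN.tar.gz archives, each roughly
    max_bytes of raw data and each containing its own images/ and labels/ dirs."""
    os.makedirs(out_dir, exist_ok=True)
    images = sorted(os.listdir(os.path.join(src_dir, "images")))
    chunks, chunk_idx, current_size, tar = [], 0, 0, None
    for img in images:
        base, _ = os.path.splitext(img)
        img_path = os.path.join(src_dir, "images", img)
        lbl_path = os.path.join(src_dir, "labels", base + ".txt")
        pair_size = os.path.getsize(img_path) + os.path.getsize(lbl_path)
        if tar is None or current_size + pair_size > max_bytes:
            if tar is not None:
                tar.close()
            chunk_path = os.path.join(out_dir, f"chunk_{chunk_idx:03d}.tar.gz")
            chunks.append(chunk_path)
            tar = tarfile.open(chunk_path, "w:gz")
            chunk_idx += 1
            current_size = 0
        tar.add(img_path, arcname=os.path.join("images", img))
        tar.add(lbl_path, arcname=os.path.join("labels", base + ".txt"))
        current_size += pair_size
    if tar is not None:
        tar.close()
    return chunks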

Also, I knew where the storage was mounted on the pod, so I could have put the data there directly.

Finally, I think changing ds_upload.py in the handler (to take in pre-split datasets) or /etc/nginx/nginx.conf (to make streaming possible) might help, but that kind of breaks the nice things about Kubernetes, so I didn't try it.

I only asked the question because I was wondering how the people who wrote the example notebook did it (even though I think the limitation is in the Python requests library).

Thanks for the answer @Morganh. If that's the official recommendation I will accept it, though I'm curious how they did it in the example notebook, because they would have downloaded the same 5 GB dataset.

2147483647 bytes is 2 GB (2^31 − 1). So I suggest using chunks smaller than 2 GB.


To add more: you can use the TAO client instead of the TAO API.

I went for splitting the data to accommodate the Python requests library (in my Python 3.7.7 env); I've been using the client before. Like I mentioned earlier, I was just curious how they make the example work as-is, because the dataset that gets downloaded from the URL ends up being around 5 GB (for the training set).

I think the simple solution (I suspect the example was made in a Python environment >= 3.10) is to use Python 3.10 or above, as mentioned in this post by IBM,

which says

This is caused by the Python requests library that uses a Python SSL library that does not support objects that are larger than 2 GB. This issue is fixed in Python 3.10.
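
So, as a sketch, one could gate the whole-file upload on the interpreter version and otherwise fall back to chunked uploads (the fallback body is left as a placeholder):

# Sketch: the whole-file multipart upload only works for >2 GB files on
# Python 3.10+, where the 2 GB SSL write limit was fixed.
import sys

if sys.version_info >= (3, 10):
    files = [("file", open(train_dataset_path, "rb"))]
    response = requests.post(endpoint, files=files, headers=headers, verify=rootca)
    response.raise_for_status()
else:
    # On older interpreters (like my 3.7 env), split the dataset into
    # <2 GB archives and upload each one -- see the TL;DR below.
    ...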

TL;DR:

The downloaded dataset in the example from

IMAGES_URL = "https://s3.eu-central-1.amazonaws.com/avg-kitti/data_object_image_2.zip"
LABELS_URL = "https://s3.eu-central-1.amazonaws.com/avg-kitti/data_object_label_2.zip"

is over the 2 GB limit (I think the training dataset gets to around 5+ GB if you just follow the steps in the example notebook).

My solution (I was using a Python 3.7.7 environment):

I made a shell script to split the large dataset into smaller chunks and upload them with subsequent calls (the chunks it makes are around 1 GB each).

You can edit the line below to change the chunk size (in bytes):

max_size=$((1024 * 1024 * 1024))

and then POSTed the chunks to build up the dataset (POST until you run out of chunks), roughly as sketched below.
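
The upload loop itself is just something like this (the chunk_*.tar.gz names and the split_output/ directory are whatever the split script produced):

# Sketch of the per-chunk upload loop; each archive is under the 2 GB
# requests/SSL limit and gets POSTed to the same dataset ID.
import glob

chunk_files = sorted(glob.glob("split_output/chunk_*.tar.gz"))  # illustrative path
for chunk in chunk_files:
    with open(chunk, "rb") as f:
        response = requests.post(
            f"{base_url}/dataset/{dataset_id}/upload",
            files=[("file", f)],
            headers=headers,
            verify=rootca,
        )
    print(chunk, response.status_code)
    response.raise_for_status()   # stop if any chunk fails to upload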
