TAO API dataset upload: overflow error (string longer than 2147483647 bytes)

• Hardware (T4/V100/Xavier/Nano/etc)


• How to reproduce the issue ?
run TAO/getting_started_v4.0.0/notebooks/tao_api_starter_kit/api/end2end/ssd.ipynb

I get to the point where I upload the data (approx. 5 GB)

# Upload
files = [("file",open(train_dataset_path,"rb"))]

endpoint = f"{base_url}/dataset/{dataset_id}/upload"

response = requests.post(endpoint, files=files, headers=headers, verify=rootca)


I get this overflow error

OverflowError                             Traceback (most recent call last)
/tmp/ipykernel_147669/3804223585.py in 
      4 endpoint = f"{base_url}/dataset/{dataset_id}/upload"
----> 6 response = requests.post(endpoint, files=files, headers=headers, verify=rootca)
      8 print(response)

~/.pyenv/versions/3.7.16/envs/PY37/lib/python3.7/site-packages/requests/api.py in post(url, data, json, **kwargs)
    113     """
--> 115     return request("post", url, data=data, json=json, **kwargs)

~/.pyenv/versions/3.7.16/envs/PY37/lib/python3.7/site-packages/requests/api.py in request(method, url, **kwargs)
     57     # cases, and look like a memory leak in others.
     58     with sessions.Session() as session:
---> 59         return session.request(method=method, url=url, **kwargs)

~/.pyenv/versions/3.7.16/envs/PY37/lib/python3.7/site-packages/requests/sessions.py in request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
    585         }
-> 1003             return self._sslobj.write(data)
   1004         else:
   1005             return super().send(data, flags)

OverflowError: string longer than 2147483647 bytes

I assumed this was because of the size of the training dataset (even though I don't understand why it happens) and modified the code to:

with open(train_dataset_path, "rb") as f:
    response = requests.post(endpoint, headers=headers, data=f, stream=True, verify=rootca) # stream with 1mb chunks

With this I get a code 500 (internal server error)

HTTPError                                 Traceback (most recent call last)
/tmp/ipykernel_189197/1635968353.py in 
      1 with open(train_dataset_path, "rb") as f:
      2     response = requests.post(endpoint, headers=headers, data=f, stream=True, verify=rootca) # stream with 1mb chunks
----> 3     response.raise_for_status()

~/.pyenv/versions/3.7.16/envs/PY37/lib/python3.7/site-packages/requests/models.py in raise_for_status(self)
   1020         if http_error_msg:
-> 1021             raise HTTPError(http_error_msg, response=self)
   1023     def close(self):

HTTPError: 500 Server Error: INTERNAL SERVER ERROR for url: https://aisrv.gnet.lan:31549/api/v1/user/f2d3c55a-f3dd-5dff-badc-851e27460122/dataset/32a03a22-8a6f-4a4c-a6c0-93373e9638b0/upload

The setup is a k8s cluster with a master node (24-core CPU server with 16 GB RAM, Ubuntu Server 22.04) and a GPU node (NVIDIA DGX A100 Station).

The Jupyter notebook server runs on the DGX (virtualenv, Python 3.7.16).



However, the smaller evaluation dataset (approx. 800 MB) uploaded successfully (Response 201):

files = [("file",open(eval_dataset_path,"rb"))]

endpoint = f"{base_url}/dataset/{eval_dataset_id}/upload"

response = requests.post(endpoint, files=files, headers=headers, verify=rootca)


How about uploading a smaller dataset, for example 500 MB?

That is what I actually did!

I split the dataset into smaller chunks and did a POST request for each chunk (tar.gz). The rule (for KITTI) is that each chunk must have images and labels subdirectories, which is why the earlier attempts (chunked streaming, or splitting the archive itself) did not work.

I think that as long as the image/label pairs are unique, the data gets appended.
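A minimal Python sketch of that splitting step, in case it helps someone else. It assumes a KITTI-style layout where each image in the images directory has a label with the same stem and a .txt extension; the function names (plan_chunks, write_chunks) and the chunk_NNN.tar.gz naming are made up for illustration. The size budget is computed from the uncompressed file sizes, so the resulting archives only approximately match it.

```python
import tarfile
from pathlib import Path

MAX_CHUNK_BYTES = 1024 * 1024 * 1024  # about 1 GiB per chunk (assumed budget)


def plan_chunks(images_dir, labels_dir, max_bytes=MAX_CHUNK_BYTES):
    """Group (image, label) pairs so each group's summed file size stays within max_bytes."""
    chunks, current, current_size = [], [], 0
    for image in sorted(Path(images_dir).iterdir()):
        label = Path(labels_dir) / (image.stem + ".txt")
        if not image.is_file() or not label.is_file():
            continue  # skip anything that is not a complete image/label pair
        pair_size = image.stat().st_size + label.stat().st_size
        # Start a new chunk when adding this pair would exceed the budget.
        if current and current_size + pair_size > max_bytes:
            chunks.append(current)
            current, current_size = [], 0
        current.append((image, label))
        current_size += pair_size
    if current:
        chunks.append(current)
    return chunks


def write_chunks(chunks, out_dir):
    """Write one tar.gz per chunk, each with its own images/ and labels/ subdirectories."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    archives = []
    for index, pairs in enumerate(chunks):
        archive = out_dir / f"chunk_{index:03d}.tar.gz"
        with tarfile.open(str(archive), "w:gz") as tar:
            for image, label in pairs:
                tar.add(str(image), arcname=f"images/{image.name}")
                tar.add(str(label), arcname=f"labels/{label.name}")
        archives.append(archive)
    return archives
```

Because every pair goes into exactly one chunk, the image/label pairs stay unique across archives.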

Also, I knew where the storage was mounted in the pod, so I could have put the data there directly.

Finally, I think changing ds_upload.py in the handler (to accept pre-split datasets) or /etc/nginx/nginx.conf (to make streaming possible) might help, but that kind of breaks the nice things about Kubernetes, so I didn’t try it.

I only asked the question because I was wondering how the people who wrote the example notebook did it (even though I think the limitation is in the Python requests library).

Thanks for the answer @Morganh. If that’s the official recommendation I will accept it, though I’m curious how they did it in the example notebook, because they would have downloaded the same 5 GB dataset.

2147483647 bytes is 2 GB (2^31 - 1 bytes, to be exact). So I suggest uploading chunks that are each smaller than 2 GB.

In addition, you can use the TAO client instead of the TAO API.

I went for splitting the data to accommodate the Python requests library (in my Python 3.7.7 env); I’ve been using the client before. Like I mentioned earlier, I was just curious how they make the example work as it is, because the dataset that gets downloaded from the URL ends up being around 5 GB (in the training set).

I think the simple solution is to use Python 3.10 or above (I think the example was made in a Python environment >= 3.10), as mentioned in this post by IBM

which says

This is caused by the Python requests library that uses a Python SSL library that does not support objects that are larger than 2 GB. This issue is fixed in Python 3.10.
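A small sketch of how one might check up front whether an upload is likely to hit this limit. The 2147483647 figure comes from the OverflowError above, and the 3.10 cutoff is taken from the IBM note quoted here rather than verified independently; the function names are made up for illustration. The multipart body is slightly larger than the file itself, so files just under the limit may still fail.

```python
import os
import sys

# 2**31 - 1 bytes, the figure reported in the OverflowError.
SSL_WRITE_LIMIT = 2**31 - 1


def needs_chunking(size_bytes, version_info=sys.version_info):
    """Return True when a single-request upload of size_bytes is likely to overflow.

    Assumes, per the quoted note, that interpreters older than 3.10 are affected.
    """
    return tuple(version_info[:2]) < (3, 10) and size_bytes > SSL_WRITE_LIMIT


def file_needs_chunking(path):
    """Same check, using the on-disk size of the archive about to be uploaded."""
    return needs_chunking(os.path.getsize(path))
```

For example, file_needs_chunking(train_dataset_path) could be called before the upload cell to decide whether to split the archive first.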


The downloaded dataset in the example from

IMAGES_URL = "https://s3.eu-central-1.amazonaws.com/avg-kitti/data_object_image_2.zip"
LABELS_URL = "https://s3.eu-central-1.amazonaws.com/avg-kitti/data_object_label_2.zip"

is over the 2 GB limit (I think the training dataset gets to around 5 GB or more if you just follow the steps in the example notebook).

My solution (I was using a Python 3.7.7 environment):

I made this shell script to split a large dataset into smaller chunks and upload them with subsequent calls (the chunks it makes are around 1 GB).

You can edit the line below to change that (size in bytes):

max_size=$((1024 * 1024 * 1024))

I then POSTed the chunks to build the dataset (POST until you run out of chunks).
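For reference, the POST loop can be sketched in Python as well, using the same requests.post call shape as the notebook's upload cell (endpoint, headers and rootca as defined there). The upload_chunks name and the post parameter are hypothetical; the latter exists only so a stand-in can be substituted when trying this out. Whether the server appends successive uploads to the same dataset is my observation, not documented behaviour.

```python
def upload_chunks(archives, endpoint, headers, verify, post=None):
    """POST each archive as a multipart 'file' field and return the status codes.

    Raises on the first failed upload so a broken chunk is not silently skipped.
    """
    if post is None:
        import requests  # imported lazily so a stand-in can be passed instead

        post = requests.post
    statuses = []
    for archive in archives:
        # Each chunk is below the 2 GB limit, so the in-memory multipart body is too.
        with open(archive, "rb") as handle:
            response = post(
                endpoint,
                files=[("file", handle)],
                headers=headers,
                verify=verify,
            )
        response.raise_for_status()
        statuses.append(response.status_code)
    return statuses
```

Usage would look like upload_chunks(chunk_paths, endpoint, headers, rootca), where chunk_paths is the list of tar.gz chunks.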

