TAO API dataset upload: overflow error (string longer than 2147483647 bytes)

Please provide the following information when requesting support.

• Hardware (T4/V100/Xavier/Nano/etc)

A100

• How to reproduce the issue ?
run TAO/getting_started_v4.0.0/notebooks/tao_api_starter_kit/api/end2end/ssd.ipynb

I get to the point where I upload the data (approx 5 GB):

# Upload
files = [("file",open(train_dataset_path,"rb"))]

endpoint = f"{base_url}/dataset/{dataset_id}/upload"

response = requests.post(endpoint, files=files, headers=headers, verify=rootca)

print(response)
print(response.json())

I get this overflow error

OverflowError                             Traceback (most recent call last)
/tmp/ipykernel_147669/3804223585.py in 
      4 endpoint = f"{base_url}/dataset/{dataset_id}/upload"
      5 
----> 6 response = requests.post(endpoint, files=files, headers=headers, verify=rootca)
      7 
      8 print(response)

~/.pyenv/versions/3.7.16/envs/PY37/lib/python3.7/site-packages/requests/api.py in post(url, data, json, **kwargs)
    113     """
    114 
--> 115     return request("post", url, data=data, json=json, **kwargs)
    116 
    117 

~/.pyenv/versions/3.7.16/envs/PY37/lib/python3.7/site-packages/requests/api.py in request(method, url, **kwargs)
     57     # cases, and look like a memory leak in others.
     58     with sessions.Session() as session:
---> 59         return session.request(method=method, url=url, **kwargs)
     60 
     61 

~/.pyenv/versions/3.7.16/envs/PY37/lib/python3.7/site-packages/requests/sessions.py in request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
    585         }
...
-> 1003             return self._sslobj.write(data)
   1004         else:
   1005             return super().send(data, flags)

OverflowError: string longer than 2147483647 bytes

I assumed (even though I don't understand why this is happening) that this was because of the size of the training dataset, and modified the code to:

with open(train_dataset_path, "rb") as f:
    response = requests.post(endpoint, headers=headers, data=f, stream=True, verify=rootca) # stream with 1mb chunks
    response.raise_for_status()

With this I get a 500 (Internal Server Error), possibly because the raw streamed body is no longer the multipart/form-data "file" field the upload endpoint expects:

HTTPError                                 Traceback (most recent call last)
/tmp/ipykernel_189197/1635968353.py in 
      1 with open(train_dataset_path, "rb") as f:
      2     response = requests.post(endpoint, headers=headers, data=f, stream=True, verify=rootca) # stream with 1mb chunks
----> 3     response.raise_for_status()

~/.pyenv/versions/3.7.16/envs/PY37/lib/python3.7/site-packages/requests/models.py in raise_for_status(self)
   1019 
   1020         if http_error_msg:
-> 1021             raise HTTPError(http_error_msg, response=self)
   1022 
   1023     def close(self):

HTTPError: 500 Server Error: INTERNAL SERVER ERROR for url: https://aisrv.gnet.lan:31549/api/v1/user/f2d3c55a-f3dd-5dff-badc-851e27460122/dataset/32a03a22-8a6f-4a4c-a6c0-93373e9638b0/upload
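
One thing I have not tried: streaming the multipart body instead of building it in memory, e.g. with requests-toolbelt's MultipartEncoder, which keeps each SSL write small while still sending the "file" form field. A rough, untested sketch, assuming requests-toolbelt is installed and reusing the notebook's variables:

# Untested sketch: stream the multipart body in small reads instead of
# building one >2 GB string in memory (assumes the endpoint accepts a
# normal multipart "file" field, as in the notebook's original request).
from requests_toolbelt.multipart.encoder import MultipartEncoder

encoder = MultipartEncoder(
    fields={"file": ("train.tar.gz", open(train_dataset_path, "rb"), "application/gzip")}
)
upload_headers = dict(headers)               # keep the auth headers from the notebook
upload_headers["Content-Type"] = encoder.content_type

response = requests.post(endpoint, data=encoder, headers=upload_headers, verify=rootca)
print(response)

With data= pointing at a file-like encoder, requests/urllib3 read and send it in small blocks rather than as one giant string, which is what overflows the pre-3.10 SSL write.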

The setup is a k8s cluster with a master (24-core CPU server with 16 GB RAM, Ubuntu Server 22.04) and a GPU node (NVIDIA DGX A100 Station).

The jupyter notebook server is run within the DGX (virtualenv python 3.7.16).

Cheers,
Ganindu.

P.S.

However, the smaller evaluation dataset (approx 800 MB) uploaded successfully (Response 201):

files = [("file",open(eval_dataset_path,"rb"))]

endpoint = f"{base_url}/dataset/{eval_dataset_id}/upload"

response = requests.post(endpoint, files=files, headers=headers, verify=rootca)

print(response)
print(response.json())

How about uploading a smaller dataset, for example, 500 MB?

That is what I actually did!

I split the dataset into smaller chunks (the rule, for KITTI, is that each chunk has its own images and labels subdirectories) and do a POST request for each chunk (tar.gz). That requirement is why the earlier chunking, whether via streaming or by naively splitting the archive, did not work; see the sketch below.

I think as long as the image/label pairs are unique, the data will append.
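
A Python sketch of the same idea (my real version was a shell script; the 1 GB threshold and the paths are just illustrative, and I assume a flat images/ + labels/ KITTI layout with matching basenames):

# Python sketch of the splitting idea (my real version was a shell script).
# Assumes a KITTI-style layout: every image in images/ has a matching
# label file in labels/ with the same basename.
import os
import tarfile

def split_kitti_dataset(src_dir, out_dir, max_bytes=1024 * 1024 * 1024):
    """Pack (image, label) pairs into chunk_NNN.tar.gz archives, each roughly
    max_bytes of raw data and each containing its own images/ and labels/ dirs."""
    os.makedirs(out_dir, exist_ok=True)
    images = sorted(os.listdir(os.path.join(src_dir, "images")))
    chunks, chunk_idx, current_size, tar = [], 0, 0, None
    for img in images:
        base, _ = os.path.splitext(img)
        img_path = os.path.join(src_dir, "images", img)
        lbl_path = os.path.join(src_dir, "labels", base + ".txt")
        pair_size = os.path.getsize(img_path) + os.path.getsize(lbl_path)
        if tar is None or current_size + pair_size > max_bytes:
            if tar is not None:
                tar.close()
            chunk_path = os.path.join(out_dir, f"chunk_{chunk_idx:03d}.tar.gz")
            chunks.append(chunk_path)
            tar = tarfile.open(chunk_path, "w:gz")
            chunk_idx += 1
            current_size = 0
        tar.add(img_path, arcname=os.path.join("images", img))
        tar.add(lbl_path, arcname=os.path.join("labels", base + ".txt"))
        current_size += pair_size
    if tar is not None:
        tar.close()
    return chunks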

Also, I knew where the storage was mounted on the pod, so I could have put the data there directly.

Finally, I think changing ds_upload.py in the handler (to take in pre-split datasets) or /etc/nginx/nginx.conf (to make streaming possible) might help, but that kind of breaks the nice things about Kubernetes, so I didn't try it.

I only asked the question because I was wondering how the people who wrote the example notebook did it (even though I think the limitation is in the Python requests library).

Thanks for the answer @Morganh. If that's the official recommendation I will accept it, though I'm curious how they did it in the example notebook, because they would have downloaded the same 5 GB dataset.

2147483647 bytes is 2 GB (2^31 − 1). So I suggest using chunks smaller than 2 GB.


To add more: you can use the TAO client instead of the TAO API.

I went for splitting the data to accommodate the Python requests library (in my Python 3.7.7 env); I've been using the client before. Like I mentioned earlier, I was just curious how they make the example work as-is, because the dataset that gets downloaded from the URL ends up being around 5 GB (for the training set).

I think the simple solution (I suspect the example was made in a Python environment >= 3.10) is to use Python 3.10 or above, as mentioned in this post by IBM,

which says

This is caused by the Python requests library that uses a Python SSL library that does not support objects that are larger than 2 GB. This issue is fixed in Python 3.10.
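
So, as a sketch, one could gate the whole-file upload on the interpreter version and otherwise fall back to chunked uploads (the fallback body is left as a placeholder):

# Sketch: the whole-file multipart upload only works for >2 GB files on
# Python 3.10+, where the 2 GB SSL write limit was fixed.
import sys

if sys.version_info >= (3, 10):
    files = [("file", open(train_dataset_path, "rb"))]
    response = requests.post(endpoint, files=files, headers=headers, verify=rootca)
    response.raise_for_status()
else:
    # On older interpreters (like my 3.7 env), split the dataset into
    # <2 GB archives and upload each one -- see the TL;DR below.
    ...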

TL;DR:

The downloaded dataset in the example from

IMAGES_URL = "https://s3.eu-central-1.amazonaws.com/avg-kitti/data_object_image_2.zip"
LABELS_URL = "https://s3.eu-central-1.amazonaws.com/avg-kitti/data_object_label_2.zip"

is over the 2 GB limit (I think the training dataset gets to around 5+ GB if you just follow the steps in the example notebook).

My solution (I was using a Python 3.7.7 environment):

I made a shell script to split the large dataset into smaller chunks and upload them with subsequent calls (the chunks it makes are around 1 GB each).

You can edit the line below to change the chunk size (in bytes):

max_size=$((1024 * 1024 * 1024))

and then POSTed the chunks to build up the dataset (POST until you run out of chunks), roughly as sketched below.
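
The upload loop itself is just something like this (the chunk_*.tar.gz names and the split_output/ directory are whatever the split script produced):

# Sketch of the per-chunk upload loop; each archive is under the 2 GB
# requests/SSL limit and gets POSTed to the same dataset ID.
import glob

chunk_files = sorted(glob.glob("split_output/chunk_*.tar.gz"))  # illustrative path
for chunk in chunk_files:
    with open(chunk, "rb") as f:
        response = requests.post(
            f"{base_url}/dataset/{dataset_id}/upload",
            files=[("file", f)],
            headers=headers,
            verify=rootca,
        )
    print(chunk, response.status_code)
    response.raise_for_status()   # stop if any chunk fails to upload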
