JetRacer: out of memory when training

RogerZ1128 · July 3, 2020, 6:37am

Hi,
I followed the instructions at GitHub - NVIDIA-AI-IOT/jetracer: An autonomous AI racecar using NVIDIA Jetson Nano and successfully ran both training and road following with the default resnet18. I noticed that the notebook offers several other models, and I'd like to try ones that use less system memory. As far as I know, AlexNet and SqueezeNet are lighter than resnet18. However, when I run training with these models, the system runs out of memory and crashes. What is the problem? Did I miss some settings?
Here is the relevant part of the training code.

import time

import ipywidgets
import torch
import torchvision
from IPython.display import display

device = torch.device('cuda')
output_dim = 2 * len(dataset.categories)  # x, y coordinate for each category

# ALEXNET
# model = torchvision.models.alexnet(pretrained=True)
# model.classifier[-1] = torch.nn.Linear(4096, output_dim)

# SQUEEZENET 
# model = torchvision.models.squeezenet1_1(pretrained=True)
# model.classifier[1] = torch.nn.Conv2d(512, output_dim, kernel_size=1)
# model.num_classes = len(dataset.categories)

# RESNET 18
model = torchvision.models.resnet18(pretrained=True)
model.fc = torch.nn.Linear(512, output_dim)

# RESNET 34
# model = torchvision.models.resnet34(pretrained=True)
# model.fc = torch.nn.Linear(512, output_dim)

# DENSENET 121
# model = torchvision.models.densenet121(pretrained=True)
# model.classifier = torch.nn.Linear(model.num_features, output_dim)

model = model.to(device)

model_save_button = ipywidgets.Button(description='save model')
model_load_button = ipywidgets.Button(description='load model')
model_path_widget = ipywidgets.Text(description='model path', value='road_following_model.pth')

def load_model(c):
    model.load_state_dict(torch.load(model_path_widget.value))
model_load_button.on_click(load_model)
    
def save_model(c):
    torch.save(model.state_dict(), model_path_widget.value)
model_save_button.on_click(save_model)

model_widget = ipywidgets.VBox([
    model_path_widget,
    ipywidgets.HBox([model_load_button, model_save_button])
])

display(model_widget)

BATCH_SIZE = 8

optimizer = torch.optim.Adam(model.parameters())
# optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)

epochs_widget = ipywidgets.IntText(description='epochs', value=1)
eval_button = ipywidgets.Button(description='evaluate')
train_button = ipywidgets.Button(description='train')
loss_widget = ipywidgets.FloatText(description='loss')
progress_widget = ipywidgets.FloatProgress(min=0.0, max=1.0, description='progress')

def train_eval(is_training):
    global BATCH_SIZE, LEARNING_RATE, MOMENTUM, model, dataset, optimizer, eval_button, train_button, accuracy_widget, loss_widget, progress_widget, state_widget
    
    try:
        train_loader = torch.utils.data.DataLoader(
            dataset,
            batch_size=BATCH_SIZE,
            shuffle=True
        )

        state_widget.value = 'stop'
        train_button.disabled = True
        eval_button.disabled = True
        time.sleep(1)

        if is_training:
            model = model.train()
        else:
            model = model.eval()

        while epochs_widget.value > 0:
            i = 0
            sum_loss = 0.0
            error_count = 0.0
            for images, category_idx, xy in iter(train_loader):
                # send data to device
                images = images.to(device)
                xy = xy.to(device)

                if is_training:
                    # zero gradients of parameters
                    optimizer.zero_grad()

                # execute model to get outputs
                outputs = model(images)

                # compute MSE loss over x, y coordinates for associated categories
                loss = 0.0
                for batch_idx, cat_idx in enumerate(list(category_idx.flatten())):
                    loss += torch.mean((outputs[batch_idx][2 * cat_idx:2 * cat_idx+2] - xy[batch_idx])**2)
                loss /= len(category_idx)

                if is_training:
                    # run backpropagation to accumulate gradients
                    loss.backward()

                    # step optimizer to adjust parameters
                    optimizer.step()

                # increment progress
                count = len(category_idx.flatten())
                i += count
                sum_loss += float(loss)
                progress_widget.value = i / len(dataset)
                loss_widget.value = sum_loss / i
                
            if is_training:
                epochs_widget.value = epochs_widget.value - 1
            else:
                break
    except Exception as e:
        # report the error (e.g. CUDA out of memory) instead of silently discarding it
        print(e)
    model = model.eval()

    train_button.disabled = False
    eval_button.disabled = False
    state_widget.value = 'live'
    
train_button.on_click(lambda c: train_eval(is_training=True))
eval_button.on_click(lambda c: train_eval(is_training=False))
    
train_eval_widget = ipywidgets.VBox([
    epochs_widget,
    progress_widget,
    loss_widget,
    ipywidgets.HBox([train_button, eval_button])
])

display(train_eval_widget)

Hope for replies, thanks.
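
A rough back-of-the-envelope check on the "lighter" assumption, as a sketch only. The parameter counts below are the commonly cited figures for the stock torchvision ImageNet models (the replaced output layers in the code above change them slightly), and the factor of 4 assumes float32 weights trained with Adam, which keeps two extra buffers per parameter. The helper name `adam_training_mib` is made up for this illustration. The estimate ignores activations, the CUDA context, and cuDNN workspace, so it is a lower bound and not a full explanation of the crash.

```python
# Approximate parameter counts of the stock torchvision ImageNet models.
# Assumption: the notebook's replaced output layers change these only slightly.
PARAM_COUNTS = {
    "alexnet": 61_100_840,
    "resnet18": 11_689_512,
    "squeezenet1_1": 1_235_496,
}

BYTES_PER_FLOAT32 = 4
# weights + gradients + Adam's two moment buffers (exp_avg, exp_avg_sq)
COPIES_WITH_ADAM = 4


def adam_training_mib(num_params):
    """Rough lower bound on parameter-related training memory, in MiB."""
    return num_params * BYTES_PER_FLOAT32 * COPIES_WITH_ADAM / 2**20


for name, count in PARAM_COUNTS.items():
    print(f"{name}: about {adam_training_mib(count):.0f} MiB")
```

By this measure AlexNet is considerably heavier than resnet18 in parameter memory, while SqueezeNet is much lighter, so any SqueezeNet failure would have to come from something this estimate does not cover.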

AastaLLL · July 3, 2020, 8:00am

Hi,

Would you mind rebooting the device and trying again?
Sometimes DL frameworks occupy too much memory, which leads to out-of-memory errors.

Thanks.
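
To see whether a reboot actually frees memory, one option is to compare the available system memory before and after. On the Jetson Nano the CPU and GPU share the same physical RAM, so the Linux `MemAvailable` figure is a reasonable first indicator. The sketch below is an illustration, not part of the JetRacer notebooks; the function names are hypothetical and it assumes a Linux system that exposes `/proc/meminfo`.

```python
def mem_available_mib(meminfo_text):
    """Return the MemAvailable value from /proc/meminfo-style text, in MiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            # The line looks like: "MemAvailable:    1234567 kB"
            return int(line.split()[1]) / 1024
    raise ValueError("MemAvailable not found")


def read_mem_available_mib(path="/proc/meminfo"):
    """Read the current MemAvailable value from the running system (Linux only)."""
    with open(path) as f:
        return mem_available_mib(f.read())
```

Calling `read_mem_available_mib()` in a notebook cell right after boot, and again just before pressing the train button, shows how much memory other processes (or a previously loaded model in another kernel) are already holding.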
