Performance of l4t-pytorch on CUDA and CPU

Hi,

I am using the l4t-pytorch image to run a PyTorch model on a Jetson AGX Xavier. The same code is run on CUDA and on the CPU with the nvpmodel power mode set to MAXN, but the CPU is far faster than CUDA. Is that expected?

Using cpu device
epoch : 1/10, time = 327.4649398326874, loss = 0.054947

Using cuda device
epoch : 1/10, time = 562.3225209712982, loss = 0.053142

Hi @nouuata, is this just the first epoch? Does the GPU time decrease after the first epoch?

What kind of model are you training?

Hi @dusty_nv,

No, the time doesn't improve significantly. The model is simple, the batch size is 32, and the input data is 10M points x 3 floats:

        self.hidden1 = Linear(n_inputs, 3)
        kaiming_uniform_(self.hidden1.weight, nonlinearity='relu')
        self.act1 = ReLU()
        self.hidden2 = Linear(3, 1)
        xavier_uniform_(self.hidden2.weight)
        self.act2 = Sigmoid()
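
As a back-of-the-envelope check on the numbers above: a rough sketch, assuming n_inputs is 3 and that each of the 10M points is visited once per epoch (neither is confirmed in the post). Under those assumptions the model has only 16 trainable parameters, and one batch takes roughly 1 ms on the CPU and just under 2 ms on CUDA.

```python
import math

# Figures quoted in the post
num_points = 10_000_000          # "10M points"
batch_size = 32
cpu_epoch_s = 327.4649398326874   # epoch 1 on cpu
cuda_epoch_s = 562.3225209712982  # epoch 1 on cuda

# Assumption: each point has 3 input features, so n_inputs == 3
n_inputs = 3

batches_per_epoch = math.ceil(num_points / batch_size)
cpu_ms_per_batch = 1000 * cpu_epoch_s / batches_per_epoch
cuda_ms_per_batch = 1000 * cuda_epoch_s / batches_per_epoch

# Linear(n_inputs, 3) and Linear(3, 1): weights plus biases
num_params = (n_inputs * 3 + 3) + (3 * 1 + 1)

print(f"batches per epoch: {batches_per_epoch}")
print(f"cpu:  {cpu_ms_per_batch:.3f} ms per batch")
print(f"cuda: {cuda_ms_per_batch:.3f} ms per batch")
print(f"trainable parameters: {num_params}")
```

With this little arithmetic per batch, fixed per-iteration costs (data loading, host-to-device copies, kernel launches) plausibly dominate the runtime, which would be consistent with the GPU not helping here.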

I guess the slowdown must come from the following lines, which copy each batch to the device:

inputs = inputs.to(device)
...
targets = targets.to(device)

from the following snippet:

    model = Model().to(device)
    criterion = BCELoss()
    learning_rate = 1e-4
    optimizer = Adam(model.parameters(), lr=learning_rate)

    for epoch in range(epochs):
        for i, (inputs, targets) in enumerate(train_dl):
            inputs = inputs.to(device)
            optimizer.zero_grad()
            outputs = model(inputs)
            targets = targets.unsqueeze(1)
            targets = targets.float()
            targets = targets.to(device)
            loss = criterion(outputs, targets)
            loss.backward()
            optimizer.step()
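
The timing code itself is not shown in the post. To rule out measurement effects, here is a minimal sketch of per-epoch timing. CUDA kernels are launched asynchronously, so the sketch synchronizes before reading the clock. The names timed_epoch and run_epoch are hypothetical, not part of the original code.

```python
import time

import torch


def timed_epoch(run_epoch, device):
    """Call the hypothetical `run_epoch()` once and return (result, elapsed seconds)."""
    if device.type == "cuda":
        torch.cuda.synchronize()  # finish any GPU work queued before the epoch
    start = time.perf_counter()
    result = run_epoch()
    if device.type == "cuda":
        torch.cuda.synchronize()  # wait for the epoch's GPU work to complete
    return result, time.perf_counter() - start


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Placeholder epoch: replace the lambda with a function that runs the loop above.
_, seconds = timed_epoch(lambda: None, device)
print(f"time = {seconds:.6f}")
```

Timing each epoch separately this way also shows whether only the first epoch is slow, for example because of one-time CUDA initialization.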

By the way, how big and how fast is the GPU RAM on the AGX Xavier? Does it use the regular CUDA memory model?

You can also call .cuda() or .to(device) on the criterion, and create your DataLoader with pin_memory=True. It may be that this model is too small and simple to benefit much from GPU acceleration. If you were to time the difference with a convnet (for example ResNet18), you should find the GPU to be much faster than the CPU.
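
A minimal sketch of these suggestions, assuming synthetic data in place of the real dataset and an nn.Sequential stand-in for the model from the thread. The larger batch size (1024 instead of 32) and non_blocking=True are additions of this sketch, not something suggested above, and changing the batch size may also require retuning the learning rate.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Synthetic stand-in for the real data (assumption): 3 input floats per point.
x = torch.rand(4096, 3)
y = (x.sum(dim=1) > 1.5).float().unsqueeze(1)  # targets already shaped (N, 1)

train_dl = DataLoader(
    TensorDataset(x, y),
    batch_size=1024,                       # assumption: far fewer, larger batches
    shuffle=True,
    pin_memory=torch.cuda.is_available(),  # pinned host memory only matters with CUDA
)

# Same layer sizes as the model in the thread, with n_inputs assumed to be 3.
model = nn.Sequential(nn.Linear(3, 3), nn.ReLU(), nn.Linear(3, 1), nn.Sigmoid()).to(device)
criterion = nn.BCELoss().to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

num_batches = 0
for inputs, targets in train_dl:
    # non_blocking=True lets copies from pinned memory overlap with GPU work
    inputs = inputs.to(device, non_blocking=True)
    targets = targets.to(device, non_blocking=True)
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    loss.backward()
    optimizer.step()
    num_batches += 1

print(f"batches: {num_batches}, last loss: {loss.item():.6f}")
```

Shaping and casting the targets once, before building the dataset, also removes the per-batch unsqueeze and float calls from the loop.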

The GPU shares the same physical RAM with the CPU on Jetson, so on AGX Xavier it is the full 32GB (minus a small amount reserved for the kernel).

Hi @dusty_nv,

I tried all the suggestions. None of them improved the performance, so the cause must be the simplicity of the model.

This is quite interesting, because I had created a foreground-background subtraction algorithm that is quite accurate and almost linear, and that runs in 2 ms with OpenCV. I thought I could keep or improve the performance while reducing the computation time with a simple ANN, so I am really surprised by this result.
