CUDA out of memory error after adding a PointwiseMonitor

Hi,

I trained a model using Modulus 22.09 to predict the flow field for a 2D airfoil with varying angle of attack and inlet velocity. I am trying to calculate the error between the model’s predictions and the validation data, specifically the error in u, v, and p. To do this, I created a PointwiseMonitor to calculate the desired error values.

However, I encountered the following CUDA out of memory error:

RuntimeError: CUDA out of memory. Tried to allocate 218.00 MiB (GPU 0; 31.74 GiB total capacity; 30.33 GiB already allocated; 183.38 MiB free; 30.61 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

The PointwiseMonitor is implemented as follows:

# openfoam_var is the validation data
temp1 = {
    key: value
    for key, value in openfoam_var.items()
    if key in ["x", "y", "aoa", "vel_in", "u", "v", "p"]
}

error_uvp = PointwiseMonitor(
    invar=temp1,
    output_names=["u__x", "u__y", "v__x", "v__y", "p"],
    metrics={
        "error_u" + str(num_aoa) + "_" + str(vel_in):
            lambda var: torch.mean(torch.sqrt(var["u__x"] ** 2 + var["u__y"] ** 2) - var["u_op"]),
        "error_v" + str(num_aoa) + "_" + str(vel_in):
            lambda var: torch.mean(torch.sqrt(var["v__x"] ** 2 + var["v__y"] ** 2) - var["v_op"]),
        "error_p" + str(num_aoa) + "_" + str(vel_in):
            lambda var: torch.mean(var["p"] - var["p_op"]),
    },
    nodes=flow_nodes,
)
domain.add_monitor(error_uvp)

I suspect this section of the code is the cause, but I am not sure why it happens. Is there a better way to calculate the error between the model output and the target values? Any help would be appreciated.

Thank you.

Hi @TinsenLY

I would encourage you to use validators for comparing target data with model output. The annular ring example is a good reference; it loads OpenFOAM data from a CSV file.
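For reference, here is a minimal sketch of that loading pattern. The CSV path and the column mapping are placeholders for your airfoil data, and the import paths are the ones used in Modulus 22.09, so they may differ in other releases:

from modulus.hydra import to_absolute_path
from modulus.utils.io import csv_to_dict

# Map the CSV column names to the variable names the model expects
# (placeholder column names; adjust to match your exported OpenFOAM file)
mapping = {"Points:0": "x", "Points:1": "y", "U:0": "u", "U:1": "v", "p": "p"}
openfoam_var = csv_to_dict(to_absolute_path("openfoam/airfoil_validation.csv"), mapping)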

Monitors have no batch processing; they are typically meant for just a few points of interest, such as the pressure at the nose of an object in a fluid flow, so your monitor tries to evaluate the entire validation dataset in a single pass. That is why you are seeing the memory issue. Validators process the data in batches whose size you can define.
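As a rough sketch of what that could look like with your data, assuming openfoam_var already contains the x, y, aoa, vel_in inputs and the u, v, p targets (the batch_size value is only an example, and the import path is for Modulus 22.09):

from modulus.domain.validator import PointwiseValidator

# Split the validation data into network inputs and target outputs
openfoam_invar = {
    key: value
    for key, value in openfoam_var.items()
    if key in ["x", "y", "aoa", "vel_in"]
}
openfoam_outvar = {
    key: value
    for key, value in openfoam_var.items()
    if key in ["u", "v", "p"]
}

# The validator runs flow_nodes on the inputs in batches and records
# the difference between the predictions and the target data
openfoam_validator = PointwiseValidator(
    nodes=flow_nodes,
    invar=openfoam_invar,
    true_outvar=openfoam_outvar,
    batch_size=1024,  # reduce this if you still run out of GPU memory
)
domain.add_validator(openfoam_validator)

The validator records the comparison between predictions and targets itself, so the hand-written error metrics in the monitor are no longer needed.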