Pycuda :: OverflowError: can't convert negative value to unsigned int

I have trained a classification model with the PyTorch backend in TAO Toolkit 5.0 and generated a TensorRT engine. When I run inference with the engine in PyCUDA using the following code:

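import cv2
import numpy as np
import pycuda.autoinit  # assumed: creates the CUDA context (not shown in the original post)
import pycuda.driver as cuda
import tensorrt as trt

# Load the TRT engine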
engine_file = '/home/nvidia/pycuda/FAN/classification_model_export_9.engine'
with open(engine_file, 'rb') as f, trt.Runtime(trt.Logger()) as runtime:
    engine_data = f.read()
    engine = runtime.deserialize_cuda_engine(engine_data)

# Create the context and allocate memory
context = engine.create_execution_context()
inputs = []
outputs = []
bindings = []
stream = cuda.Stream()

for binding in engine:
    size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size
    dtype = engine.get_binding_dtype(binding)
    # Allocate device memory for inputs/outputs
    device_mem = cuda.mem_alloc(size * trt.float32.itemsize)
    # Append to the appropriate list
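    if engine.binding_is_input(binding):
        inputs.append(device_mem)
    else:
        outputs.append(device_mem)
    # TensorRT expects the device pointers as plain integers
    bindings.append(int(device_mem))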

# Load the label file
label_file = '/home/nvidia/pycuda/FAN/labels.txt'
with open(label_file, 'r') as f:
    labels = [line.strip() for line in f]


def preprocess_image(image):
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
    resized_image = cv2.resize(image, (224, 224)).astype(np.float32)
    resized_image -= np.array([103.939, 116.779, 123.68], dtype=np.float32)
    resized_image = np.transpose(resized_image, (2, 0, 1))
    resized_image = np.expand_dims(resized_image, axis=0)
    return resized_image

def infer_image(image):
    print("Inferencing started")
    # Copy input image to device
    cuda.memcpy_htod_async(inputs[0], image.ravel(), stream)

    # Run inference
    context.execute_async(bindings=bindings, stream_handle=stream.handle)

    # Synchronize the stream
    stream.synchronize()

    # Get the output label
    output = np.empty(trt.volume(engine.get_binding_shape(engine[engine.num_bindings - 1])), dtype=np.float32)  # Output shape
    # Copy the result back to the host (blocking copy)
    cuda.memcpy_dtoh(output, outputs[0])

    # Get the predicted label
    label_id = np.argmax(output)

    # Return the predicted label
    return labels[label_id]

def classify_image(input_image):

    image = cv2.imread(input_image)

    # Preprocess the image
    preprocessed_image = preprocess_image(image)

    # Run inference on the preprocessed image
    predicted_label = infer_image(preprocessed_image)
    print("Predicted = ", predicted_label)
    return predicted_label

I get the following error:

[10/05/2023-19:54:40] [TRT] [E] 1: [raiiMyelinGraph.h::RAIIMyelinGraph::24] Error Code 1: Myelin (Compiled against cuBLASLt but running against cuBLASLt
/home/nvidia/pycuda/examples/ DeprecationWarning: Use get_tensor_shape instead.
  size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size
/home/nvidia/pycuda/examples/ DeprecationWarning: Use network created with NetworkDefinitionCreationFlag::EXPLICIT_BATCH flag instead.
  size = trt.volume(engine.get_binding_shape(binding)) * engine.max_batch_size
[10/05/2023-19:54:40] [TRT] [W] The getMaxBatchSize() function should not be used with an engine built from a network created with NetworkDefinitionCreationFlag::kEXPLICIT_BATCH flag. This function will always return 1.
/home/nvidia/pycuda/examples/ DeprecationWarning: Use get_tensor_dtype instead.
  dtype = engine.get_binding_dtype(binding)
Traceback (most recent call last):
  File "/home/nvidia/pycuda/examples/", line 104, in <module>
    device_mem = cuda.mem_alloc(size * trt.float32.itemsize)
OverflowError: can't convert negative value to unsigned int
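
One possible reading of this traceback, which nothing in this thread confirms: if the engine was exported with a dynamic batch dimension, the binding shape can contain -1, so the product of the dimensions (and therefore the byte count passed to cuda.mem_alloc) comes out negative. The sketch below reproduces only that arithmetic in plain Python. The shape (-1, 3, 224, 224) and the helper names volume and resolve_shape are assumptions for illustration, not values read from the engine.

```python
from math import prod

FLOAT32_ITEMSIZE = 4  # bytes per float32 element


def volume(shape):
    """Product of all dimensions of a shape."""
    return prod(shape)


def resolve_shape(shape, batch_size=1):
    """Replace dynamic (-1) dimensions with a concrete batch size.

    Only meaningful when the batch dimension is the sole dynamic one.
    """
    return tuple(batch_size if dim < 0 else dim for dim in shape)


# Hypothetical binding shape with a dynamic batch dimension.
dynamic_shape = (-1, 3, 224, 224)

raw_bytes = volume(dynamic_shape) * FLOAT32_ITEMSIZE
fixed_bytes = volume(resolve_shape(dynamic_shape)) * FLOAT32_ITEMSIZE

print(raw_bytes)    # negative, so it cannot be converted to an unsigned size
print(fixed_bytes)  # positive byte count for a batch of one
```

If this is the cause, the shape would have to be made concrete before allocating, for example by choosing a batch size that the engine's optimization profile allows.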

I have upgraded TensorRT to match the version used for the engine generated by TAO Toolkit. I have also upgraded CUDA to 12.0 and upgraded cuDNN.

The code previously ran fine with CUDA 11.8 and TensorRT with a different model trained with TAO 4, so I assume the code is okay, but there might be a version incompatibility with the exported model. The error "Compiled against cuBLASLt but running against cuBLASLt" hints at this, but I'm not sure how to check the cuBLAS version or how to change it. Please help.
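
On the question of checking the cuBLAS version, here is a minimal sketch that asks the cuBLASLt shared library for its version through ctypes. It assumes the library is discoverable under the name cublasLt on the loader search path; the helper name cublaslt_version is hypothetical. The library this Python process resolves is not necessarily the one TensorRT loads, so treat the result as a hint only.

```python
import ctypes
import ctypes.util


def cublaslt_version(library_name="cublasLt"):
    """Return the version reported by the resolved cuBLASLt library, or None."""
    path = ctypes.util.find_library(library_name)
    if path is None:
        return None  # library not found on the loader search path
    try:
        lib = ctypes.CDLL(path)
        get_version = lib.cublasLtGetVersion  # size_t cublasLtGetVersion(void)
    except (OSError, AttributeError):
        return None  # found but not loadable, or symbol missing
    get_version.restype = ctypes.c_size_t
    get_version.argtypes = []
    return int(get_version())


version = cublaslt_version()
print("cuBLASLt version:", version if version is not None else "not found")
```

Comparing this number with the cuBLASLt version the engine was built against (the two values printed in the Myelin error) would show whether the runtime libraries match.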

Please refer to the following similar issue, which may help you.

I ended up dropping PyCUDA and am now successfully using NVIDIA tao_deploy for PyTorch-based models.
