Python3 core dumped

Hello,

I am using DeepStream 6.0.1 (the official container) and the Python API on a Tesla T4. Recently, I introduced some code to support different secondary models on different cameras (as explained below). After this modification, my code started to crash with “Python 3 core dumped” errors and no further explanation. The application usually runs for at least a few minutes or hours before it crashes.

Below, I explain what I added to the code in order to support multiple cameras. Before this modification, I didn’t have any errors.

Suppose I am processing 2 cameras. Both cameras need to be processed by the same pgie, but:

  • camera1 needs to run sgie1
  • camera2 needs to run sgie2

Here’s what I have done: I set both sgie1 and sgie2 to run on the pgie by setting gie-unique-id=1 for the pgie and operate-on-gie-id=1 for sgie1 and sgie2. Then, I use some GStreamer identity elements to change the component id (obj_meta.unique_component_id) of each object before and after sgie1 and sgie2 so that the secondary models run only on the desired objects (see the wiring sketch after the list below).
More specifically:

  • Before and after every secondary model, I added a GStreamer identity element with a probe. So the pipeline looks like this: ... -> nvstreammux -> pgie -> pre_sgie1_identity -> sgie1 -> post_sgie1_identity -> pre_sgie2_identity -> sgie2 -> post_sgie2_identity
  • In the probe of pre_sgie1_identity I set obj_meta.unique_component_id=-1 on every object coming from camera2, so that sgie1 won’t run on objects from camera2
  • In the probe of pre_sgie2_identity I set obj_meta.unique_component_id=-1 on every object coming from camera1, so that sgie2 won’t run on objects from camera1
  • In the probes of post_sgie1_identity and post_sgie2_identity I set obj_meta.unique_component_id=1 on every object (all cameras) to “reset” the status of the data

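For context, here is a minimal sketch (not my actual code) of how the identity elements and their probes are wired up. The element names are placeholders, and gie-unique-id=1 / operate-on-gie-id=1 live in the nvinfer config files as described above:

import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst

Gst.init(None)


def identity_src_buffer_probe(pad, info, u_data):
    # Stand-in for the _identity_src_buffer_probe method shown further below.
    return Gst.PadProbeReturn.OK


# The pgie config file contains gie-unique-id=1; the sgie config files contain
# operate-on-gie-id=1, so by default both sgies would run on every pgie object.
pgie = Gst.ElementFactory.make("nvinfer", "pgie")
sgie1 = Gst.ElementFactory.make("nvinfer", "sgie1")
pre_sgie1_identity = Gst.ElementFactory.make("identity", "pre_sgie1_identity")
post_sgie1_identity = Gst.ElementFactory.make("identity", "post_sgie1_identity")

# Attach the probe to the src pad of every identity element.
for element in (pre_sgie1_identity, post_sgie1_identity):
    element.get_static_pad("src").add_probe(
        Gst.PadProbeType.BUFFER, identity_src_buffer_probe, 0)

# Linking order (the same pattern is repeated for sgie2):
# ... -> nvstreammux -> pgie -> pre_sgie1_identity -> sgie1 -> post_sgie1_identity
#     -> pre_sgie2_identity -> sgie2 -> post_sgie2_identity -> ...
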
Here’s the probe that I use:


    def _identity_src_buffer_probe(self, pad, info, u_data):
    
        gst_buffer = info.get_buffer()
        if not gst_buffer:
            logger.error("Unable to get GstBuffer ")
            # A pad probe callback should return a Gst.PadProbeReturn value
            return Gst.PadProbeReturn.OK

        # Retrieve batch metadata from the gst_buffer
        # Note that pyds.gst_buffer_get_nvds_batch_meta() expects the
        # C address of gst_buffer as input, which is obtained with hash(gst_buffer)
        batch_meta = pyds.gst_buffer_get_nvds_batch_meta(hash(gst_buffer))

        # Acquire lock
        pyds.nvds_acquire_meta_lock(batch_meta)

        # Get frame list
        l_frame = batch_meta.frame_meta_list

        batch_index = 0
        while l_frame is not None:

            # logger.info("Batch index", batch_index)
            batch_index += 1

            try:
                # Note that l_frame.data needs a cast to pyds.NvDsFrameMeta
                # The casting also keeps ownership of the underlying memory
                # in the C code, so the Python garbage collector will leave
                # it alone.
                frame_meta = pyds.NvDsFrameMeta.cast(l_frame.data)
            except StopIteration:
                break

            # Get pgie_id
            pgie_id = ......... # function to get the id of the pgie, it can return either 1 or -1

            # Iterate over objects
            l_obj = frame_meta.obj_meta_list
            while l_obj is not None:
                try:
                    # Casting l_obj.data to pyds.NvDsObjectMeta
                    obj_meta = pyds.NvDsObjectMeta.cast(l_obj.data)

                except StopIteration:
                    break

                # Change ID of the primary model
                obj_meta.unique_component_id = pgie_id

                try:
                    l_obj = l_obj.next
                except StopIteration:
                    break

            try:
                l_frame = l_frame.next
            except StopIteration:
                break

        # Release lock
        pyds.nvds_release_meta_lock(batch_meta)

        return Gst.PadProbeReturn.OK

As explained above, after introducing the identity elements and their probes, I see some Python 3 core dumped errors. I am assuming this happens because the probe accesses the metadata and maybe, after thousands or millions of executions, an edge case is hit that triggers a bug. It might not be the case, but I wanted to ask you whether you think the code above is correct from a theoretical standpoint.

I also realized that I don’t know if and when nvds_acquire_meta_lock is needed. I thought it would create a lock to prevent two parts of the code from accessing the same metadata at the same time. However, the following code can acquire the lock twice on the same NvDsBatchMeta instance:

import pyds

batch_meta = pyds.nvds_create_batch_meta(max_batch_size=16)

pyds.nvds_acquire_meta_lock(batch_meta)
print('acquiring lock 1')
pyds.nvds_acquire_meta_lock(batch_meta)
print('acquiring lock 2')

This code prints both “acquiring lock 1” and “acquiring lock 2”. Is this expected (the reason being that both calls to nvds_acquire_meta_lock originate from the same thread)? Am I using it correctly in the code above? The documentation only says that it is a lock to be acquired before updating the metadata; however, it does not seem to lock anything.

As always, thank you for your help!

Hello @mfoglio ,
To narrow down the issue, do you have a chance to analyze the coredump file (for example, by loading it into gdb together with the Python interpreter and printing a backtrace with bt)? Is there any information on which part/function causes the coredump?
Thanks.

We are not sure.

nvds_acquire_meta_lock is implemented in C; it calls g_rec_mutex_lock().

It takes effect across different threads; within the same thread the mutex is recursive, so it can be acquired again without blocking.
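In Python terms, that behaviour is roughly the one of threading.RLock; this is only an analogy, not the actual implementation:

import threading

# Rough analogy for a recursive mutex such as GLib's GRecMutex: the same
# thread may acquire it multiple times, while other threads block until it
# has been released the same number of times.
rlock = threading.RLock()

rlock.acquire()
print("acquired once")
rlock.acquire()   # succeeds: same thread, recursive acquisition
print("acquired twice")
rlock.release()
rlock.release()   # must be released as many times as it was acquired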

Hello, thank you @yingliu and @Fiona.Chen .

@Fiona.Chen : now I understand why the lock didn’t seem to do anything in my dummy example: it only takes effect across different threads. Thanks.

I am still facing the core dump error in my pipeline.

@yingliu I uploaded a coredump file here: core.182.zip - Google Drive . As a Python developer, I don’t know what to do with it.

I am confident that the core dump error is caused by the code I posted above. I think the error must be in _identity_src_buffer_probe. I say this for two reasons:

Reason 1
If I run the same pipeline without that probe (i.e. I don’t attach the probe to the identity element) I don’t have errors.

Reason 2
I tried replacing the identity elements in the code above with queue elements, because queues create new threads. The code still crashes with a core dump error.
I checked the dmesg -T output, and it contains the following line:

audit: type=1701 audit(1657573814.610:180): auid=4294967295 uid=0 gid=0 ses=4294967295 pid=2038280 comm="YOLO_V5_SECONDA" exe="/usr/bin/python3.6" sig=6 res=1

With respect to this line, YOLO_V5_SECONDA is the beginning of the name of one of my GStreamer queue elements, and sig=6 corresponds to SIGABRT. So again, the crash must have happened in _identity_src_buffer_probe.

Problem / Question Summary
To sum up, in the probe _identity_src_buffer_probe I am editing obj_meta.unique_component_id for all objects of all frames. This seems to cause a core dump error every once in a while. Is this the correct way to edit the metadata? I have never had issues adding metadata (e.g. custom postprocessing of a detector model). I looked at the Python examples and it seems I am doing it the correct way, even though there isn’t a specific example on how to edit object metadata.

I am assuming the probe is blocking and that therefore there shouldn’t be two concurrent accesses to the same memory. But chances are that I am wrong, since there is a core dump error.

Attached is the PDF of the pipeline before the crash. I doubt you’ll need it; I am attaching it only to help you understand where I used the identity elements whose probe (_identity_src_buffer_probe) is causing the error.
pipeline-19.pdf (116.4 KB)

I was doing some tests on acquiring / releasing the meta lock.
Is it normal that this stalls in a deadlock?
Code:

import threading
import pyds


def func(batch_meta, thread_number: int):
    print('acquiring lock', thread_number)
    pyds.nvds_acquire_meta_lock(batch_meta)
    print('lock acquired', thread_number)
    pyds.nvds_release_meta_lock(batch_meta)
    print('lock released', thread_number)


batch_meta = pyds.nvds_create_batch_meta(max_batch_size=16)


threads = [threading.Thread(target=func, args=(batch_meta, thread_number)) for thread_number in range(10)]
for thread in threads:
    thread.start()

Output:

acquiring lock 0
lock acquired 0
acquiring lock 1
# deadlock here

Similar code that uses a Python threading lock does not stall:
Code:

import threading


def func(lock, thread_number: int):
    print('acquiring lock', thread_number)
    lock.acquire(blocking=True)
    print('lock acquired', thread_number)
    lock.release()
    print('lock released', thread_number)


lock = threading.Lock()


threads = [threading.Thread(target=func, args=(lock, thread_number)) for thread_number in range(10)]
for thread in threads:
    thread.start()

Output:

acquiring lock 0
acquiring lock 1
acquiring lock 2
lock acquired 1
lock released 1
acquiring lock 3
lock acquired 0
lock released 0
acquiring lock 4
lock acquired 2
lock released 2
lock acquired 4
lock released 4
lock acquired 3
acquiring lock 5
lock released 3
acquiring lock 6
lock acquired 5
lock released 5
acquiring lock 7
lock acquired 6
lock released 6
lock acquired 7
acquiring lock 8
lock released 7
lock acquired 8
acquiring lock 9
lock released 8
lock acquired 9
lock released 9

Note: if you don’t see the deadlock in the first sample, try running it again or increasing the number of threads. If you still don’t see it, replace the func function with this:


def func(batch_meta, thread_number: int):
    print('acquiring lock', thread_number)
    pyds.nvds_acquire_meta_lock(batch_meta)
    print('lock acquired', thread_number)
    pyds.nvds_release_meta_lock(batch_meta)
    print('lock released', thread_number)

EDIT: I noticed something weird. If you run this code:

import threading
import pyds


def func(batch_meta, thread_number: int):
    print('acquiring lock', thread_number)
    pyds.nvds_acquire_meta_lock(batch_meta)
    print('lock acquired', thread_number)
    pyds.nvds_release_meta_lock(batch_meta)
    print('lock released', thread_number)


batch_meta = pyds.nvds_create_batch_meta(max_batch_size=16)


threads = [threading.Thread(target=func, args=(batch_meta, thread_number)) for thread_number in range(10)]
for i, thread in enumerate(threads):
    print('Starting thread', i)
    thread.start()

The output is the following:

Starting thread 0
acquiring lock 0
Starting thread 1
lock acquired 0
lock released 0
acquiring lock 1
Starting thread 2
lock acquired 1
lock released 1
acquiring lock 2
Starting thread 3
lock acquired 2
acquiring lock 3
lock released 2
lock acquired 3
Starting thread 4
lock released 3
acquiring lock 4
Starting thread 5
lock acquired 4
lock released 4
acquiring lock 5
Starting thread 6
lock acquired 5
lock released 5
acquiring lock 6
Starting thread 7
acquiring lock 7
lock acquired 6
Starting thread 8
# deadlock here

According to the output, the weird thing is that not all threads get started.

You may try to update the class-id (instead of the component-id) of the objects from one of the cameras and configure the SGIE to operate only on the class ids of the other source. Or, after the PGIE, use two demux components and then add the corresponding SGIE to each branch of the pipeline.
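For the first option, a rough sketch of what the per-source class-id remapping could look like. The class id value, the source id and the probe name are assumptions for illustration; the SGIE would be configured (e.g. with operate-on-class-ids in its config file) so that it skips the remapped value, and a mirror probe after the SGIE can restore the original class id:

import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst
import pyds

# Any class id the SGIE is not configured to operate on; 100 is arbitrary.
SKIP_CLASS_ID = 100


def remap_class_id_probe(pad, info, u_data):
    gst_buffer = info.get_buffer()
    if not gst_buffer:
        return Gst.PadProbeReturn.OK

    batch_meta = pyds.gst_buffer_get_nvds_batch_meta(hash(gst_buffer))
    l_frame = batch_meta.frame_meta_list
    while l_frame is not None:
        try:
            frame_meta = pyds.NvDsFrameMeta.cast(l_frame.data)
        except StopIteration:
            break

        l_obj = frame_meta.obj_meta_list
        while l_obj is not None:
            try:
                obj_meta = pyds.NvDsObjectMeta.cast(l_obj.data)
            except StopIteration:
                break
            # frame_meta.source_id identifies the camera within the batch;
            # here camera2 is assumed to be source_id 1.
            if frame_meta.source_id == 1:
                obj_meta.class_id = SKIP_CLASS_ID
            try:
                l_obj = l_obj.next
            except StopIteration:
                break

        try:
            l_frame = l_frame.next
        except StopIteration:
            break

    return Gst.PadProbeReturn.OK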

Hi @Fiona.Chen , I ended up updating class-id to run different secondary models on different camera sources. In order to do this, I had to use the class-id property in an unintended way. All the other approaches I tried resulted in a core dump every once in a while.
It would be great if, in the future, you could add an option to run a model only on certain cameras. I guess it shouldn’t be too hard, and it would probably be beneficial for other users.
Thank you!
