Demultiplexing data from a stream file

I’m currently working on a project that consists of demultiplexing data from a stream.

People who need to retrieve information from the stream have the following elements:

  • the stream file
  • an XML file describing the data structures present in the stream: structure names, elements present in each structure (name and type), the number of bits to read to obtain each element, the way to read the data (LE or BE), and the encoding type (bin, hex, dec)

I have to run a benchmark to determine whether GPGPU can give us better performance.
So I coded several test versions in C: single-threaded and multi-threaded.

To do this, I extracted all the data from the XML file and created a “dictionary” (.h file).
This dictionary contains the following elements:

  • Data extraction functions:
    • void extract_bits_8(void* pParam1, void* pParam2, void* pBuffer, void* pDest) // Returns extracted bits as an unsigned char (8 bits)
    • void extract_bits_16(void* pParam1, void* pParam2, void* pBuffer, void* pDest) // Returns extracted bits as an unsigned long (16 bits)

    Here is a description of the parameters :
    // @param[in] pParam1 = bitStart - Start data extraction from this bit
    // @param[in] pParam2 = bitLen - Number of bits to read
    // @param[in] pBuffer - Stream to read
    // @param[in/out] pDest - Destination pointer, to write extracted bits

  • Arrays describing each element present in the structure:
    • Position in bits of the element
    • Number of bits to read
    • Offset in the output structure to store decoded data
    • Conversion function to apply
    • Debug info: full name (Structure_X:Element_Y)

    Here is a sample:

    BitDesc Structure_X_BitDesc[] =
    {
        …
        { 48, 32, offsetof(Structure_X, Element_Y), extract_bits_32, "Structure_X::Element_Y" },
    };

  • Global variables to store decoded data
    Structure_X g_structure_x;

  • An array describing all possible structures:
    • Structure name
    • Structure size in bytes
    • Number of elements present in this structure
    • Link to the array describing the structure and each of its elements (Structure_X_BitDesc)
    • Link to the global variable in which to store decoded data

    Here is a sample:
    const PacketType LIST_TYPES[] =
    {
        …
        { "Structure_X", 40, 14, Structure_X_BitDesc, &g_structure_x },
    };
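To make the dictionary idea concrete, here is a minimal single-threaded sketch of how such a table could drive the decoding. Only the extraction function signature comes from the description above. The `BitDesc` field names, the helper `read_bits_be`, the demo type `Structure_Demo` and the function `decode_structure` are hypothetical. The sketch also assumes that bit 0 is the most significant bit of the first byte, and that `pParam1` and `pParam2` point to `unsigned int` values.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Signature as described in the post; everything else here is assumed. */
typedef void (*ExtractFn)(void *pParam1, void *pParam2, void *pBuffer, void *pDest);

/* Hypothetical layout of one dictionary entry. */
typedef struct {
    unsigned int bitStart;   /* position in bits of the element */
    unsigned int bitLen;     /* number of bits to read */
    size_t       destOffset; /* offset in the output structure */
    ExtractFn    extract;    /* conversion function to apply */
    const char  *debugName;  /* debug info */
} BitDesc;

/* Read bitLen (<= 32) bits starting at bitStart, most significant bit first. */
static uint32_t read_bits_be(const uint8_t *buf, unsigned int bitStart, unsigned int bitLen)
{
    uint32_t value = 0;
    for (unsigned int i = 0; i < bitLen; ++i) {
        unsigned int bit = bitStart + i;
        unsigned int b = (buf[bit / 8] >> (7 - bit % 8)) & 1u;
        value = (value << 1) | b;
    }
    return value;
}

void extract_bits_8(void *pParam1, void *pParam2, void *pBuffer, void *pDest)
{
    uint8_t v = (uint8_t)read_bits_be(pBuffer, *(unsigned int *)pParam1,
                                      *(unsigned int *)pParam2);
    memcpy(pDest, &v, sizeof v);
}

void extract_bits_16(void *pParam1, void *pParam2, void *pBuffer, void *pDest)
{
    uint16_t v = (uint16_t)read_bits_be(pBuffer, *(unsigned int *)pParam1,
                                        *(unsigned int *)pParam2);
    memcpy(pDest, &v, sizeof v);
}

/* Hypothetical output structure, standing in for Structure_X. */
typedef struct {
    uint8_t  Element_A;
    uint16_t Element_B;
} Structure_Demo;

BitDesc Structure_Demo_BitDesc[] = {
    { 4, 4,  offsetof(Structure_Demo, Element_A), extract_bits_8,  "Structure_Demo::Element_A" },
    { 8, 16, offsetof(Structure_Demo, Element_B), extract_bits_16, "Structure_Demo::Element_B" },
};

/* Decode every element of one structure from the stream buffer into out. */
void decode_structure(BitDesc *desc, size_t count, void *stream, void *out)
{
    for (size_t i = 0; i < count; ++i) {
        desc[i].extract(&desc[i].bitStart, &desc[i].bitLen, stream,
                        (char *)out + desc[i].destOffset);
    }
}
```

With this layout, decoding a structure is a single loop over its descriptor array, and each iteration is independent of the others.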

Up to this point, it was “easy”.
The stream is always composed of the same packets, which always appear in the same order: debug info1, header, technical header, technical data, technical data error

Depending on an element in the header packet, I can determine the right “technical header” structure (which could be Structure_X, Structure_Y, …).
This special element is an enumeration of possible values. It’s my “conditional element”.

  • So I created a Packet array:
    • Pointer to the “conditional element”. NULL if there is no condition
    • Pointer to a list containing the association “value - structure to read”
    • First possible value of the conditional element
    • Number of elements in the list of possible values

    Here is a sample:
    const Packet LIST_PACKET[] =
    {
        {NULL, (void*)&LIST_TYPES[1], 0, 0},
        {&, &g_Condition_Value_Structure_X, 80, 8},
    };

  • Arrays containing the association “value - structure to read”:
    Here is a sample:
    void* g_Condition_Value_Structure_X[] =
    {
        (void*)&LIST_TYPES[3], // Value 80. Link to “Structure_X”
        …
    };
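The lookup implied by this Packet array can be sketched as follows. This is only an illustration under assumptions: the reduced `PacketType` and `Packet` types, their field names and the function `resolve_packet_type` are hypothetical, and the sketch assumes that the possible values are contiguous, so that the list index is the conditional value minus the first possible value.

```c
#include <stddef.h>

/* Hypothetical, reduced version of the structure description. */
typedef struct {
    const char *name; /* structure name */
} PacketType;

/* Hypothetical layout of one Packet entry. */
typedef struct {
    const int  *pCondition; /* pointer to the "conditional element", NULL if none */
    const void *pTarget;    /* a PacketType if no condition, else a list of PacketType pointers */
    int         firstValue; /* first possible value of the conditional element */
    int         valueCount; /* number of entries in the list */
} Packet;

/* Return the structure to read for this packet, or NULL if the value is unknown. */
const PacketType *resolve_packet_type(const Packet *p)
{
    if (p->pCondition == NULL)
        return (const PacketType *)p->pTarget; /* unconditional packet */

    int index = *p->pCondition - p->firstValue;
    if (index < 0 || index >= p->valueCount)
        return NULL; /* value outside the known enumeration */

    const PacketType *const *list = (const PacketType *const *)p->pTarget;
    return list[index];
}
```

The range check matters when the stream can contain a value that the XML description does not list. Without it, the lookup would read past the end of the association array.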

Well, I’ve written my C code to do this :

  • create a pool of threads
  • read data from the stream and store the content read into a buffer accessible to all threads
  • loop (for) over each element of the LIST_PACKET array
  • test whether there is a conditional element, and get the link to the array describing each element present in the structure -> Structure_X_BitDesc
  • loop over each element of this array (Structure_X_BitDesc)
    • Here threads are waiting to resume
    • Publish, in a structure common to all threads, the data present in the BitDesc structure (position in bits, number of bits, offset in the output structure, conversion function to apply)
    • Resume threads
    • Wait until the threads report that they have decoded all packet elements

Here is what threads do :

  • enter an infinite loop until they receive a specific signal
  • depending on its index, each thread decodes specific elements (for example, with an array of 15 elements and 3 threads, thread 0 decodes elements 0, 3, 6, …). With this strategy I don’t need a read lock
  • loop: while there is data to decode, do the job; otherwise send a signal indicating that the work is finished
    • if there is data to decode, read common data, and apply the decode function with all these parameters
  • wait to resume
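The index-based work split described above can be sketched in plain C, without the thread pool or the signalling. The names `DecodeElementFn`, `decode_strided` and `mark_decoded` are hypothetical. In the real code the callback would apply the conversion function from the BitDesc entry instead of just counting visits.

```c
#include <stddef.h>

/* Hypothetical per-element callback; stands in for the BitDesc conversion call. */
typedef void (*DecodeElementFn)(size_t elementIndex, void *context);

/* Work done by one thread: elements threadIndex, threadIndex + threadCount, ...
 * Returns the number of elements this thread decoded. */
size_t decode_strided(size_t threadIndex, size_t threadCount, size_t elementCount,
                      DecodeElementFn decode, void *context)
{
    size_t done = 0;
    for (size_t i = threadIndex; i < elementCount; i += threadCount) {
        decode(i, context);
        ++done;
    }
    return done;
}

/* Example callback: counts how many times each element index was visited. */
void mark_decoded(size_t elementIndex, void *context)
{
    unsigned char *seen = context;
    seen[elementIndex] += 1;
}
```

With 15 elements and 3 threads, calling `decode_strided` once for each thread index 0, 1 and 2 visits every element exactly once. Because no two threads ever handle the same element index, they never write to the same output field.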

So now I have to write a CUDA version based on the multithreaded C version.
Do you think CUDA can help me obtain better performance?
Can you give me advice on how to write my code, especially regarding memory management?

Thanks in advance for your help.