Streaming audio data format question

I converted int16 audio samples to float32 bytes (in Python), but when I stream the 4-byte float32 data to the Audio2Face tool, it does not work as expected.
My question is: what is the exact stream data format? The comments in the code only say: “audio_data: bytes, containing audio data for the whole track, where each sample is encoded as 4 bytes (float32)”.
I binary-compared my sound files: the sample values are “equal” and their order is the same; they differ only in data type, one int16 and the other float32. (I also tried normalizing the float32 data, but that didn’t work.)
The attached zip contains 2 files: sound_tts_sf.wav, which is int16 and plays fine, and another file that never worked correctly.
Can anyone give me any suggestions?

samples.zip (25.6 KB)

Hi

I have successfully streamed audio to A2F. You need to use the gRPC API.
There are two approaches: chunk by chunk, or long-running streaming.
Either way works. The audio data supplied to the stream must be mono, in AV_SAMPLE_FMT_FLTP format (float, planar), and the sample rate must be specified in the gRPC request.
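
As a rough illustration of that format, here is a minimal sketch of turning int16 PCM bytes into float32 bytes. With a single channel, planar and interleaved layouts are identical, so a mono AV_SAMPLE_FMT_FLTP buffer is just consecutive float32 samples. The function name int16_pcm_to_float32_bytes is made up for this example, and two things are assumptions rather than anything confirmed in this thread: that samples should be scaled to the nominal float range of [-1.0, 1.0], and that the bytes are little-endian.

```python
import struct


def int16_pcm_to_float32_bytes(pcm_bytes: bytes) -> bytes:
    """Convert little-endian int16 PCM bytes to little-endian float32 bytes.

    Assumption: the receiver expects samples scaled to [-1.0, 1.0).
    """
    count = len(pcm_bytes) // 2  # number of whole int16 samples
    samples = struct.unpack(f"<{count}h", pcm_bytes[: count * 2])
    # Divide by 32768 so that int16 full scale maps onto [-1.0, 1.0).
    return struct.pack(f"<{count}f", *(s / 32768.0 for s in samples))
```

Casting the int16 values to float32 without dividing leaves them in the range -32768 to 32767, which is one plausible cause of heavily distorted playback. For long clips a NumPy version of the same scaling would be faster.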

syntax = "proto3";

package nvidia.audio2face;

service Audio2Face {
    rpc PushAudio(PushAudioRequest) returns (PushAudioResponse) {}
    rpc PushAudioStream(stream PushAudioStreamRequest) returns (PushAudioStreamResponse) {}
}

message PushAudioRequest {
    string instance_name = 1;
    int32 samplerate = 2;
    bytes audio_data = 3;
    bool block_until_playback_is_finished = 4;
}

message PushAudioResponse {
    bool success = 1;
    string message = 2;
}

message PushAudioStreamRequest {
    oneof streaming_request {
        PushAudioRequestStart start_marker = 1;
        bytes audio_data = 2;
    }
}

message PushAudioRequestStart {
    string instance_name = 1;
    int32 samplerate = 2;
    bool block_until_playback_is_finished = 3;
}

message PushAudioStreamResponse {
    bool success = 1;
    string message = 2;
}
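
For the streaming RPC, the float32 bytes have to be cut into pieces that each hold a whole number of 4-byte samples. The sketch below shows one way to do that. The helper name iter_audio_chunks and the 0.1 second chunk length are arbitrary choices for illustration. The commented part assumes stubs generated from the proto above under the module name audio2face_pb2, and assumes the start_marker message is sent before any audio_data message.

```python
from typing import Iterator

BYTES_PER_SAMPLE = 4  # one float32 sample


def iter_audio_chunks(
    audio_data: bytes, samplerate: int, chunk_seconds: float = 0.1
) -> Iterator[bytes]:
    """Yield consecutive slices of audio_data aligned to float32 samples."""
    samples_per_chunk = max(1, round(samplerate * chunk_seconds))
    step = samples_per_chunk * BYTES_PER_SAMPLE
    for offset in range(0, len(audio_data), step):
        yield audio_data[offset : offset + step]


# Hypothetical use with generated stubs (module name is an assumption):
#
# def make_requests(audio_data, samplerate, instance_name):
#     start = audio2face_pb2.PushAudioRequestStart(
#         instance_name=instance_name,
#         samplerate=samplerate,
#         block_until_playback_is_finished=False,
#     )
#     yield audio2face_pb2.PushAudioStreamRequest(start_marker=start)
#     for chunk in iter_audio_chunks(audio_data, samplerate):
#         yield audio2face_pb2.PushAudioStreamRequest(audio_data=chunk)
```

Keeping every chunk a multiple of 4 bytes means a sample is never split between two messages.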

I use the test_client.py demo code with the sample rate set to 24000 (24 kHz in my case), and the sound always plays back distorted and noisy.
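
One way to narrow down distortion like this is to confirm what the source file actually contains before sending it, since a mismatch in channel count, sample width, or sample rate would each garble playback. A minimal sketch using Python's standard wave module, which handles integer PCM files such as an int16 WAV; the function name describe_wav is made up for this example.

```python
import wave


def describe_wav(path: str) -> dict:
    """Return the basic header parameters of a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "samplerate": wav.getframerate(),
            "frames": wav.getnframes(),
        }
```

If the file reports 2 channels or a sample width other than 2 bytes, the raw frames would need downmixing or a different conversion before being sent as mono float32, and the samplerate value should match the one passed in the gRPC request.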