Output changes for the same input when the neural net is run several times

Ubuntu 16.04
NVIDIA 1080 ti
driver 384
CUDA 9.0.176
CUDNN 7.1.4
Python 3.5
Tensorflow 1.8
TensorRT 4

We created a neural network, froze it, converted it to UFF, and ran it from C++ (based on sampleUffMNIST.cpp in the TensorRT 4 samples). The output changes for the same input when we run it several times.

The problem occurs even with a simple 4-layer residual network with one output. Basically, given the same uff file, the portion of the C++ code that parses it and builds the network seems to do so differently every time.
After that the code runs fine; I’ve added a loop around execute to show that the outputs stay the same once the uff file has been loaded.
So I suspect something is wrong with the uff file that causes the parser to mis-parse it every time.

Please help me with this.
Can you reproduce the error?
The outputs differ at the 4th decimal place even when run 10 times in a row, and differences at the 4th decimal place cannot be due to rounding errors.
Is anyone familiar with this error? Are there any solutions?
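
For context on the rounding-error claim, here is a minimal sketch (plain Python, standard library only) that estimates the spacing between adjacent float32 values near a given number and expresses a 1e-4 discrepancy in those units. The helper name float32_ulp is made up for this illustration, and the estimate covers a single rounding step for normal (non-denormal) values only; error accumulated across many layers can be larger than one step.

```python
import math

def float32_ulp(x):
    """Approximate spacing between adjacent float32 values near x (normal range only)."""
    if x == 0.0:
        raise ValueError("spacing near zero depends on denormals; not handled here")
    _, exponent = math.frexp(abs(x))   # abs(x) = m * 2**exponent, with 0.5 <= m < 1
    return 2.0 ** (exponent - 24)      # float32 has a 24-bit significand

value = 0.4                            # an output of order 0.4, chosen only for illustration
ulp = float32_ulp(value)
print("float32 spacing near", value, "is about", ulp)
print("a 1e-4 difference is about", round(1e-4 / ulp), "such steps")
```

A difference of thousands of float32 steps is hard to attribute to a single rounding, which is why a 4th-decimal-place discrepancy looks suspicious here.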

The problem increases as we add more layers.

Is it a problem with the TensorRT version, with memory allocation, or with something else?

Has anyone faced a similar issue? Your answers are welcome.

Python code snippet

N = 9
NUM_FILTERS = 32
NUM_INPUT_CHANNELS = 10

board_input2 = tf.placeholder(tf.float32, shape=(NUM_INPUT_CHANNELS, N, N), name='input_node')
board_input3 = tf.reshape(board_input2, [N, N, NUM_INPUT_CHANNELS])

h11 = convLayer(board_input3, fill_conv4_matrix(NUM_INPUT_CHANNELS, NUM_FILTERS, 3, L1W), layer_num=0)

value_output = tf.sigmoid(tf.reshape(h11, [N*N*32]), name='v_output')

// C++ code snippet

parser->registerInput("input_node", Dims3(10, 9, 9), UffInputOrder::kNCHW);
parser->registerOutput("v_output");

Hello,

This is not expected. abaranappu, can you provide the uff file and the uff.txt file used for this experiment? What input data was given to the C++ program? Did it stay constant?

thanks

Thank you very much for replying.

The uff, text, and C++ files are here: https://github.com/hdpoorna/nv_forum
They are also attached as a zip file.

simple.uff has one output; full.uff has two outputs.
Both give inconsistent outputs for the same input. In the case of full.uff, it is always the second output that has errors, and rarely the first. The error happens even if we switch the order of the two outputs.
files.zip (215 KB)

Thank you for the information. Will keep you updated on what we find.

Hello, we are unable to repro this issue. We tried running simple.uff around 10 times and saw the same output all 10 times. For full.uff we only see some difference in the 7th or 8th decimal places.

Can you re-run with info level logging (kINFO) and compare the logs between two runs with differing outputs?
I don’t think this is a parser issue because the parser will generate the same network given the same uff file.
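
One way to compare the logs from two runs is a plain text diff. The sketch below uses Python's standard difflib module; the function name diff_logs, the file names, and the sample lines are placeholders invented for illustration and are not real TensorRT log output.

```python
import difflib

def diff_logs(lines_a, lines_b, name_a="run_a.log", name_b="run_b.log"):
    """Return unified-diff lines between two logs given as lists of strings."""
    return list(difflib.unified_diff(lines_a, lines_b,
                                     fromfile=name_a, tofile=name_b, lineterm=""))

# Placeholder text, NOT real TensorRT log output.
run_a = ["step 1: choice X", "step 2: choice Y"]
run_b = ["step 1: choice X", "step 2: choice Z"]

for line in diff_logs(run_a, run_b):
    print(line)
```

For real log files, each list could be read with open(path).read().splitlines(). An empty result means the two logs are identical line for line.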

It could be that different consecutive runs select different kernels, which may cause minor differences in the outputs.

Can you tell me to what degree your outputs vary?
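
A simple way to put a number on "to what degree" is the largest element-wise absolute difference between the outputs of two runs. This is a minimal sketch in plain Python; the function name max_abs_diff and the sample values are made up for illustration and are not taken from the attached results.

```python
def max_abs_diff(run_a, run_b):
    """Largest element-wise absolute difference between two equal-length outputs."""
    if len(run_a) != len(run_b):
        raise ValueError("outputs must have the same length")
    if not run_a:
        return 0.0
    return max(abs(a - b) for a, b in zip(run_a, run_b))

# Hypothetical outputs from two runs on the same input.
first_run = [0.5000001, 0.25]
second_run = [0.5000002, 0.2501]

print("max abs difference:", max_abs_diff(first_run, second_run))
```

Reporting this value per output tensor makes it easy to tell 7th-decimal-place noise from a 4th-decimal-place discrepancy.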

The second output, v_output (batch size 2, dim 1), is constantly changing, but p_output (batch size 2, dim 82) is not.

Is uff version 0.3.0 OK?

The results are in the attached zip file:
full_cpp.txt - results from C++ file
full_py.txt - result from running the same thing on python
Uff_mcts.cpp - C++ file

I will also post the results with the outputs swapped when we create the uff file.

Also, could you help me enable kINFO logging?
results.zip (6.66 KB)

The results with the outputs swapped when creating the uff file are attached here.

Please note that the 2nd output is printed 1st.

We changed the line as follows:
output_node_names = ['p_output', 'v_output'] -----> output_node_names = ['v_output', 'p_output']

before passing it to:
uff_model = uff.from_tensorflow(final_output_graph_def,
                                output_node_names,
                                intput_node='input_node',
                                output_filename="models/full.uff",
                                text=True)
outputs_swapped.zip (3.53 KB)

Hi, I’m abaranappu’s supervisor, and thanks for taking the time to help us resolve this.

The second output, which is changing rapidly, refers to the output assigned to buffer 2, whereas the first output refers to that in buffer 1.

To further complicate the issue, the first output is also not consistent, but one will need to run around 30-40 times, with the same input, before seeing problems.

The neural network is a simple residual CNN (4 blocks) with two heads, one with an MLP layer leading to ‘p_output’, the other with another MLP layer leading to ‘v_output’.

Please help.

Hello,

We tried running the experiment with full.uff around 20 times and saw the differences in v_output. The info level log doesn’t show any difference in the layers being run, so this is not a parser issue. The difference needs to be investigated further.

Regarding your questions:
Is uff version 0.3.0 ok?
UFF 0.4.0 is the version shipped with TRT 4.0 GA, so that is recommended (although I don’t see this as a UFF issue).
Can you share documentation on how to enable info level logging (kINFO)?
To turn on INFO level logging, change the severity of the logger object from kWARNING to kINFO.

Thank you so very much for helping.
Please help us figure out the issue.

Hello,

Quick update. We are making progress debugging this issue. One thing that will help us a lot is a figure or description of what your network looks like: specifically, what each layer is doing, the kernel sizes, connections, etc.

thanks.

NVIDIA Enterprise Support

Here is the network part of the code. The whole code and weights are attached in a zip file.

board_input2 = tf.placeholder(tf.float32, shape=(None, NUM_INPUT_CHANNELS, N, N), name='input_node')

layers = []
layers.append(convLayer(board_input2, fill_conv4_matrix(NUM_INPUT_CHANNELS, NUM_FILTERS, 3, L1W), layer_num=0))

for i in range(1, NUM_RES_BLOCKS + 1):    
    layers.append(res_block(layers[i-1], fill_4R_matrix(NUM_FILTERS, NUM_FILTERS, 3, WA[i-1]), fill_4R_matrix(NUM_FILTERS, NUM_FILTERS, 3, WB[i-1]), layer_num=i))

# value
value_head = convLayer(layers[NUM_RES_BLOCKS], fill_4R_matrix(NUM_FILTERS, 4, 3, vh_W), layer_num=1)
    
value_inv_output_logits = logits(tf.reshape(value_head, [-1, N*N*4]), fill_inv_matrix(64, vh_HW), vh_Hb, 'vh')

hidden_layer_output = tf.nn.relu(value_inv_output_logits)

value_output = tf.sigmoid(tf.matmul(hidden_layer_output, vh_sigL_W) + vh_sigL_b, name='v_output')             
    
# policy
policy_head = convLayer(layers[NUM_RES_BLOCKS], fill_4R_matrix(NUM_FILTERS, 4, 3, ph_W), layer_num=2)
       
pass_output = logits(tf.reshape(policy_head, [-1, N*N*4]), fill_inv_matrix(1, ph_pass_W), ph_pass_b, 'ph')

equiv_output_h1 = tf.reduce_mean(tf.nn.conv2d(policy_head, fill_4R_matrix(4, 4, 2*N - 1, ph_rest_W), strides=[1, 1, 1, 1], padding='SAME', data_format='NCHW'), 1)
equiv_output = tf.reshape(equiv_output_h1 + ph_rest_b, [-1, N*N])

policy_output_logits = tf.concat([equiv_output, pass_output], 1)

policy_output = tf.nn.softmax(policy_output_logits, name='p_output')

net.zip (196 KB)

Hello abaranappu,

There was a bug in the initialization of the input buffer in the user’s code. Some of the values were uninitialized, thus having garbage values.

We added these lines in createMnistCudaBuffer to initialize the buffer correctly:

for (int i = 0; i < eltCount; ++i)
    inputs[i] = 0.0f;

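Another bug was found in the user’s code.
This line:
buffers.reserve(nbBindings);
should be:
buffers.resize(nbBindings);
For a std::vector, reserve only allocates capacity and leaves size() at 0, so writing to buffers[i] afterwards is undefined behavior; resize actually creates the nbBindings elements.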

The input values are now fixed and unchanging. The layers of the engine are unchanged too now. The 2 V outputs are approximately equal, but their values change in some executions:

Iteration 1:
0 => 0.999999 : ***
1 => 0.999999 :
Iteration 2:
0 => 0.397986 : ***
1 => 0.39674 :
Iteration 3:
0 => 0.395557 : ***
1 => 0.394716 :
Iteration 4:
0 => 0.395557 : ***
1 => 0.394716 :
Iteration 5:
0 => 1 : ***
1 => 1 :
Iteration 6:
0 => 0.397986 : ***
1 => 0.39674 :
Iteration 7:
0 => 0.518526 : ***
1 => 0.513405 :

You mentioned your implementation is derived from sampleUffMNIST, and we are able to repeatedly run ./sample_uff_mnist with consistent outputs.

I recommend reviewing the user code.

Hi NVES,
Thanks for taking the time to check this. We have fixed the initialization, but the bug persists, and we can prove that TensorRT is optimizing at the expense of accuracy.

We’ve changed the batch size to 1 and the input to all zeros to simplify things. We’ve also added a loop around execute

for (int i = 0; i < 50; i++)
{
    execute(*engine);
}

to show that the output does not change within a run, but rather between runs, so something different is happening each time the model is parsed and the engine is created.
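
The within-run versus between-run distinction can be checked mechanically. The sketch below is plain Python with a made-up helper name, summarize_runs, and assumes every run is a non-empty list holding the same scalar output sampled once per execute call; the two values are the v_output results reported in this post.

```python
def summarize_runs(runs, tol=0.0):
    """Return (stable_within_runs, stable_between_runs) for a list of non-empty runs.

    Each run is a list of the same scalar output sampled repeatedly in one process.
    """
    within = all(max(r) - min(r) <= tol for r in runs)
    firsts = [r[0] for r in runs]
    between = max(firsts) - min(firsts) <= tol
    return within, between

# v_output as reported in this post: constant over the 50 execute calls of a run,
# but 1 in the first run and 0.384092 in the second.
run_1 = [1.0] * 50
run_2 = [0.384092] * 50
print(summarize_runs([run_1, run_2]))   # (True, False)
```

A result of (True, False) matches what we see: the engine is deterministic once built, but two builds from the same uff file disagree.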

Below is an example of different outputs from two different runs; note that each output is repeated 50 times in each run. We’ve attached the logs for both runs.

Output from first run:

— OUTPUT —
0 => 0 :
1 => 0 :
2 => 0 :
3 => 0 :
4 => 0 :
5 => 0 :
6 => 0 :
7 => 0 :
8 => 0 :
9 => 0 :
10 => 0 :
11 => 0 :
12 => 0 :
13 => 0 :
14 => 0 :
15 => 0 :
16 => 0 :
17 => 0 :
18 => 0 :
19 => 0 :
20 => 0 :
21 => 0 :
22 => 0 :
23 => 0 :
24 => 0 :
25 => 0 :
26 => 0 :
27 => 0 :
28 => 0 :
29 => 0 :
30 => 0 :
31 => 0 :
32 => 0 :
33 => 0 :
34 => 0 :
35 => 0 :
36 => 0 :
37 => 0 :
38 => 0 :
39 => 0 :
40 => 0 :
41 => 0 :
42 => 0 :
43 => 0 :
44 => 0 :
45 => 0 :
46 => 0 :
47 => 0 :
48 => 0 :
49 => 0 :
50 => 0 :
51 => 0 :
52 => 0 :
53 => 0 :
54 => 1 : ***
55 => 0 :
56 => 0 :
57 => 0 :
58 => 0 :
59 => 0 :
60 => 0 :
61 => 0 :
62 => 0 :
63 => 0 :
64 => 0 :
65 => 0 :
66 => 0 :
67 => 0 :
68 => 0 :
69 => 0 :
70 => 0 :
71 => 0 :
72 => 0 :
73 => 0 :
74 => 0 :
75 => 0 :
76 => 0 :
77 => 0 :
78 => 0 :
79 => 0 :
80 => 0 :
81 => 0 :

1 eltCount
— OUTPUT —
0 => 1 : ***

Output from second run:

— OUTPUT —
0 => 0.0122097 : ***
1 => 0.0122097 :
2 => 0.0122097 :
3 => 0.0122097 :
4 => 0.0122097 :
5 => 0.0122097 :
6 => 0.0122097 :
7 => 0.0122097 :
8 => 0.0122097 :
9 => 0.0122097 :
10 => 0.0122097 :
11 => 0.0122097 :
12 => 0.0122097 :
13 => 0.0122097 :
14 => 0.0122097 :
15 => 0.0122097 :
16 => 0.0122097 :
17 => 0.0122097 :
18 => 0.0122097 :
19 => 0.0122097 :
20 => 0.0122097 :
21 => 0.0122097 :
22 => 0.0122097 :
23 => 0.0122097 :
24 => 0.0122097 :
25 => 0.0122097 :
26 => 0.0122097 :
27 => 0.0122097 :
28 => 0.0122097 :
29 => 0.0122097 :
30 => 0.0122097 :
31 => 0.0122097 :
32 => 0.0122097 :
33 => 0.0122097 :
34 => 0.0122097 :
35 => 0.0122097 :
36 => 0.0122097 :
37 => 0.0122097 :
38 => 0.0122097 :
39 => 0.0122097 :
40 => 0.0122097 :
41 => 0.0122097 :
42 => 0.0122097 :
43 => 0.0122097 :
44 => 0.0122097 :
45 => 0.0122097 :
46 => 0.0122097 :
47 => 0.0122097 :
48 => 0.0122097 :
49 => 0.0122097 :
50 => 0.0122097 :
51 => 0.0122097 :
52 => 0.0122097 :
53 => 0.0122097 :
54 => 0.0122097 :
55 => 0.0122097 :
56 => 0.0122097 :
57 => 0.0122097 :
58 => 0.0122097 :
59 => 0.0122097 :
60 => 0.0122097 :
61 => 0.0122097 :
62 => 0.0122097 :
63 => 0.0122097 :
64 => 0.0122097 :
65 => 0.0122097 :
66 => 0.0122097 :
67 => 0.0122097 :
68 => 0.0122097 :
69 => 0.0122097 :
70 => 0.0122097 :
71 => 0.0122097 :
72 => 0.0122097 :
73 => 0.0122097 :
74 => 0.0122097 :
75 => 0.0122097 :
76 => 0.0122097 :
77 => 0.0122097 :
78 => 0.0122097 :
79 => 0.0122097 :
80 => 0.0122097 :
81 => 0.0110169 :

1 eltCount
— OUTPUT —
0 => 0.384092 : ***

files.zip (135 KB)

Hello,

TRT was not correctly handling the horizontal merge of layers without bias weights. We will fix this in a future version.

In the meantime, for TRT 4.0, users cannot use a convolution layer with no bias. Instead, bias values of 0.0f should be used for such layers. This has the same effect as no bias, but will get around the bug in TRT 4.0.
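
To illustrate why a zero bias is numerically the same as no bias, here is a small framework-free sketch in plain Python. The function conv1d_valid and the sample values are invented for this illustration; it only demonstrates the arithmetic and says nothing about how TensorRT merges or optimizes layers.

```python
def conv1d_valid(signal, kernel, bias=None):
    """1-D 'valid' cross-correlation; adds bias to every output element when given."""
    n_out = len(signal) - len(kernel) + 1
    out = []
    for i in range(n_out):
        acc = sum(signal[i + j] * kernel[j] for j in range(len(kernel)))
        if bias is not None:
            acc = acc + bias
        out.append(acc)
    return out

signal = [0.5, -1.25, 2.0, 0.75, -0.5]   # arbitrary illustrative values
kernel = [1.0, 0.5, -0.25]

no_bias = conv1d_valid(signal, kernel)
zero_bias = conv1d_valid(signal, kernel, bias=0.0)
print(no_bias == zero_bias)   # True
```

Because adding 0.0 leaves every finite floating-point value unchanged, the two results compare equal element by element.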

We are sorry for any inconvenience this is causing. We cannot share more information about future releases here.
Please watch for our announcements for more information.

Hi NVES,
Thanks for this information. We’ve added zero biases for all convolutional layers and we’ve vastly simplified the model so that it is almost entirely convolutional. However, the bug still resurfaces. Can you tell us which other layers are not supported? The model is specified below, and we’ve attached output logs.

def convLayer(input, W_val, layer_num, n_out):
    W = tf.get_variable('cW' + str(layer_num), initializer=W_val)
    h1 = tf.nn.conv2d(input, W, strides=[1, 1, 1, 1], padding='SAME')
    b = tf.get_variable("cb" + str(layer_num), [n_out], initializer=tf.zeros_initializer())
    # As written, the result of bias_add is not assigned, so h1 is returned without the bias op.
    tf.nn.bias_add(h1, b)

    return tf.nn.relu(h1)

def logits(input, W_val, b_val, name):
    W = tf.get_variable(name + '_W', initializer=W_val)
    b = tf.get_variable(name + '_b', initializer=b_val)

    return tf.nn.bias_add(tf.matmul(input, W_val), b)

def res_block(input, WA_val, WB_val, layer_num, n_out):
    WA = tf.get_variable('rWA' + str(layer_num), initializer=WA_val)
    h1 = tf.nn.conv2d(input, WA, strides=[1, 1, 1, 1], padding='SAME')
    b1 = tf.get_variable("rbA" + str(layer_num), [n_out], initializer=tf.zeros_initializer())
    tf.nn.bias_add(h1, b1)  # result not assigned, as above

    inter = tf.nn.relu(h1)

    WB = tf.get_variable('rWB' + str(layer_num), initializer=WB_val)
    h1b = tf.nn.conv2d(inter, WB, strides=[1, 1, 1, 1], padding='SAME')
    b2 = tf.get_variable("rbB" + str(layer_num), [n_out], initializer=tf.zeros_initializer())
    tf.nn.bias_add(h1b, b2)  # result not assigned, as above

    return tf.add(tf.nn.relu(h1b), input)

board_input2 = tf.placeholder(tf.float32, shape=(None, N, N, NUM_INPUT_CHANNELS), name='input_node')

layers = []
layers.append(convLayer(board_input2, L1W, layer_num=0, n_out=NUM_FILTERS))

for i in range(1, NUM_RES_BLOCKS + 1):
    layers.append(res_block(layers[i-1], WA[i-1], WB[i-1], layer_num=i, n_out=NUM_FILTERS))

value_head = convLayer(layers[NUM_RES_BLOCKS], vh_W, layer_num=1, n_out=4)
value_output = tf.sigmoid(value_head, name='v_output')

policy_head = convLayer(layers[NUM_RES_BLOCKS], ph_W, layer_num=2, n_out=4)
pass_output = logits(tf.reshape(policy_head, [-1, N*N*4]), ph_pass_W, ph_pass_b, 'ph')
policy_output = tf.nn.softmax(pass_output, name='p_output')
files2.zip (98.8 KB)

Hello,

Unfortunately, we don’t think there’s a workaround with TRT 4. Bias values of zero have no effect on the convolution layer output. TRT removes those bias values. Adding zero biases to the network was only meant as a temporary workaround, and it looks like it doesn’t work.

We do have a fix in a future release of TRT.

Sorry again that we cannot share more information about future releases here.
Please watch for our announcements for more information.

Hi,

Ubuntu 16.04
NVIDIA GeForce RTX 2080 Ti
NVIDIA driver 410.66
CUDA 10.0.130
CUDNN 7.3.1
Python 3.5
Tensorflow 1.12.0-rc1
TensorRT 5.0.0.10 RC

The bug is still there.
The outputs are attached.
We also switched the outputs to make this clear.
outputs_trt5.rtf (28.4 KB)

Hello,

My apologies, the fix did not go into TRT 5.0.0.10. The fix is committed for TRT 5 GA. Please stay tuned for the release announcement.