Significant performance issues with AddTensor

When adding two 2D matrices of identical size, permuting the tensor dimensions, for example with

let sw (n,c,h,w) = (c,h,w,n) // F# function

can speed up the addition by as much as 30 times. In the library I am writing, AddTensor is also a significant source of overhead: a matrix-vector broadcast addition takes three quarters as long as the matrix-matrix multiply in a feedforward net. I have just upgraded from v3 to v4 and can confirm that the issue is present in both versions.
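To make the workaround concrete, here is a minimal sketch of how the permutation changes the dimensions passed to a tensor descriptor. It assumes a rows x cols matrix is described as (rows, cols, 1, 1); the post does not say which mapping is actually used, and the names elementCount, original and shuffled are hypothetical. Because both operands have the same size and the addition is element-wise, relabelling a packed buffer this way should not change the result, only how the library walks the data.

```fsharp
// Dimension shuffle from the post: (n, c, h, w) -> (c, h, w, n).
let sw (n, c, h, w) = (c, h, w, n)

// Hypothetical helper: total number of elements described by the dims.
let elementCount (n, c, h, w) = n * c * h * w

// Assumption: a 128 x 784 matrix described as (rows, cols, 1, 1).
let original = (128, 784, 1, 1)

// After shuffling, the same buffer is described as (784, 1, 1, 128).
let shuffled = sw original

printfn "%A -> %A" original shuffled
// The shuffle only relabels dimensions; the element count is unchanged.
printfn "same element count: %b" (elementCount original = elementCount shuffled)
```

The same shuffled dimensions would have to be used for both operands. Whether the trick carries over to the broadcast (matrix-vector) case depends on which dimension is being broadcast, so it needs to be checked separately.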

It would be great if the algorithm chose a more efficient execution strategy on its own. As it stands, I will have to resort to shuffling dimensions just to get acceptable performance in the feedforward case. I have not tested whether shuffling also improves performance for true 4D tensors.