I’m currently running some tests with a fixed amount of data per stream. That is, when I increase the number of streams I also increase the total amount of data being processed (captain obvious). I’m not splitting a fixed dataset into more chunks as the number of streams increases, which would perhaps be the more common setup here.
Before doing this implementation I read some whitepapers by others who had done the same; they reported that 8 streams was the magic number where they got the best performance (without having figured out why).
When I compare transferring the whole data block, computing, and sending the results back versus streaming, I also get the best performance with 8 streams; in fact, that’s where I get a near 2x relative speedup.
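For reference, here’s the back-of-envelope pipeline model I’ve been using to think about this. It is only a sketch under assumed conditions: each stream handles one fixed-size chunk with times h (host-to-device copy), k (kernel), and d (device-to-host copy), the device can overlap copies with compute (dual copy engines), and I’ve assumed kernel time roughly equals total copy time (k = h + d). None of these numbers are measured; they’re hypothetical stand-ins.

```python
def serial_time(s, h, k, d):
    # No streams: copy all chunks in, compute them all, copy all results out.
    # Nothing overlaps, so total time is just the sum over all s chunks.
    return s * (h + k + d)

def pipelined_time(s, h, k, d):
    # Classic pipeline bound: one full pass through the stages (fill + drain)
    # plus (s - 1) iterations of the slowest stage, which is the bottleneck.
    return h + k + d + (s - 1) * max(h, k, d)

# Assumed (not measured) per-chunk times: h = d = 1, k = 2, i.e. k = h + d.
for s in (1, 2, 4, 8, 16, 32):
    speedup = serial_time(s, 1, 2, 1) / pipelined_time(s, 1, 2, 1)
    print(f"{s:2d} streams: speedup {speedup:.2f}")
```

Under these assumptions the speedup climbs toward an asymptote of 2x (2s / (s + 1)) and is already close to it around 8 streams, with only marginal gains beyond that; maybe that’s part of the story, though it doesn’t explain why performance would get *worse* past 8 rather than just flatten.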
Any ideas why this is?