CUDA stream

What would I benefit from in the following code?

cudaStream_t stream[2];
for (int i = 0; i < 2; ++i)
    cudaStreamCreate(&stream[i]);

float* hostPtr;
cudaMallocHost(&hostPtr, 2 * size);  // async copies need pinned host memory

for (int i = 0; i < 2; ++i)
    cudaMemcpyAsync(inputDevPtr + i * size, hostPtr + i * size, size,
                    cudaMemcpyHostToDevice, stream[i]);

for (int i = 0; i < 2; ++i)
    myKernel<<<1000, 512, 0, stream[i]>>>(outputDevPtr + i * size, inputDevPtr + i * size, size);

for (int i = 0; i < 2; ++i)
    cudaMemcpyAsync(hostPtr + i * size, outputDevPtr + i * size, size,
                    cudaMemcpyDeviceToHost, stream[i]);


If only one stream were used for this code, would the performance be worse than with two streams?

What I think is that, after all, these streams have to share the hardware?

The advantage is that with N streams you can hide a large fraction of the cost of copying data to and from the device (up to (N-1)/N of it), because the calls are asynchronous and copies overlap with kernel execution. If you plot the actions occurring on your device on a timeline, you'd have something like:

[copy0HtoD | kernel0/copy1HtoD | kernel1/copy0DtoH | copy1DtoH]

where [name]/[name] denotes concurrent execution. With one stream the same work would serialize into six slots instead of four.

(This might not apply to your case, since it depends on whether your application is memory-bound or compute-bound, but for the example above assume memory copies and kernel execution take the same amount of time.)

They do share the hardware, but between kernel launch and the return to the host, data can be copied to and from the device without penalty (I'm not quite sure the penalty is absolutely zero).
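The one-stream-vs-two question is easy to answer empirically by timing both variants with CUDA events. Below is a minimal sketch; the kernel body, chunk size `n`, and all variable names are placeholders I've assumed, not taken from the original post:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for myKernel from the snippet above.
__global__ void myKernel(float* out, const float* in, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) out[idx] = in[idx] * 2.0f;
}

int main() {
    const int n = 1 << 20;                  // elements per chunk (assumed)
    const size_t size = n * sizeof(float);  // bytes per chunk

    float *hostPtr, *inputDevPtr, *outputDevPtr;
    cudaMallocHost(&hostPtr, 2 * size);     // pinned memory: required for true async copies
    cudaMalloc(&inputDevPtr, 2 * size);
    cudaMalloc(&outputDevPtr, 2 * size);

    cudaStream_t stream[2];
    for (int i = 0; i < 2; ++i)
        cudaStreamCreate(&stream[i]);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Two-stream version: copies in one stream overlap the kernel in the other.
    cudaEventRecord(start);
    for (int i = 0; i < 2; ++i) {
        cudaMemcpyAsync(inputDevPtr + i * n, hostPtr + i * n, size,
                        cudaMemcpyHostToDevice, stream[i]);
        myKernel<<<(n + 511) / 512, 512, 0, stream[i]>>>(
            outputDevPtr + i * n, inputDevPtr + i * n, n);
        cudaMemcpyAsync(hostPtr + i * n, outputDevPtr + i * n, size,
                        cudaMemcpyDeviceToHost, stream[i]);
    }
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("two streams: %.3f ms\n", ms);

    // For comparison, issue every operation into stream[0] only:
    // the same copy/kernel/copy sequence then serializes, and the
    // measured time should approach the sum of all six operations.
    return 0;
}
```

Whether two streams actually beat one depends on the GPU having a copy engine that can run alongside the compute engine; profiling with the events above (or with Nsight Systems) shows whether the overlap really happens on your device.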