Unified programming model across multiple GPUs

As best I understand it, with access to multiple V100 GPUs connected by NVLink, I get a unified memory model, yet I must still write host code that launches kernels in separate streams on each GPU.
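To make the question concrete, here is a minimal sketch of the per-GPU host code I mean (kernel name, sizes, and the even split across devices are just illustrative): one managed allocation is visible to every GPU, but the work still has to be partitioned and launched device by device.

```cuda
#include <cuda_runtime.h>

// Illustrative kernel: scale one chunk of a managed array.
__global__ void scale(float *data, size_t n, float alpha) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] *= alpha;
}

int main() {
    const size_t N = 1 << 24;
    float *data;
    cudaMallocManaged(&data, N * sizeof(float));  // one allocation, visible to all GPUs

    int nDev = 0;
    cudaGetDeviceCount(&nDev);
    if (nDev == 0) return 1;
    size_t chunk = N / nDev;

    // Even with Unified Memory, I must partition the work myself and
    // launch a kernel in a stream on each device by hand.
    for (int d = 0; d < nDev; ++d) {
        cudaSetDevice(d);
        cudaStream_t s;
        cudaStreamCreate(&s);
        size_t off = (size_t)d * chunk;
        size_t len = (d == nDev - 1) ? N - off : chunk;
        scale<<<(len + 255) / 256, 256, 0, s>>>(data + off, len, 2.0f);
        cudaStreamDestroy(s);  // destruction is deferred until the stream's work completes
    }

    // Wait for every device to finish before freeing.
    for (int d = 0; d < nDev; ++d) {
        cudaSetDevice(d);
        cudaDeviceSynchronize();
    }
    cudaFree(data);
    return 0;
}
```

What I am asking about is whether the per-device loop above could disappear behind a single stream managed by the driver.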

Could NVIDIA design a driver offering me a unified stream, one that knows how to distribute my kernel across all the SMs on all the cards in the system, transparently to me?

Am I asking the impossible?

Or am I asking for something that already exists?

I'd be grateful if someone could point me to any existing discussion along these lines.