As best I understand, with multiple V100 GPUs connected by NVLink I get a unified memory model, yet I must still write host code that launches kernels in streams on each GPU separately.
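To make the question concrete, here is a minimal sketch of the kind of host code I mean. The kernel `scale` and the helper `scaleAcrossGpus` are names made up for illustration. I am assuming the buffer comes from `cudaMallocManaged` and that every GPU in the system can access managed memory concurrently, which I believe holds for V100s on Linux but have not verified for every configuration.

```cuda
#include <algorithm>
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical example kernel: scales a slice of an array in place.
__global__ void scale(float* x, size_t n, float factor) {
    size_t i = blockIdx.x * static_cast<size_t>(blockDim.x) + threadIdx.x;
    if (i < n) x[i] *= factor;
}

// Splits [0, n) into one contiguous chunk per visible GPU and launches the
// kernel on each device in its own stream. Returns the number of GPUs found,
// or -1 on error. `managed` is assumed to come from cudaMallocManaged.
int scaleAcrossGpus(float* managed, size_t n, float factor) {
    int devices = 0;
    if (cudaGetDeviceCount(&devices) != cudaSuccess || devices < 1) return -1;

    const size_t chunk = (n + devices - 1) / devices;  // ceiling division
    std::vector<cudaStream_t> streams(devices);
    bool ok = true;

    // The partitioning and per-device launches are done by hand here;
    // this is the part I would like a "unified stream" to take over.
    for (int d = 0; d < devices; ++d) {
        const size_t begin = std::min(n, static_cast<size_t>(d) * chunk);
        const size_t count = std::min(chunk, n - begin);
        ok = (cudaSetDevice(d) == cudaSuccess) && ok;
        ok = (cudaStreamCreate(&streams[d]) == cudaSuccess) && ok;
        if (count == 0) continue;  // more GPUs than work
        const unsigned blocks = static_cast<unsigned>((count + 255) / 256);
        scale<<<blocks, 256, 0, streams[d]>>>(managed + begin, count, factor);
        ok = (cudaGetLastError() == cudaSuccess) && ok;
    }

    // Wait for every device to finish before the host touches the data.
    for (int d = 0; d < devices; ++d) {
        cudaSetDevice(d);
        ok = (cudaStreamSynchronize(streams[d]) == cudaSuccess) && ok;
        cudaStreamDestroy(streams[d]);
    }
    return ok ? devices : -1;
}
```

Even with unified memory doing the data movement, the host still has to enumerate devices, split the index range, and issue one launch per GPU.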
Could NVIDIA design a driver that offers a single unified stream, one that knows how to distribute my kernel across all the SMs on all the cards in the system, transparently to me?
Am I asking the impossible?
Or am I asking for something that already exists?
I would be grateful if someone could point me to any existing discussion along these lines.