Kepler and Maxwell, oh my!

Apparently you lucky people at GTC got to hear the names of the next two (!) CUDA architecture revisions: Kepler in 2011 and Maxwell in 2013:

http://www.electronista.com/articles/10/09…ent.than.fermi/

(As a physicist, I’m really digging these architecture names, BTW. :) )

Can anyone actually at GTC dish any details on new features? Did they actually say anything about Kepler beyond projected performance per watt?

He said almost nothing except three more very brief clues:

  1. Preemption

  2. Virtual memory

  3. Lower CPU dependence

Now the next question is what those three things mean, since they were not expanded upon… so chew those few words any way you like.

Preemption I get, that makes sense. Virtual memory I don’t understand, since we already have zero-copy, so it must mean something else.
Lower CPU dependence I also don’t get (and he may have phrased it slightly differently). As a complete guess it may mean something like GPU self-kernel-scheduling, but Jen-Hsun did not say it that way.
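For reference, the zero-copy we already have lets a kernel dereference pinned host memory directly over PCIe, so “virtual memory” presumably means more than this. A minimal sketch of today’s mechanism (kernel and buffer names are made up):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Toy kernel that reads/writes mapped host memory directly over PCIe.
__global__ void scale(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

int main()
{
    const int n = 1024;
    float *h_buf, *d_buf;

    // Must be set before the CUDA context is created.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    // Pinned, mapped host allocation: the GPU can address it directly.
    cudaHostAlloc(&h_buf, n * sizeof(float), cudaHostAllocMapped);
    for (int i = 0; i < n; ++i) h_buf[i] = 1.0f;

    // Get the device-side alias of the same physical pages.
    cudaHostGetDevicePointer(&d_buf, h_buf, 0);

    scale<<<(n + 255) / 256, 256>>>(d_buf, n);
    cudaThreadSynchronize();  // CUDA 3.x-era sync call

    printf("h_buf[0] = %f\n", h_buf[0]);  // no explicit cudaMemcpy needed
    cudaFreeHost(h_buf);
    return 0;
}
```

Every access goes over the bus, though, so this is nothing like demand paging.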

GTC is awesome as expected… too much to see! Exhibition is also much bigger than last year.

I’d say virtual memory refers to kernels having their own virtual address space, rather than the existing global/shared/register-file/etc. address spaces (or the unified address space on 2.x architectures). There could be added security benefits here, too.

Or perhaps it means a ‘virtual memory pool’ that can exceed the GPU’s physical memory limits but is addressable just like normal memory, with the device somehow scheduling the paging of data in and out (via zero-copy) from the host. Since the host in turn has its own virtual memory, that would make the effective memory limit of a GPU equal to GPU memory + host memory + host virtual memory size (page/swap file size)…

Just uneducated guesses though.
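You can approximate such a pool by hand today, staging chunks of an oversized host array through a smaller device buffer; a rough sketch of the idea (names invented, error checking omitted):

```cuda
#include <cuda_runtime.h>

// Process a host array larger than device memory by streaming
// fixed-size chunks through one device buffer. A hardware-managed
// "virtual memory pool" would, in effect, automate this loop.
void process_oversized(float *h_data, size_t total, size_t chunk)
{
    float *d_buf;
    cudaMalloc(&d_buf, chunk * sizeof(float));

    for (size_t off = 0; off < total; off += chunk) {
        size_t n = (total - off < chunk) ? (total - off) : chunk;
        cudaMemcpy(d_buf, h_data + off, n * sizeof(float),
                   cudaMemcpyHostToDevice);        // "page in"
        // ... launch a kernel on d_buf here ...
        cudaMemcpy(h_data + off, d_buf, n * sizeof(float),
                   cudaMemcpyDeviceToHost);        // "page out"
    }
    cudaFree(d_buf);
}
```

Doing this transparently in hardware would need exactly the page tables and fault handling people mean by “virtual memory”.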

Lower CPU dependence I’d assume means the GPU can actively run without a driver, as if it had its own internal OS keeping things in order. This means you could essentially have ‘driver’ kernels running 24/7 on the GPU, which then schedule workloads. (It also likely means NVIDIA could offload a lot of the driver logic into CUDA space and keep only OS-dependent interfacing in the host-side drivers, avoiding user<->kernel space and host-device-locking overheads.)

Again, purely wild speculation here to get the ball rolling ;)
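For what it’s worth, people already emulate long-running ‘driver’ kernels today with persistent kernels polling zero-copy memory; a toy sketch of the idea (all names invented):

```cuda
#include <cuda_runtime.h>

// A "persistent" kernel that spins on a host-written flag in mapped
// zero-copy memory -- one way to emulate a resident GPU-side
// scheduler today. (volatile forces the flag to be re-read.)
__global__ void driver_kernel(volatile int *command)
{
    while (*command == 0)
        ;  // spin until the host posts work
    // ... dispatch the requested workload here ...
}

int main()
{
    volatile int *h_cmd;
    int *d_cmd;
    cudaSetDeviceFlags(cudaDeviceMapHost);
    cudaHostAlloc((void **)&h_cmd, sizeof(int), cudaHostAllocMapped);
    *h_cmd = 0;
    cudaHostGetDevicePointer(&d_cmd, (void *)h_cmd, 0);

    driver_kernel<<<1, 1>>>(d_cmd);  // runs until the host sets the flag
    *h_cmd = 1;                      // host "schedules" work
    cudaThreadSynchronize();
    cudaFreeHost((void *)h_cmd);
    return 0;
}
```

On current hardware this monopolizes an SM and fights the watchdog timer, which is exactly what preemption would fix.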

Online sources reported this topic as “Virtualization”. So it seems more oriented towards the server-farm people who need to share a GPU among several guest OSes. Or maybe to run virtual instances of a GPU on a single hardware device.

Christian

i++ :)

But I think there is also a semiconductor company called Maxwell. Hmm… I wonder if that will cause trouble later on? For similar reasons Parallel Nsight got its current name - the previous name, Nexus, was already taken (trademark issue).

Hi Seibert, thanks for starting this thread! Wow!

My guesses:

Pre-emption
– Will make it possible to run as many CUDA kernels in parallel as desired, just like a CPU runs any number of programs without letting any one of them hog the CPU.
– Maybe we will get an option to control the scheduling policy - whether to use preemption or not.

Virtual memory
– Will make sure that “preemption” cuts “across” contexts, breaking Fermi’s limitation of running concurrent kernels only within the same context.
– And maybe it can use CPU RAM as a secondary store for a kernel’s physical data (just like an OS pages memory to disk, the GPU could page its memory to host RAM). This would be required to support “any” number of kernels running at the same time.
– But then it would be very slow. Think of thrashing across the PCIe bus… ;-)

Minimal CPU
– GPU firmware can make more informed decisions to make “pre-emption” work, like paging in and out of system RAM etc.
– It need not depend on the driver code (as someone rightly pointed out earlier… SPWorley? lazy to scroll…)
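To illustrate that Fermi limitation: concurrent kernel execution today works only via streams within a single context, something like this sketch (names made up):

```cuda
#include <cuda_runtime.h>

__global__ void work(float *p, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] += 1.0f;
}

int main()
{
    const int n = 1 << 20;
    float *a, *b;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));

    // Two streams in the SAME context: on Fermi these kernels may
    // overlap on the device. Kernels from different contexts cannot.
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    work<<<(n + 255) / 256, 256, 0, s0>>>(a, n);
    work<<<(n + 255) / 256, 256, 0, s1>>>(b, n);

    cudaThreadSynchronize();
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(a);
    cudaFree(b);
    return 0;
}
```

Cross-context sharing of the GPU would need the hardware to time-slice and page contexts itself.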

O… Smokey!!! It was not SPWorley as in my previous post.

Taken together, it sounds like the theme for Kepler is “support pre-emptive multitasking on the GPU.” (Much like I think you can summarize many of the interesting features in Fermi as coming from the mandate “support C++ without killing performance significantly.”)

Once you have preemptive multitasking, then GPU scheduling on a multi-user system will work just like it does for CPUs. Load increases will be handled gracefully, no more watchdog nonsense (unless you really screw up the driver), and lots of programs can safely include GPU acceleration paths without having to worry over whether they will collide with each other.

Of course, graceful handling of increased loads will benefit from GPU virtual memory, much like the 3D drivers provide. Being able to properly include host memory into the GPU memory hierarchy would be nice, except you have this problem:

registers -> shared memory/L1 -> L2 -> global memory -> [PCI-E 2.0 -> QPI/HT/direct to CPU] -> host memory

The part in brackets constricts the data flow between global and host memory to levels well below the performance of either end. PCI-E 3.0 (ready in time for Kepler) jumping the bandwidth up to 16 GB per sec will help, assuming that chipsets can talk to their respective CPUs at that rate. That still comes in below the full bandwidth of triple channel DDR3, but at least it is some improvement and would make “swapping to host memory” less painful.

I really wish that #3 meant that NVIDIA was going to partner up with VIA to drop a 64-bit general-purpose supervisory processor onto the GPU die and effectively bring the Cell architecture up to date. (I think VIA is the only company left with an x86-64 core that isn’t trying to put NVIDIA out of business…) That’s probably a little extreme, but would make for really awesome GPU blade servers…

Something else that came to mind: if they do manage to lower CPU dependence by putting more general driver logic into the GPU itself, they could probably do SLI without any real CPU/chipset requirements… it’d purely be the GPUs coordinating over the SLI link(s) :)
