Kepler and Maxwell, oh my!

Apparently you lucky people at GTC got to hear the names of the next two (!) CUDA architecture revisions: Kepler in 2011 and Maxwell in 2013:

http://www.electronista.com/articles/10/09…ent.than.fermi/

(As a physicist, I’m really digging these architecture names, BTW. :) )

Can anyone actually at GTC dish any details on new features? Did they actually say anything about Kepler beyond projected performance per watt?

He said almost nothing except three more very brief clues:

  1. Preemption

  2. Virtual memory

  3. Lower CPU dependence

Now the next question is what those three things mean, since they were not expanded upon… so chew those few words any way you like.

Preemption I get, that makes sense. Virtual memory I don’t understand, since we already have zero-copy, so it must mean something else.
Lower CPU dependence I also don’t get (and he may have phrased it slightly differently). As a complete guess it may mean something like GPU self-kernel-scheduling, but Jen-Hsun did not say it that way.
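For reference, the zero-copy we already have lets a kernel dereference pinned host memory directly over PCIe, so “virtual memory” presumably means more than this. A minimal sketch of today’s mechanism (kernel and buffer names are made up):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Toy kernel that reads/writes mapped host memory directly over PCIe.
__global__ void scale(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

int main()
{
    const int n = 1024;
    float *h_buf, *d_buf;

    // Must be set before the CUDA context is created.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    // Pinned, mapped host allocation: the GPU can address it directly.
    cudaHostAlloc(&h_buf, n * sizeof(float), cudaHostAllocMapped);
    for (int i = 0; i < n; ++i) h_buf[i] = 1.0f;

    // Get the device-side alias of the same physical pages.
    cudaHostGetDevicePointer(&d_buf, h_buf, 0);

    scale<<<(n + 255) / 256, 256>>>(d_buf, n);
    cudaThreadSynchronize();  // CUDA 3.x-era sync call

    printf("h_buf[0] = %f\n", h_buf[0]);  // no explicit cudaMemcpy needed
    cudaFreeHost(h_buf);
    return 0;
}
```

Every access goes over the bus, though, so this is nothing like demand paging.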

GTC is awesome as expected… too much to see! Exhibition is also much bigger than last year.

I’d say virtual memory refers to kernels having their own virtual address space, rather than the existing global/shared/register-file/etc. address spaces (or the unified address space on 2.x architectures). There could be added security benefits here, too.

Or perhaps it means a ‘virtual memory pool’ that can exceed the GPU’s physical memory limits but is addressable just like normal memory, with the device somehow scheduling the paging of data in and out (via zero-copy) from the host. Since the host in turn has its own virtual memory, that would make the effective memory limit of a GPU equal to GPU memory + host memory + host virtual memory size (page/swap file size)…

Just uneducated guesses though.
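You can approximate such a pool by hand today, staging chunks of an oversized host array through a smaller device buffer; a rough sketch of the idea (names invented, error checking omitted):

```cuda
#include <cuda_runtime.h>

// Process a host array larger than device memory by streaming
// fixed-size chunks through one device buffer. A hardware-managed
// "virtual memory pool" would, in effect, automate this loop.
void process_oversized(float *h_data, size_t total, size_t chunk)
{
    float *d_buf;
    cudaMalloc(&d_buf, chunk * sizeof(float));

    for (size_t off = 0; off < total; off += chunk) {
        size_t n = (total - off < chunk) ? (total - off) : chunk;
        cudaMemcpy(d_buf, h_data + off, n * sizeof(float),
                   cudaMemcpyHostToDevice);        // "page in"
        // ... launch a kernel on d_buf here ...
        cudaMemcpy(h_data + off, d_buf, n * sizeof(float),
                   cudaMemcpyDeviceToHost);        // "page out"
    }
    cudaFree(d_buf);
}
```

Doing this transparently in hardware would need exactly the page tables and fault handling people mean by “virtual memory”.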

Lower CPU dependence I’d assume means the GPU can actively run without a driver, as if it had its own internal OS keeping things in order. This means you could essentially have ‘driver’ kernels running 24/7 on the GPU, which then schedule workloads. (It also likely means NVIDIA could offload a lot of the driver logic into CUDA space and keep only OS-dependent interfacing in the host-side drivers, avoiding user<->kernel space and host-device-locking overheads.)

Again, purely wild speculation here to get the ball rolling ;)
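For what it’s worth, people already emulate long-running ‘driver’ kernels today with persistent kernels polling zero-copy memory; a toy sketch of the idea (all names invented):

```cuda
#include <cuda_runtime.h>

// A "persistent" kernel that spins on a host-written flag in mapped
// zero-copy memory -- one way to emulate a resident GPU-side
// scheduler today. (volatile forces the flag to be re-read.)
__global__ void driver_kernel(volatile int *command)
{
    while (*command == 0)
        ;  // spin until the host posts work
    // ... dispatch the requested workload here ...
}

int main()
{
    volatile int *h_cmd;
    int *d_cmd;
    cudaSetDeviceFlags(cudaDeviceMapHost);
    cudaHostAlloc((void **)&h_cmd, sizeof(int), cudaHostAllocMapped);
    *h_cmd = 0;
    cudaHostGetDevicePointer(&d_cmd, (void *)h_cmd, 0);

    driver_kernel<<<1, 1>>>(d_cmd);  // runs until the host sets the flag
    *h_cmd = 1;                      // host "schedules" work
    cudaThreadSynchronize();
    cudaFreeHost((void *)h_cmd);
    return 0;
}
```

On current hardware this monopolizes an SM and fights the watchdog timer, which is exactly what preemption would fix.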

Online sources reported this topic as “Virtualization”. So it seems more oriented towards the server-farm people who need to share a GPU among several guest OSes. Or maybe to run virtual instances of a GPU on a single hardware device.

Christian

i++ :)

But I think there is also a semiconductor company called Maxwell. Hmm… I wonder if that will cause trouble later on? For similar reasons Parallel Nsight got its current name - the previous name, Nexus, was already taken (trademark issue).

Hi Seibert, thanks for starting this thread! Wow!

My guesses:

Pre-emption
– Will make it possible to run as many CUDA kernels in parallel as desired, just like a CPU runs any number of programs without letting any one of them hog the CPU.
– Maybe we will get an option to control the scheduling policy - whether to use preemption or not.

Virtual memory
– Will make sure that “preemption” cuts “across” contexts, breaking Fermi’s limitation of running concurrent kernels only within the same context.
– And maybe it can use CPU RAM as a secondary store for a kernel’s physical data (just like an OS pages memory to disk, the GPU could page its memory to host RAM). This would be required to support “any” number of kernels running at the same time.
– But then it would be very slow. Think of thrashing across the PCIe bus… ;-)

Minimal CPU
– GPU firmware can make more informed decisions to make “pre-emption” work, like paging in and out of system RAM etc.
– It need not depend on the driver code (as someone rightly pointed out earlier… SPWorley? lazy to scroll…)
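To illustrate that Fermi limitation: concurrent kernel execution today works only via streams within a single context, something like this sketch (names made up):

```cuda
#include <cuda_runtime.h>

__global__ void work(float *p, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] += 1.0f;
}

int main()
{
    const int n = 1 << 20;
    float *a, *b;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));

    // Two streams in the SAME context: on Fermi these kernels may
    // overlap on the device. Kernels from different contexts cannot.
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    work<<<(n + 255) / 256, 256, 0, s0>>>(a, n);
    work<<<(n + 255) / 256, 256, 0, s1>>>(b, n);

    cudaThreadSynchronize();
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(a);
    cudaFree(b);
    return 0;
}
```

Cross-context sharing of the GPU would need the hardware to time-slice and page contexts itself.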

O… Smokey!!! It was not SPWorley as in my previous post.

Taken together, it sounds like the theme for Kepler is “support pre-emptive multitasking on the GPU.” (Much like I think you can summarize many of the interesting features in Fermi as coming from the mandate “support C++ without killing performance significantly.”)

Once you have preemptive multitasking, then GPU scheduling on a multi-user system will work just like it does for CPUs. Load increases will be handled gracefully, no more watchdog nonsense (unless you really screw up the driver), and lots of programs can safely include GPU acceleration paths without having to worry over whether they will collide with each other.

Of course, graceful handling of increased loads will benefit from GPU virtual memory, much like the 3D drivers provide. Being able to properly include host memory into the GPU memory hierarchy would be nice, except you have this problem:

registers -> shared memory/L1 -> L2 -> global memory -> [PCI-E 2.0 -> QPI/HT/direct to CPU] -> host memory

The part in brackets constricts the data flow between global and host memory to levels well below the performance of either end. PCI-E 3.0 (ready in time for Kepler) jumping the bandwidth up to 16 GB per sec will help, assuming that chipsets can talk to their respective CPUs at that rate. That still comes in below the full bandwidth of triple channel DDR3, but at least it is some improvement and would make “swapping to host memory” less painful.

I really wish that #3 meant that NVIDIA was going to partner up with VIA to drop a 64-bit general-purpose supervisory processor onto the GPU die and effectively bring the Cell architecture up to date. (I think VIA is the only company left with an x86-64 core that isn’t trying to put NVIDIA out of business…) That’s probably a little extreme, but would make for really awesome GPU blade servers…

Something else that came to mind: if they do manage to lower CPU dependence by putting more general driver logic into the GPU itself, they could probably do SLI without any real CPU/chipset requirements… it’d purely be the GPUs coordinating over the SLI link(s) :)
