Tesla M10

anon56509511 · May 18, 2016, 4:06pm

Hello.
The new Tesla M10 was announced today (availability August 2016).

Are there 4x GM107 (Maxwell Gen 1) inside the card?
Why is the H.264 performance so low?

Is Nvidia trying to kill hardware-accelerated (H.264) remoting protocols (available to fewer than half of the maximum number of VDI users)?
Or is this simply an error (Maxwell Gen 1 H.264 NVENC is about 25% slower than Maxwell Gen 2 -> 18*0.75*4=54, not enough but better)?
Or are there some memory constraints (speed and/or capacity)?
Or will VDI with external encoders be the only solution (https://gridforums.nvidia.com/default/topic/752/grid-vgpu-benchmarks/vdi-click-to-photon-with-raspberry-pi/)?

M.C>

RachelBerry · May 18, 2016, 7:48pm

Hi M.C,

It’s a GM206 (similar to the Quadro M2000). The K1 had 12 encoders per board (3 per GPU), this has 7. This board is envisaged for the same sets of applications and end-use cases as the K1, only bigger/better.

This is very much targeted at mainstream VDI office/business apps and also business XenApp. In XenApp and RDS scenarios the VDAs do not use H.264 or NVENC, so it’s irrelevant. In mainstream VDI most of the protocols are also non-H.264 based, and even those that are H.264 based are not hardware accelerated, e.g. Blast Extreme’s JPEG/other options, Citrix’s Thinwire compatibility/H.264 on the standard VDA etc. Indeed, Citrix’s recommendation for server scalability (see their templates) is to use the non-H.264 based Thinwire compatibility mode, and even their high-end VDI HDX 3D Pro VDA only has NVENC for Linux rather than Windows.

Blast Extreme has NVENC for single monitor, but that’s very much targeted at high-end apps/3D use cases, which are more suited to the complementary M60 product. When you are using 3D/CAD apps, other resources are going to limit your scalability.

Make sense?
Rachel

anon56509511 · May 18, 2016, 10:05pm

Hello.

I did not find any GM206 with 640 cores (5 SMM), only 768 (6 SMM) or 1024 (8 SMM).
If it is GM206 (Maxwell Gen 2), why is the H.264 performance so low (it should be capable of 4*18=72 1080p30 streams)?

As I understand the NVENC architecture, there is only one NVENC block per GPU ("some" Maxwell chips have 2 blocks) (https://developer.nvidia.com/application-note), and "X * 1080p30" is only a performance-level indicator (the application note uses the maximum FPS at 1280x720 to compare different chip generations and encoder setups).
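
To make the "X * 1080p30" indicator easier to compare with the 1280x720 figures, here is a minimal sketch that converts a stream rating into raw pixel throughput and back into an equivalent frame rate at another resolution. It assumes encoder throughput scales linearly with pixel count, which is a simplification and not something NVidia states; the function names are made up for this illustration.

```python
# Convert an "N x 1080p30" rating into pixel throughput and an equivalent
# frame rate at another resolution.
# ASSUMPTION: encoder throughput scales linearly with the number of pixels.

def pixels_per_second(width, height, fps, streams=1):
    """Raw pixel rate for `streams` concurrent streams of width x height @ fps."""
    return width * height * fps * streams


def equivalent_fps(streams_1080p30, width, height):
    """Total frames per second at width x height that fit in the same pixel budget."""
    budget = pixels_per_second(1920, 1080, 30, streams_1080p30)
    return budget / (width * height)


if __name__ == "__main__":
    # 18 streams is the per-GPU Maxwell Gen 2 figure used above; 72 = 4 * 18.
    for streams in (18, 72):
        print(streams, "x 1080p30 =",
              pixels_per_second(1920, 1080, 30, streams), "pixels/s =",
              equivalent_fps(streams, 1280, 720), "fps at 1280x720")
```

Under that assumption, one 1080p30 stream is about 62.2 Mpixel/s, so the stream count is just the encoder's pixel budget expressed in a convenient unit.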

M.C>

PS: VMware bloggers combine M10 and H.264 offloading - https://blogs.vmware.com/euc/2016/05/user-experience-nvidia-tesla-m10-vmware-horizon-7.html. NVidia should contact VMware and explain this issue.
PS1: "Tesla M10" seems to be an upgrade (double-density RAM and quadruple price) of the mysterious "Grid M40" (not "Tesla M40", see the other thread https://gridforums.nvidia.com/default/topic/825/general-discussion/nvidia-grid-m40-cards-anyone-know-about-these-/ -> http://www.servethehome.com/nvidia-grid-m40-4x-maxwell-gpus-16gb-ram-cards/).
PS2: pciid 10de:13bd GM107GL [GRID M40], 10de:1389 GM107GL [GRID M30] …

Update:
Tesla M10 is a Grid M40 (pciid 10de:13bd, pcisubsystemid 10de:1160) with more RAM but with the same PCB and the same pciid (see the identification in the picture below)!
See also the "virtualization" profiles on the right side of the picture: "*A" (Virtual Apps license), "*B" (Virtual PC license) and "*Q" (Virtual Workstation license) profiles.

How to theoretically "upgrade" a Grid M40 to a Tesla M10 :-)

1. Buy a Grid M40 for $1000 on eBay.
2. Buy 32x K4G80325FB-HC03 (GDDR5 SGRAM, 8Gbit, 16K/32ms, x32, 16Banks, 1.5V, 170FBGA, 0.33ns (6000Mbps)) ($6 * 32 = $192).
3. Replace the 32x K4G41325FC-HC03 (released 2012) with K4G80325FB-HC03 (released 2015, mass production from Jan 2016) (see BGA rework).
4. The BIOS and InfoROM contents may also need to be replaced.

Grid M40 back

Grid M10 back

anon56509511 · May 19, 2016, 10:48pm

Can NVidia publish the exact specification for the "Tesla M10"?

For example, from @nvidiagrid https://twitter.com/rspruijt/status/733288807722782720:

Should I assume that GM206 has a 256-bit memory bus (both GM107 and GM206 have only 128-bit)?
Should I assume that DDR5 exists (GDDR5 yes!)?
Should I assume that all Grid 2.0 cards have the same 4612 GFLOPS?

Now updated and corrected by @nvidiagrid.

JasonSouthernNV · May 20, 2016, 8:43am

Exact specifications will be published when the board is released.

Ruben’s slide has errors on it, which are being addressed; don’t assume anything.

You have now. The M10 only has 5 SMM enabled to allow it to fit in the 225W power envelope.

Building a board to support 128 users is a series of compromises; if you want high performance, there’s the M60 with its 300W power draw.

Correct, and it is related to the SMMs. The stream count value is there to simplify the comparison between boards and has been used across the GRID related boards.

The M10 is not targeted at 3D and Pro-Vis workloads, and based on experience with K1s, each user is unlikely to be generating a constant 30fps. So while the stream count is lower, the load is anticipated to be lower too.
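
The argument above can be put into rough numbers. The sketch below takes the board-level figure of 28x 1080p30 quoted later in this thread and spreads it evenly over the users. It assumes every desktop is 1080p and that the encode budget can be pooled across the whole board, which is a simplification since each GPU has its own encoder; the helper names are hypothetical.

```python
# Back-of-the-envelope encode budget for a multi-user board.
# ASSUMPTIONS: 28 x 1080p30 per board (figure quoted in this thread),
# all desktops are 1080p, and the budget is shared evenly across users.

def average_fps_per_user(streams, fps_per_stream, users):
    """Average encoded frame rate per user if the budget is shared evenly."""
    return streams * fps_per_stream / users


def max_users(streams, fps_per_stream, fps_per_user):
    """Number of users that fit in the budget at a given average frame rate."""
    return int(streams * fps_per_stream // fps_per_user)


if __name__ == "__main__":
    for users in (64, 128):
        print(users, "users ->", average_fps_per_user(28, 30, users), "fps each")
    print("users at an average of 10 fps:", max_users(28, 30, 10))
```

With these assumptions, 64 users average about 13 fps each and 128 users about 6.5 fps each, so the board only keeps up if most sessions spend most of their time well below 30fps.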

anon56509511 · May 20, 2016, 11:23pm

Update:
[s]Ok, let’s translate this to the HW point of view. Because NVidia is fabless, TSMC usually makes the chips. All GM206 chips (2,940 million transistors, size 228 mm², process 28 nm, 8x SMM, 2x GPC, 2x 64-bit memory controller) are the same.

But on some chips not every unit works.

- all 8 SMM (1024 cores) work - sold as GTX960, M4
- only 6 SMM (768 cores) work - sold as GTX950, GTX950LP, M2000
- only 5 SMM (640 cores) work - sold as M10
- only 4 SMM (512 cores) work - sold as GTX750v2

There should be many 5 SMM chips waiting to be sold after one year of production, but I do not understand why the Grid line should be the target for 5 SMM chips. NVidia should sell them in the GTX line.

I do not think power capping is the reason for using the 5 SMM variant (the published TDP of the M10 is 225W):

M4 (8 SMM) has a TDP of 50W (e.g. 4*50W=200W + some W for double memory + 7W for the PEX chip) (M2000 has a TDP of 75W but also 4x DisplayPort to feed)
M10 will not get a higher clock than M4 because of the doubled GDDR5 chips on the bus.
(yes, K1 has GK107 with only 1 SMX (192 cores) but a TDP of 130W! (K340 has GK107 with both SMX (384 cores) and a TDP of 225W))[/s]

~~NVidia should rethink the chip used for the M10 (e.g. use the 8 SMM variant) or lower the TDP requirements.~~

Ok, the M10 is old (2014), low-end (GM107; GM108 < GM107 < GM206 < GM204 < GM200) and overpriced.

CAPEX version: M10 price ($2800 @ SuperMicro GPU-NVTM10) + 64x VDI NVidia licensing CCU (Grid4: VPC Perpetual + 1x required SUM for the first year) = $2800 + 64*$125 = $10800 for a fully working card (i.e. $169 per VDI on 2560/64 core shares).

It can be compared to Grid1 with 2x K1 + 64x VDI (no NVidia license needed) = 2x $1700 = $3400 (GRID K1 @ amazon $1700, i.e. $53 per VDI on 1536/64 core shares).

Grid1 vs. Grid4 comparison for 64 VDI: speedup 2.0x (= (2560*1033MHz)/(1536*850MHz)) and price 3.2x (= $169/$53).

You must buy the old and overpriced "Nvidia Tesla M10" to make NVDA shareholders happier! http://finance.yahoo.com/news/nvidia-the-yahoo-finance-company-of-the-year-173130275.html
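
For anyone who wants to redo the comparison with their own prices, here is a small sketch of the arithmetic. The prices, core counts and clocks are the ones quoted in this post; the function names are made up for illustration, and "cores x clock" is only a crude throughput proxy.

```python
# Per-seat cost and crude throughput comparison for 64 VDI seats.
# ASSUMPTION: cores * clock is used as a rough proxy for GPU throughput.

def cost_per_seat(board_price, boards, license_price, seats):
    """Return (total cost, cost per seat) for boards plus per-seat licenses."""
    total = board_price * boards + license_price * seats
    return total, total / seats


def throughput_ratio(cores_a, mhz_a, cores_b, mhz_b):
    """Ratio of cores * clock between configuration A and configuration B."""
    return (cores_a * mhz_a) / (cores_b * mhz_b)


if __name__ == "__main__":
    m10_total, m10_seat = cost_per_seat(2800, 1, 125, 64)  # 1x M10 + Grid4 licenses
    k1_total, k1_seat = cost_per_seat(1700, 2, 0, 64)      # 2x K1, no license
    speedup = throughput_ratio(2560, 1033, 1536, 850)
    print(f"M10: ${m10_total} total, ${m10_seat:.2f} per seat")
    print(f"2x K1: ${k1_total} total, ${k1_seat:.2f} per seat")
    print(f"speedup {speedup:.1f}x, price {m10_seat / k1_seat:.1f}x")
```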

An explanatory answer is still missing: why is the M10 NVENC encoding speed so low?

Ok, the M10 is not for 3D apps (though even the Windows desktop compositor is itself a 3D killer application), but H.264 encoding at full speed is essential. It is easy to write "not be generating a constant 30fps", but how do you make the software fulfill this vision?

Yes, the CPU-based protocols can be used, but for modern VDI the DomU vCPU must be spared for user work (office applications, browsers with hungry JavaScript …). And how many vCPUs can a DomU get if you run 64 or 128 VDI on a 2-way or 4-way CPU system?
Yes, an external accelerator can be used (PCoIP or another external encoder).
And lastly, everyone wants to use H.264 NVENC, but how do you implement "not be generating a constant 30fps" with GridSDK and NVENC (e.g. detect changes in the frame buffer)?
1. NVFBCToCuda - use CUDA to detect changes - it does not work because vGPU does not support CUDA!
2. NVFBCToSys/NVIFRToSys - use the CPU to detect changes - it works, but you lose DomU vCPU again.
3. NVFBCToHWEnc/NVIFRToHWEnc - use NVENC to detect changes and generate the H.264 stream - it does not work because NVENC on the M10 cannot fulfill 30 FPS requests for all DomUs!
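
To illustrate what "detect changes in the frame buffer" means in option 2, here is a minimal CPU-side sketch that checksums fixed-size tiles of a captured frame and reports a frame as worth encoding only if some tile changed. It does not use GridSDK, NVFBC or NVIFR at all: the frame is assumed to be a packed BGRA byte buffer already copied to system memory, and `FramePacer`, `tile_checksums` and `changed_tiles` are hypothetical names for this example.

```python
import zlib

# CPU-side dirty-tile detection on a packed frame buffer (4 bytes per pixel).
# ASSUMPTION: the frame was already captured into system memory; no GridSDK
# call is shown or implied here.

def tile_checksums(frame, width, height, tile=64, bpp=4):
    """Return {(x, y): crc32} for every tile x tile block of the frame."""
    stride = width * bpp
    sums = {}
    for ty in range(0, height, tile):
        for tx in range(0, width, tile):
            crc = 0
            for y in range(ty, min(ty + tile, height)):
                start = y * stride + tx * bpp
                end = y * stride + min(tx + tile, width) * bpp
                crc = zlib.crc32(frame[start:end], crc)
            sums[(tx, ty)] = crc
    return sums


def changed_tiles(prev_sums, cur_sums):
    """List the tile origins whose checksum differs from the previous frame."""
    return [pos for pos, crc in cur_sums.items() if prev_sums.get(pos) != crc]


class FramePacer:
    """Decide, frame by frame, whether anything needs to go to the encoder."""

    def __init__(self, width, height, tile=64):
        self.width, self.height, self.tile = width, height, tile
        self.prev = None

    def should_encode(self, frame):
        cur = tile_checksums(frame, self.width, self.height, self.tile)
        dirty = list(cur) if self.prev is None else changed_tiles(self.prev, cur)
        self.prev = cur
        return bool(dirty), dirty


if __name__ == "__main__":
    pacer = FramePacer(128, 64)
    frame = bytearray(128 * 64 * 4)
    print(pacer.should_encode(bytes(frame)))  # first frame: every tile is new
    print(pacer.should_encode(bytes(frame)))  # identical frame: nothing to encode
    frame[10 * 128 * 4 + 100 * 4] = 255       # touch one pixel at x=100, y=10
    print(pacer.should_encode(bytes(frame)))  # only the tile at (64, 0) is dirty
```

The per-frame checksum pass is exactly the guest vCPU cost that option 2 complains about, so this only shows the idea, not a way around the trade-off.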

Can the NVidia SW/HW team share how the M10 will deliver modern NVENC-accelerated VDI with such a slow NVENC (only 28x 1080p@30 for 64x VDI)?

Yours sincerely, M.C>

PS: deleted

JasonSouthernNV · May 21, 2016, 2:49pm

Rather a combative response.

There are no marketing people on this forum. Most posters on here hold either an advanced engineering degree or PhD, but you’re often asking for material information that is considered commercially sensitive or unpublished.

anon56509511 · May 21, 2016, 6:42pm

Sorry. I will never ask for anything in this forum again.

JasonSouthernNV · May 21, 2016, 10:15pm

PM Sent.

RachelBerry · May 23, 2016, 9:39am

Hi M.C.,

We are actively seeking applications for our community advisor program. It’s a forum where users can talk to the product management and engineering team. It’s simply not Jason’s and my remit to release everything on the product, and we often aren’t best placed to do so. Please do consider applying: https://gridforums.nvidia.com/default/topic/788/announcements/nvidia-seeking-grid-community-advisors-/

Best wishes,
Rachel

RachelBerry · May 23, 2016, 9:48am

> Can the NVidia SW/HW team share how the M10 will deliver modern NVENC-accelerated VDI with such a slow NVENC (only 28x 1080p@30 for 64x VDI)?

Hi M.C,

Different cards have different use cases and we balance the customer needs for cost and features. This card is aimed at office and business applications used in a VDI (vPC/vWks) context. Most of the VDI protocols don’t even support H.264 (the exceptions being Blast Extreme for 1 monitor and the smaller NICE DCV product). PCoIP requires additional hardware, which comes on top of the board you are already using to support application acceleration.

Full screen video for 64 VDI users is very much a niche use case and not a mass VDI/apps use. In a business context, large areas of the screen don’t change (think about editing a Word doc or scrolling a webpage), so it’s completely different to streaming high-end full screen video. Other cards offer other options, e.g. the M60, and we also work with partners such as PCoIP to cover niches where they specialise.

Car manufacturers have different cars for different use cases… GPUs are similar. A more realistic workload would be a medium LoginVSI run (hacked to make it consistent), where you will see that certain workloads (e.g. the Mario game, PPT, video sections) use H.264 more intensely, but a large proportion do not. If you also look at the cumulative bandwidth (so the height of the steps in the graph reflects workload usage), you’ll see the bandwidth go up for the video/Mario sections but remain remarkably low for much of the LoginVSI run on business applications. This lower bandwidth means higher framerates are very possible.

Best wishes,
Rachel

JasonSouthernNV · May 24, 2016, 8:53am

Update to the information above.

M10 will be built around the GM107 not the GM206.

RachelBerry · May 24, 2016, 9:08am

RachelBerry:

Hi M.C,

It’s a ~~GM206 (Similar to the Quadro M2000)~~ Update: I gave M.C wrong info - it is, as he states, a GM107. Apologies! The K1 had 12 encoders per board (3 per GPU), this has 7. This board is envisaged for the same sets of applications and end-use cases as the K1, only bigger/better.

This is very much targeted at mainstream VDI office/business apps and also business XenApp. In XenApp and RDS scenarios the VDAs do not use H.264 or NVENC, so it’s irrelevant. In mainstream VDI most of the protocols are also non-H.264 based, and even those that are H.264 based are not hardware accelerated, e.g. Blast Extreme’s JPEG/other options, Citrix’s Thinwire compatibility/H.264 on the standard VDA etc. Indeed, Citrix’s recommendation for server scalability (see their templates) is to use the non-H.264 based Thinwire compatibility mode, and even their high-end VDI HDX 3D Pro VDA only has NVENC for Linux rather than Windows.

Blast Extreme has NVENC for single monitor, but that’s very much targeted at high-end apps/3D use cases, which are more suited to the complementary M60 product. When you are using 3D/CAD apps, other resources are going to limit your scalability.

Make sense?
Rachel

anon56509511 · December 5, 2016, 10:06pm

The NVidia GRID team was unwilling to answer, but the NVidia Codec team has finally disclosed the root cause: GM107 (M10) has 1 NVENC, but GM204 (M6/M60) has 2 NVENC (see https://developer.nvidia.com/video-encode-decode-gpu-support-matrix).
