Hi folks, I wonder if anyone here attended this webinar.
I’ve read that the speaker answered questions related to “Device to device memory transfer” and “up to 1 TB of memory”.
If you heard this part of the talk, could you sum up what was said on these two topics?
I watched it…for the 1TB of memory thing, he (Sumit Gupta) said that the Fermi architecture can address that amount of memory; I assume that means they’re using 40-bit addressing. He also said that while the new Teslas will have 3GB or 6GB of memory (depending on the model), they are looking into even higher amounts for future devices.
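The 40-bit guess checks out arithmetically: 2^40 bytes is exactly 1 TB. A quick sketch to sanity-check that, and to read back how much global memory your current card actually reports (assumes a CUDA-capable device 0 is present):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // device 0, assumed present

    printf("%s: %zu bytes of global memory\n", prop.name, prop.totalGlobalMem);
    printf("40-bit address space: %llu bytes = 1 TB\n", 1ULL << 40);
    return 0;
}
```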
Fermi has 2 DMA engines, so you can move data into and out of GPU memory (from/to system RAM) simultaneously, alongside kernel execution…
This feature is necessary because of their multiple-kernel execution strategy. It opens up parallelism at a new level.
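The way you'd exploit the two DMA engines from CUDA is with streams and async copies on pinned host memory. A minimal sketch (the buffers, sizes, and the `process` kernel are placeholders, not anything from the talk):

```cuda
#include <cuda_runtime.h>

__global__ void process(float *buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] *= 2.0f;   // placeholder work
}

// Assumes h_in/h_out are pinned host buffers (cudaMallocHost) and d_a/d_b
// are device buffers, each holding n floats. With two DMA engines, the
// upload in s[0], the kernel, and the download in s[1] can all be in
// flight at the same time.
void overlap(float *h_in, float *h_out, float *d_a, float *d_b, int n) {
    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    // host->device copy on one engine, then a kernel on that data...
    cudaMemcpyAsync(d_a, h_in, n * sizeof(float), cudaMemcpyHostToDevice, s[0]);
    process<<<(n + 255) / 256, 256, 0, s[0]>>>(d_a, n);

    // ...while a device->host copy of earlier results uses the other engine
    cudaMemcpyAsync(h_out, d_b, n * sizeof(float), cudaMemcpyDeviceToHost, s[1]);

    cudaStreamSynchronize(s[0]);
    cudaStreamSynchronize(s[1]);
    cudaStreamDestroy(s[0]);
    cudaStreamDestroy(s[1]);
}
```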
And, yeah, 1 TB of RAM. Cool. And unified pointer support: at run time, the memory generation unit can determine whether a pointer is shared or global… That means not the entire 64 bits are used for global memory.
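What the unified address space buys you in practice is that one device function can take a generic pointer into either space; CUDA also exposes the check via the `__isGlobal`/`__isShared` intrinsics on compute capability 2.0. A toy sketch (the `load`/`demo` names are made up for illustration):

```cuda
// One code path, whether p points to shared or global memory: the
// hardware resolves the space from the generic pointer at run time.
__device__ float load(const float *p) {
    return *p;
}

__global__ void demo(float *g) {
    __shared__ float s[32];
    s[threadIdx.x] = load(&g[threadIdx.x]);  // generic pointer into global
    __syncthreads();
    g[threadIdx.x] = load(&s[threadIdx.x]);  // generic pointer into shared
}
```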
A configurable L1 cache per SM, which can be set up as 16 KB shared memory + 48 KB L1, or vice versa
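The split is selected per kernel through the runtime API, via `cudaFuncSetCacheConfig`. A sketch (`my_kernel` is a hypothetical kernel, not from the talk):

```cuda
#include <cuda_runtime.h>

__global__ void my_kernel(float *data);  // hypothetical kernel

void configure() {
    // 48 KB shared memory + 16 KB L1, for kernels that lean on shared memory
    cudaFuncSetCacheConfig(my_kernel, cudaFuncCachePreferShared);

    // ...or 16 KB shared + 48 KB L1, for cache-heavy kernels:
    // cudaFuncSetCacheConfig(my_kernel, cudaFuncCachePreferL1);
}
```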
Unified L2 support
8x double-precision performance, peaking at 50% of single-precision speed
ECC support from registers to DRAM
and what not…
I am sure they will be pricing these for elephants…
It would be good if NV allowed developers to submit jobs for Fermi and get results, much like what Intel does with http://paralleluniverse.intel.com
The PCI-Express bus supports something like this. He said device 1 to device 2 memory transfer was mostly a software thing, IIRC, i.e. it is on their roadmap to support it at some point. But that roadmap contains a lot of things they still want to add in the future…
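Until such direct GPU-to-GPU copies exist, the "software thing" amounts to staging through host memory yourself. A rough sketch of that workaround (assumes `d_src` lives on device 0 and `d_dst` on device 1, each `bytes` long; pinned memory keeps both DMA legs fast):

```cuda
#include <cuda_runtime.h>

void copy_between_gpus(void *d_dst, const void *d_src, size_t bytes) {
    void *staging;
    cudaMallocHost(&staging, bytes);          // pinned buffer for fast DMA

    cudaSetDevice(0);
    cudaMemcpy(staging, d_src, bytes, cudaMemcpyDeviceToHost);

    cudaSetDevice(1);
    cudaMemcpy(d_dst, staging, bytes, cudaMemcpyHostToDevice);

    cudaFreeHost(staging);
}
```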
Not in the context of Fermi. That was in the context of building a supercomputer with GPUs; he was dwelling on what kinds of issues would need to be sorted out… It's more hypothetical… nothing was promised, no roadmap given… just hypothetical…
I don't think Fermi has that capability.
But well, I see a post above from gshi on that, and the answer looks completely irrelevant… It's possible that I missed something…
Well, nothing else that I remember… but there was talk on how Tesla cards handle memory coalescing better (the reason the profiler always reports 0 un-coalesced accesses). That was good… Tesla figures out the memory segments accessed by a half-warp, makes coalesced accesses to that set of segments, and routes the data correctly to the threads… There's a chance of extra memory being fetched (32 bytes being the minimum segment) but fewer transactions…
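To make the contrast concrete, here's a toy pair of kernels (names made up for illustration). On GT200-class Tesla hardware, the half-warp's addresses get grouped into as few 32/64/128-byte segment transactions as possible, so the first kernel needs roughly one transaction per half-warp while the second touches a separate segment per thread:

```cuda
__global__ void coalesced(float *out, const float *in) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i];              // consecutive threads -> consecutive words
}

__global__ void strided(float *out, const float *in, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = in[i * stride];     // large stride -> one segment per thread
}
```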