I would guess that, with so many cores, such a small cache will not speed up memory transfers, but it can improve memory coherency: successive writes are now seen by other cores in the proper order, and in addition atomic ops are faster on Fermi thanks to the cache. It would be cool to have the ability to configure the L2 cache as ‘global’ shared memory; if we could decide what should be stored in it, maybe it would give more benefit.
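Fermi doesn’t expose the L2 that way, but it does at least let you rebalance the 64 KB of per-SM on-chip memory between L1 and shared memory on a per-kernel basis. A minimal sketch (the kernel name here is made up):

    #include <cuda_runtime.h>

    __global__ void myKernel(float* data) { /* ... */ }   // hypothetical kernel

    void setCachePreference()
    {
        // Prefer 48 KB shared memory / 16 KB L1 for this kernel...
        cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);
        // ...or flip it to 48 KB L1 / 16 KB shared:
        // cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);
    }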
That’s my understanding… yes :( especially when Fermi costs at least 60% more than Tesla.
This is why I am asking this question… to see if I’m missing something here…
My kernel iterates over (medium-size) 30,000 float arrays, each of at least 1500 items, so a medium-size kernel iterates
over at least 30,000 * 1500 * sizeof(float) bytes = ~180 MB - far more than L1 and L2 combined (and I understand that the CPU situation is the same).
I do this at least 30 x 30 times (the grid size).
There is no way it can fit into the cache (any cache, for that matter :) ), so what Fermi gives me are caches that I can’t use, and supposedly more
FLOPs, which I also can’t use since my kernel is memory bound. So I’m left with the 40% increase in memory bandwidth that comes with Fermi,
and this is indeed what I see in my projects.
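For concreteness, here is the back-of-the-envelope check of that footprint (plain host code, numbers from above):

    #include <cstdio>

    int main()
    {
        // One pass over the data set described above (before the 30 x 30
        // grid repetitions):
        unsigned long long bytes = 30000ULL * 1500ULL * sizeof(float);
        printf("%llu bytes (~%.0f MB)\n", bytes, bytes / 1e6);  // 180000000 (~180 MB)
        // Fermi's L2 is 768 KB, so a single pass touches ~230x the L2 capacity.
        return 0;
    }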
Now the question is this:
if most kernels are memory bound and HPC users work with huge datasets that cannot fit into the caches (as in my case), why use Fermi???
Any thoughts are more than welcome… :)
eyal
This seems to be the general theme of Fermi: If your kernel ran with high efficiency on previous generations, then Fermi is a modest improvement in performance. In some cases, the performance/$ metric will be worse than the previous generation until Fermi prices come down. However, if you have some code (or ideas for new code) with portions that are inefficient on older cards, then Fermi might be a huge improvement.
I think that’s why the response from the CUDA community has been mixed. The first thing everyone did (including me) was run their existing code on a Fermi card. For the most part, that wasn’t too exciting. (Although it sounds like it was exciting for hoomd.) Now there’s going to be a 6-month lag while people (again, including me) apply Fermi to new problems, and that will be when we see Fermi really shine.
Why would new code make Fermi shine? How can you make use of Fermi’s new features (especially the L1/L2 caches and the additional cores) for memory-bound new or old algorithms
that have to use huge data sets, which will not fit into the caches and will not be able to take advantage of the additional cores?
Seems to me that, at least for me, most of the future kernels I’ll have to write will be of the same type: just get data as fast as you can from global memory and crunch it - this
is what the GPU was intended for, no?
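Something like this, say (purely illustrative):

    // The "stream it in, crunch it, write it out" pattern I mean:
    // coalesced reads, independent per-element math, nothing ever reused,
    // so no cache of any size helps.
    __global__ void crunch(const float* in, float* out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = sqrtf(in[i]) * 1.5f + 2.0f;  // arbitrary per-element math
    }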
If I’m not mistaken, Mr Anderson spoke about a ~40% boost in hoomd - this is exactly what I get, and it is roughly the increased BW in Fermi.
I might be harsh here, but it seems to me that nVidia might have missed the goal, no? A 40% boost for the majority of the kernels??
I would love to trade L1/L2 for additional BW. Now if it were only me, then that’s my problem.
I really love CUDA and the GPU area, I’m just a bit baffled and disappointed at the performance of the long-awaited Fermi :(
eyal
I think I should have used a different term: new code for new problems. By natural selection, most people currently apply CUDA to problems where it was most efficient: looping over large cache-unfriendly data sets with maximally coalesced reads and large amounts of independent floating point arithmetic on those elements. Other kinds of problems were not well addressed in CUDA, and so no one works on them. If a cache was extremely helpful to your problem, it is quite likely you would be using a CPU and not a GPU already. I think this is why there will be a lag in uptake of Fermi: its biggest win is broadening the kinds of problems that can be efficiently solved with CUDA. However, CUDA developers working on these new problems are by definition few because it would have been silly to try to do it before.
Let me give two examples:
A common problem in my area involves needing to recreate multidimensional histograms from long lists of data (millions of tuples or more, easily). Histogramming is traditionally something that CUDA has been weak at, since it necessarily requires inter-thread coordination of some kind to merge results. Due to the challenge, people have developed reasonable algorithms that use some combination of shared memory, reduction-like techniques, and very limited use of atomics to maximize throughput. These algorithms are necessarily complicated, and embedding them into a larger calculation is annoying.
With Fermi, thanks to the ability of the L2 cache to service atomic operations at an amazing rate, it looks like I can now histogram in the simplest way possible and still get tremendous performance. This will clean up my code, allow me to focus on the other parts, and generally be a win for me as a developer trying to solve a physics problem (as opposed to a developer trying to write a clever CS paper). Clever is good, but simple is better.
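Concretely, something as simple as this (a sketch; the names and binning scheme are illustrative) now looks viable:

    // Simplest possible histogram: one global atomicAdd per input element.
    // Pre-Fermi this throttles badly; on Fermi the L2 services the atomics.
    __global__ void histogram(const float* data, int n,
                              unsigned int* bins, int nbins,
                              float lo, float hi)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        int b = (int)floorf((data[i] - lo) / (hi - lo) * nbins);
        if (b >= 0 && b < nbins)
            atomicAdd(&bins[b], 1u);
    }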
Another problem I am planning to tackle is extending an existing C++ particle propagation library to track photons through a detector using the GPU. Here, the ability of Fermi to run C++ code (and the future full support for C++) will be a huge time saver, and in addition, the L1 and L2 caches will be tremendously beneficial to caching the parameters describing the locations and orientations of all the geometric shapes that compose the detector. It’s basically a ray-tracing problem, and the caches will be very useful there.
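To sketch what I mean (hypothetical code; spheres standing in for the real detector shapes):

    struct Sphere { float cx, cy, cz, r; };

    // Returns the index of the nearest sphere hit by a normalized ray, or -1.
    // Every thread walks the same shape table, so on Fermi those reads should
    // stay resident in L1/L2 instead of hitting DRAM for every photon.
    __device__ int firstHit(const Sphere* spheres, int n,
                            float ox, float oy, float oz,
                            float dx, float dy, float dz)
    {
        int hit = -1;
        float best = 1e30f;
        for (int s = 0; s < n; ++s) {
            float lx = spheres[s].cx - ox;
            float ly = spheres[s].cy - oy;
            float lz = spheres[s].cz - oz;
            float tca = lx * dx + ly * dy + lz * dz;  // ray dir is normalized
            float d2  = lx * lx + ly * ly + lz * lz - tca * tca;
            float r2  = spheres[s].r * spheres[s].r;
            if (d2 > r2) continue;                    // ray misses this sphere
            float t = tca - sqrtf(r2 - d2);           // nearest intersection
            if (t > 0.0f && t < best) { best = t; hit = s; }
        }
        return hit;
    }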
None of my current or previous projects require or even significantly benefit from the new Fermi features, but that’s because I picked problems to optimize that were excellent fits to cards going back to the 8800 GTX: coalesced reads, simple element-wise floating point. (Incidentally, this might be where AMD shines. It sounds like their architecture is highly optimized for this class of calculation at the expense of some flexibility.)