# I have two questions.

First, can I use CUDA with a GeForce 8800 GT chip?

Second, I need to perform calculations on a great deal of float values (100M, and every value needs one division operation) held in main memory (not video memory). If I send them to video memory, calculate with the GPU, and then send the results back to main memory, can this process be faster than calculating on the CPU?

1. Yes, but you’ll need CUDA 1.1.
2. It may be faster, and it may be slower. It depends =) Division is extremely costly on the GPU. The answer depends on how many other calculations you’re going to perform on these floats.
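For a sense of the workload in question, here is a minimal CPU-side sketch (assuming NumPy is available; the array is shrunk from 100M to 10M elements so it runs quickly). The point is that one division per float, done in bulk, is already a memory-bound streaming operation on the CPU:

```python
import numpy as np

# Sketch of the workload: one division per float, in bulk.
# Shrunk from 100M to 10M elements so it runs quickly; timings scale ~10x.
N = 10_000_000
values = np.random.rand(N).astype(np.float32)

result = values / np.float32(3.0)  # 10M divisions, streamed through memory

print(result.dtype, result.shape)
```

Whether the GPU beats this depends almost entirely on how the data gets to and from the card, as the next post works out.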

The cost of the division will be nothing: it will be completely hidden, processed during the memory latency. Your problem with these 100M divisions will be the memory transfers. You can calculate almost exactly how long it will take on a GTX (the bandwidth on the GT will be less than 70 GiB/s; I don’t know how much less):
Transfer to/from card: 100 MiB at 3 GiB/s = 0.0325520833 s per transfer (one transfer each way)
Perform calculation on card: 100 MiB read and 100 MiB written at 70 GiB/s = 0.00279017857 s

Total time = 0.0678943452 s
You can compare that to the time spent on the CPU for the same operation. I would guess that it will be faster on the CPU, or nearly equal. Note that the time spent computing the divisions is only 4% of the total. Thus you can see that keeping everything on the GPU and copying back only rarely is the way to get a lot of performance.
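The arithmetic above can be checked directly; all the figures come from the post (100 MiB of data, 3 GiB/s across PCI-Express, 70 GiB/s to video memory):

```python
# All figures from the post above; times in seconds, sizes in MiB.
data_mib = 100.0           # 100 MiB of floats
pcie_mibs = 3.0 * 1024     # host <-> card transfer rate (3 GiB/s)
vram_mibs = 70.0 * 1024    # on-card memory bandwidth (70 GiB/s)

transfer = data_mib / pcie_mibs          # one transfer (each way)
compute = (2 * data_mib) / vram_mibs     # 100 MiB read + 100 MiB written

total = 2 * transfer + compute           # copy in, compute, copy out

print(f"{transfer:.10f} {compute:.10f} {total:.10f}")
# 0.0325520833 0.0027901786 0.0678943452
print(f"compute share: {compute / total:.1%}")  # compute share: 4.1%
```

The two PCI-Express transfers dominate: the on-card work is about 4% of the total, which is the ratio the post cites.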

Counter-intuitive as it might seem, floating-point division isn’t that expensive: somewhat more than floating-point multiplication. Integer division is the worst thing imaginable performance-wise, though. As the G80 series has no integer division unit, it has to emulate one.

Division is extremely costly on the GPU, but it is costly on the CPU too.
Parallel float operations are faster on the GPU than on the CPU, aren’t they?

Thanks for your helpful answer. So, as far as I understand it, the GPU can’t cooperate very well with the CPU because of the limited bandwidth between main memory and video memory?

Sorry, you’re right. The question was about floating-point division, but I was thinking about integer division when writing :(

It can cooperate with the CPU fine, just as long as you keep the amount of transferred data to an absolute minimum.

By the way, I would like to know how a video card improves drawing speed by “keeping everything on the GPU and copying back only rarely”. I think that if the CPU issues an instruction like “draw line” or “draw circle” and only sends the necessary parameters to the video card, it can achieve that goal. But if we need to draw a bitmap, how does the GPU raise the speed?

And, in my opinion, if we hope to use the GPU as a general computing element, the bandwidth is the bottleneck. The narrow bandwidth divides the process into two phases, and the link between them is very thin. So if we can widen the bandwidth, or even share the identical memory between the GPU and CPU, then the time of general computing on the GPU will come.

By copying the bitmap over to the GPU once and then using it for many, many frames of animation. Many textures in games are used the entire time the game is running; others may only exist in some rooms. Ever noticed a stutter the first time you walk into a different room in a game, with textures you haven’t seen before? That is (in part) due to copying the new bitmaps over to the GPU.

Sharing identical memory would be subject to the same bandwidth limitation. You would still be linking over PCI-Express, and even if you ignore that, the DDR RAM connected to the CPU is extremely slow compared to the RAM on the GPU.

It is true that some algorithms might simply require a lot of CPU<->GPU memory copies; that is unfortunate for them, and they simply won’t map well to the GPU. The big challenge then becomes putting every step of the whole program onto the GPU, thus avoiding CPU<->GPU memory copies except for initialization and for copying results back when needed.

I mean: put the GPU beside the CPU, and if DDR2 main memory is too slow, I think we can use DDR4 or 5 for it instead. The point is that the CPU and GPU could both access the identical memory rapidly. Of course, it would change the computer architecture dramatically, but if it is a great advancement, why don’t we do that?

You are describing something that sounds similar to what AMD/ATI are planning to do with their Fusion CPU in 2009. The idea is to put normal CPU cores and GPU elements onto the same piece of silicon, eliminating the PCI-Express link that separates them now.

But one thing to appreciate is that today’s GPUs have much more bandwidth to video memory than CPUs have to main system memory. If you put a GPU next to the CPU, some GPU programs would run slower, unless you also increased the size and speed of the system memory bus.
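To put rough numbers on that gap (these are illustrative, period-typical figures of my own, not from the post: an 8800 GTX with a 384-bit GDDR3 bus at 1800 MT/s effective, versus dual-channel DDR2-800 system memory):

```python
# Illustrative back-of-envelope figures (assumed, not from the post).

# 8800 GTX video memory: 384-bit bus, GDDR3 at 1800 MT/s effective
gpu_bw = (384 // 8) * 1800e6    # bytes/transfer * transfers/s = 86.4 GB/s

# Dual-channel DDR2-800 system memory: 2 x 64-bit channels at 800 MT/s
cpu_bw = 2 * (64 // 8) * 800e6  # = 12.8 GB/s

print(f"GPU {gpu_bw / 1e9:.1f} GB/s vs CPU {cpu_bw / 1e9:.1f} GB/s, "
      f"ratio {gpu_bw / cpu_bw:.2f}x")
# GPU 86.4 GB/s vs CPU 12.8 GB/s, ratio 6.75x
```

Under those assumptions the discrete card has several times the memory bandwidth of the host, so a streaming workload moved next to the CPU would lose that advantage unless the system bus grew to match.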

This is why the 8800 GTX has been such a huge win for my particular application. I have to loop through hundreds of megabytes of data over and over again, so even a large CPU cache is not very helpful. But the enormous memory bus of the GTX (and 768 MB of video memory) speeds this up quite a bit.