Why is a GPU faster than a CPU in general computing?

I would like to know why a GPU is faster than a CPU in general computing. Does that depend only on its parallelism? I think the speed of a CPU is not limited by the CPU itself but by the speed of accessing main memory, so does the GPU solve this problem well?

You should understand the difference in architecture between a GPU and a CPU; maybe read the CUDA or Cg reference. Note that the GPU may be slower than the CPU if the amount of data to be processed is small.

I hope to discuss this topic closer to its essence. Actually, I hope a comparison between CPU and GPU can support my point of view, which is that the GPU exploits the potential of the program: parallelism, and something people often forget, which I call “access memory nearby”. By that I mean that if we access the nearest memory, or more generally the nearest storage, the access is faster. I think this characteristic is as important as parallelism, and as software and hardware develop, it should get more attention.

We can already see some clues today. With a hard disk, it is well known that we can work much faster if the data we touch is close together on the disk. As far as I know, something similar holds on the video card: an ALU can access its own cache quickly, accessing the caches of nearby ALUs through shared memory takes more time, and the worst case is communicating with an ALU outside its group, which has to go through DRAM. Paying attention to whether memory is nearby forces the programmer to think a lot, but it is undoubtedly a way to improve speed hugely, and I think this characteristic is as unavoidable as parallelism.

Many programs have this potential characteristic and some do not. If a program has it, I think its speed can be raised a lot by a specific hardware architecture like the GPU. So when I come back to the essence of why the GPU is hugely faster than the CPU at some tasks even though their hardware cost is identical, I think it is because the GPU uses the potential of the program, including parallelism and “access memory nearby”.
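To make “access memory nearby” concrete, here is a minimal CUDA sketch (my own illustration, not code from the posts above; the kernel name, tile size, and the assumption that n is a multiple of the block size are all mine). Each block copies its slice of the input from DRAM into on-chip shared memory once, and the repeated neighbour reads then come from the fast nearby storage instead of global memory:

```
// 3-point moving average. Assumes a launch like
//   average3<<<n / TILE, TILE>>>(d_in, d_out, n);
// with n a multiple of TILE, so boundary handling stays simple.
#define TILE 256

__global__ void average3(const float *in, float *out, int n)
{
    __shared__ float tile[TILE + 2];          // block's slice plus one halo cell per side

    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    int lid = threadIdx.x + 1;                // +1 leaves room for the left halo

    tile[lid] = in[gid];                      // one global (DRAM) read per thread
    if (threadIdx.x == 0)
        tile[0] = (gid > 0) ? in[gid - 1] : 0.0f;
    if (threadIdx.x == TILE - 1)
        tile[lid + 1] = (gid + 1 < n) ? in[gid + 1] : 0.0f;
    __syncthreads();                          // make the whole tile visible to the block

    // Three reads per output, all served from nearby shared memory.
    out[gid] = (tile[lid - 1] + tile[lid] + tile[lid + 1]) / 3.0f;
}
```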

You have really lost me here. Why don’t you just post a statement that people can react to/disprove? Here is my statement (mostly based on reading this forum for a month or 3):

The reason a GPU is much faster than a CPU at parallel tasks is that the GPU has many more ALUs. I think it has much less to do with memory accesses (although the fact that aligned shared memory accesses are as fast as registers does help, of course).
Another important point is that thread switching costs (almost?) nothing, since each thread keeps its own registers, whereas on a CPU this is not the case.
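For what it’s worth, here is a minimal sketch of that point (my own example, nothing from the thread): a SAXPY where every element gets its own thread. The hardware keeps its many ALUs busy by switching between resident threads whenever one stalls on memory, and since each thread keeps its own registers that switch costs essentially nothing. Error checking is left out for brevity.

```
#include <cuda_runtime.h>
#include <cstdio>

__global__ void saxpy(int n, float a, const float *x, float *y)
{
    // One tiny task per thread; the scheduler spreads thousands of these
    // across the ALUs and swaps stalled threads for ready ones for free.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

int main()
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *hx = new float[n], *hy = new float[n];
    for (int i = 0; i < n; ++i) { hx[i] = 1.0f; hy[i] = 2.0f; }

    float *dx, *dy;
    cudaMalloc(&dx, bytes);
    cudaMalloc(&dy, bytes);
    cudaMemcpy(dx, hx, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, hy, bytes, cudaMemcpyHostToDevice);

    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, dx, dy);  // 4096 blocks of 256 threads

    cudaMemcpy(hy, dy, bytes, cudaMemcpyDeviceToHost);
    printf("y[0] = %f\n", hy[0]);                      // expect 4.000000
    cudaFree(dx); cudaFree(dy);
    delete[] hx; delete[] hy;
    return 0;
}
```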

That’s just not true.

Yes, there are some algorithms that are limited by available bandwidth, but there are others which are not.

Some algorithms can benefit from running on a GPU, some cannot.

CUDA isn’t magic that you can just include in your project to get a 100x speedup. You have to understand parallel programming and the limitations of the technology to get any benefit from using it.
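One concrete example of such a limitation (my own sketch, not something from this thread): the same copy written with coalesced and with strided global-memory accesses. The arithmetic is identical, yet on most GPUs the strided version runs several times slower because the memory system delivers far fewer useful bytes per transaction; throwing more ALUs at a kernel like this changes nothing.

```
__global__ void copy_coalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];            // neighbouring threads touch neighbouring words
}

__global__ void copy_strided(const float *in, float *out, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n)
        out[i] = in[i];            // each thread jumps far away from its neighbours
}
```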

What I emphasize is that “the GPU uses the potential of the program”, and this potential includes parallelism and “access memory nearby”.

In fact, I am discussing the CPU and GPU from a software point of view. We all know that if we can use more ALUs for a task it will take less time, but how do we actually do that? What if we have 16 ALUs today, 32 tomorrow, and 64 next week? Today we can solve this problem fairly well with tools like OpenMP and MPI, because we have set up an abstract way to “express” parallelism.

But suppose today ALU0 is connected to ALU16 through shared memory, and tomorrow they are connected through a cache. That change would force a large change in the code, even though essentially nothing has changed. The essential fact is simply that an ALU (or, more generally, an arithmetic unit) can connect to and exchange data with the ALUs near it more quickly. If we set up an abstraction like that, we would not need to change the code when the hardware changes. We would just write code that “expresses” this feature (access memory nearby, or connect with the nearby arithmetic units), and this essential feature would not change with the hardware, no matter how many layers the storage hierarchy has or how fast each layer is.
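CUDA’s __shared__ qualifier is arguably already a small version of the abstraction being asked for: it declares “storage near this group of threads” without saying how the hardware builds it. A minimal sketch (my own, assuming a block size of 256 threads): a per-block sum that only ever touches nearby data, and whose source would not need to change if a future chip implemented shared memory with different sizes or latencies.

```
// Per-block sum. Assumes a launch with 256 threads per block, e.g.
//   block_sum<<<numBlocks, 256>>>(d_in, d_partial, n);
__global__ void block_sum(const float *in, float *block_out, int n)
{
    __shared__ float nearby[256];             // "memory near this group of ALUs"

    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    nearby[threadIdx.x] = (gid < n) ? in[gid] : 0.0f;
    __syncthreads();

    // Tree reduction: every step only reads data that is already nearby.
    for (int step = blockDim.x / 2; step > 0; step /= 2) {
        if (threadIdx.x < step)
            nearby[threadIdx.x] += nearby[threadIdx.x + step];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        block_out[blockIdx.x] = nearby[0];    // one partial sum per block
}
```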