Performance gap for a short test code between GPU and CPU

Here is some test code I ran on both a GPU (K80, 823 MHz) and a CPU (i7-6500U, 2.5 GHz). I tested the single-thread case, i.e., the GPU and the CPU each used one thread to run the code. The code is quite simple:

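unsigned long long B = 3880586647ULL;
unsigned long long test = 0;
// 10,000,000 iterations; each iteration needs the previous value of test
for (unsigned long long i = 1234567890ULL; i < 1244567890ULL; i++)
{
	test = test + (B + test) * i;	// unsigned arithmetic wraps modulo 2^64
}
printf("test is: %llu\n", test);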

1, Please ignore the overflow in the calculation. Both the GPU and the CPU output the same result: 18446744069828964969.
2, The GPU took 828 ms and the CPU just 19 ms, so the CPU is about 44 times faster. Even after accounting for the clock frequency difference (about 3x), a gap of roughly 14 times remains.

So I wonder whether a GPU is suitable for this kind of computation. In the code you can see that each iteration depends on the previous value of ‘test’. Why is it so slow on the GPU in this case? Is there any optimization for this kind of code? Thanks.

Additional info:
1, Environment: CUDA 9.0.176 + Visual Studio 2017 Community
2, I confirmed that both programs (.cu for the GPU and .c for the CPU) were built with the ‘X64’ and ‘Release’ options.

A CPU is designed to run tens of threads in parallel. A GPU is designed to run tens of thousands of threads in parallel. There’s no point in running a GPU in a single-thread configuration.

Seriously, what do you expect from using 1 CUDA core out of e.g. 3584 cores on a GTX 1080 Ti?

If “this kind of computing” means running single-threaded, then the answer is a clear no. The GPU is not suitable for running a single thread (or just a few).

The GPU speed gain comes from using all of them at once.

Christian

This looks like it’s intended to be some sort of reduction; if so, then yes, you can parallelize it quite well with CUDA. Check out thrust::reduce (https://thrust.github.io/doc/group__reductions.html).
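A minimal sketch of how the reduction idea could apply to this loop, written in plain C so it can be checked on the host. It assumes the loop body is rewritten as test = (1 + i) * test + B * i, which is an affine map in test; composing such maps is associative, so sub-ranges can be processed independently and combined afterwards. The names affine_t, step, compose, reduce_range and chunked_loop are illustrative only and are not part of Thrust or CUDA.

```c
#include <assert.h>
#include <stdint.h>

/* Affine map x -> a*x + b over 64-bit unsigned integers (wraps modulo 2^64). */
typedef struct { uint64_t a, b; } affine_t;

/* One loop iteration, test = test + (B + test) * i,
   rewritten as test = (1 + i) * test + B * i. */
static affine_t step(uint64_t B, uint64_t i) {
    affine_t f = { 1 + i, B * i };
    return f;
}

/* Apply f first, then g: g(f(x)) = g.a * (f.a * x + f.b) + g.b. */
static affine_t compose(affine_t f, affine_t g) {
    affine_t h = { g.a * f.a, g.a * f.b + g.b };
    return h;
}

/* Reference: the original serial loop over [begin, end). */
uint64_t serial_loop(uint64_t B, uint64_t begin, uint64_t end) {
    uint64_t test = 0;
    for (uint64_t i = begin; i < end; i++)
        test = test + (B + test) * i;
    return test;
}

/* Collapse one sub-range into a single map; sub-ranges do not depend on each other. */
affine_t reduce_range(uint64_t B, uint64_t begin, uint64_t end) {
    affine_t acc = { 1, 0 };                 /* identity map */
    for (uint64_t i = begin; i < end; i++)
        acc = compose(acc, step(B, i));
    return acc;
}

/* Split [begin, end) into `chunks` pieces, as separate threads could,
   then combine the per-chunk maps in order. */
uint64_t chunked_loop(uint64_t B, uint64_t begin, uint64_t end, uint64_t chunks) {
    assert(chunks > 0);
    affine_t total = { 1, 0 };
    uint64_t n = end - begin;
    for (uint64_t c = 0; c < chunks; c++) {
        uint64_t lo = begin + n * c / chunks;
        uint64_t hi = begin + n * (c + 1) / chunks;
        total = compose(total, reduce_range(B, lo, hi));
    }
    return total.b;                          /* total applied to the initial test = 0 */
}
```

Under that rewrite, chunked_loop should return the same value as serial_loop for any positive chunk count. Composition of affine maps is associative but not commutative, so on a GPU the per-chunk maps would have to be combined by a reduction or scan that preserves their order.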

The purpose of the test is to check the computing performance of a single GPU core before I leverage all of the cores.

I found that code like ‘test = test + (B + test) * (i);’, where each result becomes an operand of the next iteration, is complex enough that the compiler cannot optimize the loop away, so the core has to execute the instructions one by one. The weird thing is that the CPU performs much better than the GPU, even though all the instructions are common arithmetic instructions.

The real use case is: push thousands of data items to thousands of threads in parallel, with each thread processing one item using the same logic. For now, I am first evaluating how a single thread on a single core performs.

GPUs are inherently parallel processors.

Do you intend to parallelize that loop?

If the answer is no, then a GPU is not for you.

If the answer is yes, then you should read up some on CUDA :).

I think you’re approaching this problem the wrong way; you’re thinking about single threaded programming, but GPUs are /not/ like CPUs. When you’re programming with CUDA, everything is parallel from the start.

Thanks blelbach. But I don’t intend to parallelize that loop in the example, because the instructions in that loop have dependency. So I think keeping the work in one loop may be faster than splitting it across cores, since splitting involves sharing or copying memory to pass data, which may be slower. That’s my initial thought, and I’m not sure it’s correct.

The actual loop count is smaller (maybe 2048); I increased it to 10,000,000 in the example just to make the execution times easier to compare.

However, I can use each core to execute business requests in parallel, where each request is processed by logic containing such a loop with dependent instructions.
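A rough sketch of that per-request layout in plain C, assuming each request is identified only by a starting counter value and runs the 2048-iteration loop mentioned above. The names process_request and process_all, and the starts/results arrays, are hypothetical. The outer loop stands in for the GPU thread index: each index reads and writes only its own slot, so the requests are independent even though each inner loop is serial.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-request work: the dependent loop from the question,
   run for 2048 iterations starting at `start`. */
uint64_t process_request(uint64_t B, uint64_t start) {
    uint64_t test = 0;
    for (uint64_t i = start; i < start + 2048; i++)
        test = test + (B + test) * i;   /* serial within one request */
    return test;
}

/* Index r touches only starts[r] and results[r], so the iterations of this
   outer loop are independent of each other. In a GPU kernel, r would come
   from the thread index instead of a loop counter. */
void process_all(uint64_t B, const uint64_t *starts, uint64_t *results, size_t n) {
    assert(starts != NULL && results != NULL);
    for (size_t r = 0; r < n; r++)
        results[r] = process_request(B, starts[r]);
}
```

With this shape, the dependency stays inside one thread and no data has to be passed between threads while a request is being processed.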

Looking at the numbers in the example: the loop count is 10M. Assume the logic takes around 10 instructions per iteration and, ideally, 1 instruction costs 1 clock cycle, so in total it costs 100M clock cycles. The core frequency is 823 MHz, so there seems to be room to improve the performance; that is what I am concerned about. If there is, that would be really exciting.
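The estimate above can be worked through with two small helpers. This is only a back-of-the-envelope sketch: it uses the nominal clock rates and timings quoted in this thread, and the function names are made up for illustration.

```c
#include <assert.h>

/* Estimated run time in milliseconds for `iterations` loop trips of
   `instr_per_iter` instructions each, assuming one instruction per clock cycle. */
double estimated_ms(double iterations, double instr_per_iter, double clock_hz) {
    assert(clock_hz > 0.0);
    return iterations * instr_per_iter / clock_hz * 1000.0;
}

/* Clock cycles spent per loop iteration, derived from a measured run time. */
double cycles_per_iteration(double measured_ms, double clock_hz, double iterations) {
    assert(iterations > 0.0);
    return measured_ms / 1000.0 * clock_hz / iterations;
}
```

With the numbers from this thread, estimated_ms(1e7, 10, 823e6) gives about 121.5 ms for the idealized case, while cycles_per_iteration(828, 823e6, 1e7) gives about 68 cycles per iteration for the measured GPU run and cycles_per_iteration(19, 2.5e9, 1e7) gives about 4.75 for the CPU run.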

But I don’t intend to parallelize that loop in the example, because the instructions in that loop have dependency.

Dependency on what? Your local variable “test” or the loop counter? Problems like this loop are exactly what reduce operations are designed to solve. Even if you don’t use the thrust library, a loop such as this can be easily converted to parallel using an atomic add operation.

As others have said, it makes no sense to try to compare a single-threaded GPU program with a single-threaded CPU program. Instead of ignoring the advice others have given and insisting that your comparison is valid, read a little about GPU programming so you understand the programming model.