Why is my single thread GPU speed 1000x faster than my CPU?

I’m writing an application in which most of the time it will be running single threaded, but sometimes I will want to run it multithreaded on the GPU.

I’m doing some benchmarks to find out how much slower single thread performance is on my GTX 1070 vs my i5 2500k.

I’m finding that my single threaded GPU test is as fast as the GPU multithreaded test. I’m also finding that the CPU single thread version is much slower than the GPU single threaded version. This is the opposite of what I expected, so I’m hoping someone can find my mistake.

I am calling the single thread GPU function with:

auto start_time = std::chrono::high_resolution_clock::now();

gpuSingleThreadVectorAdd<<<1, 1>>>(d_A, d_B, d_C, numElements);

auto end_time = std::chrono::high_resolution_clock::now();
auto time = end_time - start_time;

From what I understand, the <<<1, 1>>> should force it to run in a single thread, although I could be wrong and that may be my mistake.

The functions being tested are:

__global__ void
gpuSingleThreadVectorAdd(const float *A, const float *B, float *C, int numElements)
{
	for (int i = 0; i < numElements; i++)
	{
		C[i] = A[i] + B[i];
		for (int x = 0; x<100; x++)
			C[i] = sqrt(C[i]);
	}
}

void cpuVectorAdd(const float *A, const float *B, float *C, int numElements)
{
	for (int i = 0; i < numElements; i++)
	{
		C[i] = A[i] + B[i];
		for (int x = 0; x<100; x++)
			C[i] = sqrt(C[i]);
	}
}

The timing results are:

gpu single threaded time: 20 to run.
cpu single threaded time: 33120 to run.

What am I doing wrong, is there a way to double check that it really is running as a single thread on the GPU?

Some more ideas on this:

1.) I wasn’t sure whether the computation had actually completed by the time the gpuSingleThreadVectorAdd() call returned, or whether it runs asynchronously until you sync it back up somehow.

When I include the final cudaMemcpy() in the time measurement, it gives a time about 10x the CPU time, which is what I expected. However, I don’t want to include the bandwidth delay of copying data over PCIe; I am looking for a test that gives just the compute time.

2.) When I increase the sqrt loop from:

for (int x = 0; x<100; x++) C[i] = sqrt(C[i]);

To:

for (int x = 0; x<100*100; x++) C[i] = sqrt(C[i]);

It freezes the entire system and crashes the program. When it comes back there’s a pop up box that says the display driver crashed and recovered.

Is there a limit to how much time a single GPU thread can take? How can I keep it from crashing like that when I have a lot to do in a thread?

Yes, GPU kernel launches are asynchronous. If you want to time just the kernel execution using host timing constructs, include cudaDeviceSynchronize(); immediately after the kernel launch, inside your timing region.
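
For illustration, here is roughly what the timing region from the original snippet could look like with that change (same variable names assumed, headers such as <chrono> and <cuda_runtime.h> assumed to be included):

auto start_time = std::chrono::high_resolution_clock::now();

gpuSingleThreadVectorAdd<<<1, 1>>>(d_A, d_B, d_C, numElements);
cudaDeviceSynchronize(); // block the host until the kernel has actually finished

auto end_time = std::chrono::high_resolution_clock::now();
auto time = end_time - start_time; // now covers the kernel execution, not just the launch

Alternatively, CUDA events (cudaEventRecord() before and after the kernel, then cudaEventSynchronize() and cudaEventElapsedTime()) measure time on the device itself and exclude the PCIe copies.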

Yes, GPUs under Windows that are in WDDM mode are subject to the WDDM TDR mechanism, which limits GPU kernels to approximately 2 seconds. This topic is covered in lots of places, so googling “cuda wddm tdr” will get you lots of information, and there is a sticky forum thread on it:

https://devtalk.nvidia.com/default/topic/459869/cuda-programming-and-performance/-quot-display-driver-stopped-responding-and-has-recovered-quot-wddm-timeout-detection-and-recovery-/

Thanks very much, that clears things up considerably.

From reading about WDDM, is there any other way around the problem other than to process the data in small units and cudaDeviceSynchronize() after each chunk? Such as:

int zLoop = 100;

for (int z = 0; z < zLoop; z++)
  {
  gpuSingleThreadVectorAdd<<<1, 1>>>(d_A, d_B, d_C, numElements);

  cudaDeviceSynchronize();
  }

It seems like it’s a matter of luck how long a batch might take and if it triggers the timeout.

Is the correct way to handle this for each thread to use something like:

clock_t start_time = clock();

…to tell if it’s in danger of triggering the timeout and bailing?

Inserting a call to cudaStreamQuery(0) into the loop has the side effect of immediately launching the kernels batched up so far. As the timeout applies to the entire batch, this prevents the batch duration from growing out of control as long as the run time of individual kernels is sufficiently short:

int zLoop = 100;

for (int z = 0; z < zLoop; z++)
{
  size_t offset = z * numElements;
  gpuSingleThreadVectorAdd<<<1, 1>>>(d_A + offset, d_B + offset, d_C + offset, numElements);

  cudaStreamQuery(0);
}

(Apart from the fact that you would not normally run a single-thread kernel, particularly not in a tight loop like this - use multiple threads and blocks, e.g. along the lines of the sketch below.)
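
For reference, a minimal sketch (not the poster’s code; the kernel name and launch configuration are illustrative) of what a multi-threaded version of the same work could look like, using a grid-stride loop so that any <<<blocksPerGrid, threadsPerBlock>>> configuration covers all elements:

__global__ void
gpuVectorAdd(const float *A, const float *B, float *C, int numElements)
{
    // each thread handles elements i, i + stride, i + 2*stride, ...
    int stride = blockDim.x * gridDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < numElements; i += stride)
    {
        C[i] = A[i] + B[i];
        for (int x = 0; x < 100; x++)
            C[i] = sqrtf(C[i]);   // sqrtf: single-precision square root
    }
}

// launched with many threads, e.g.:
// int threadsPerBlock = 256;
// int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;
// gpuVectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);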

The usual approach for limiting the runtime of individual kernels is to design them for a runtime of tenths of a second on the slowest devices, so that there is a safety gap of an order of magnitude before the kernel is killed by the timeout. This also keeps the GUI somewhat responsive during periods of heavy CUDA activity.

Controlling the runtime of individual threads with clock() as you have shown would be appropriate for more advanced techniques such as persistent threads, although those would likely be used on dedicated (non-display) GPUs anyway.
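
For what it’s worth, here is a sketch of the clock()-based bailout idea in device code; names like boundedVectorAdd, maxClocks, and nextIndex are made up for illustration, and clock64() is used to avoid the 32-bit wraparound of clock():

__global__ void boundedVectorAdd(const float *A, const float *B, float *C,
                                 int numElements, long long maxClocks, int *nextIndex)
{
    long long start = clock64();
    for (int i = *nextIndex; i < numElements; i++)
    {
        C[i] = A[i] + B[i];
        for (int x = 0; x < 100; x++)
            C[i] = sqrtf(C[i]);
        if (clock64() - start > maxClocks)   // cycle budget nearly used up: stop early
        {
            *nextIndex = i + 1;              // remember where to resume on the next launch
            return;
        }
    }
    *nextIndex = numElements;                // all done
}

The host would set maxClocks to a small fraction of the TDR limit (e.g. roughly 0.1 s worth of cycles, derived from cudaDeviceProp::clockRate, which is reported in kHz) and relaunch until nextIndex reaches numElements.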

Thanks, that helps a great deal. In experimenting with cudaStreamQuery(0), it looks like it runs the batches asynchronously? For the simulation I’m programming, the result at time(x) depends on the values computed at time(x-1), so it all has to be run sequentially.

I can only use a multithreaded setup when I want to optimize the results across various combinations of simulation inputs; each individual simulation has to be run single threaded. The optimization process will only happen rarely; 90% of the time it will just be running a single thread.

So, is it correct that I would actually use cudaDeviceSynchronize() to ensure things are not computed out of order?

It seems like, if I could use cudaStreamQuery(0), then I would be better off using the full <<<blocksPerGrid, threadsPerBlock>>>?

CUDA kernels within the same stream execute in order relative to each other. Replacing cudaDeviceSynchronize() with cudaStreamQuery(0) makes them run asynchronously with respect to the host.
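
Sketched with a hypothetical stepKernel that advances the simulation state by one time step (d_state and numSteps are placeholder names):

// kernels issued into the same (default) stream execute on the device in issue order,
// so each step sees the results of the previous step without any host-side sync
for (int step = 0; step < numSteps; step++)
{
    stepKernel<<<1, 1>>>(d_state, step);  // hypothetical kernel: state(t) -> state(t+1)
    cudaStreamQuery(0);                   // optional: pushes the batched-up launches to the GPU (WDDM)
}
cudaDeviceSynchronize();                  // wait once at the very end, e.g. before copying results back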

It is not quite clear to me yet what you are after. Generally, single threaded workloads are best run on the CPU. The only reason I can immediately think of for running them on the GPU is if their input data is already located on the GPU, or their output will be required on the GPU, so that this saves data transfers across PCIe.

Theoretically, GPU kernels can also take advantage of the higher memory bandwidth compared to the CPU. In practice, though, it’s highly unlikely that a single threaded kernel would be able to exploit this.

Can you explain a bit more what advantage you are intending to gain by running on the GPU?

Thanks, this is for a trading system, so the idea is that as real time trading data comes in, the system operates in single thread mode.

The rules for the system would be optimized from time to time, in which case it’s useful to try many different combinations of input parameters to see what works the best, or most reliably.

Once an acceptable set of parameters is found, the real time trading system uses those parameters to trade in single threaded mode.

By “single thread”, are you referring to a single GPU thread? If so, I don’t see how the GPU provides any benefits for a real-time trading system which is presumably sensitive to latency. By the time incoming data arrives at the GPU, you could have already computed a result on the CPU.

Amdahl’s law would indicate that any serial portion of a workload is best assigned to a CPU with high single-thread performance, for example a 4.0 GHz (4.5 GHz boost clock) Kaby Lake CPU that was just released a few days ago. GPUs are best suited for throughput-oriented, latency tolerant, parallel processing with tens of thousands of threads.
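
For reference (added for clarity, not part of the original reply), Amdahl’s law with serial fraction s and N processors:

speedup(N) = 1 / (s + (1 - s) / N), which approaches 1 / s as N grows

So even a small serial fraction caps the overall speedup, which is why the serial part belongs on the fastest available single core.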

While I am very much in favor of exploring the use of GPUs using unconventional approaches, I think it is also important to use the best tool for a job, which in case of a serial, latency-sensitive, workload would be the CPU.

And what benefit are you expecting from running the single-threaded part on the GPU, rather than the CPU?

The timeout problem is indeed an annoying one. The good news is that Pascal devices have a significant new hardware feature (preemptive scheduling) which will remove the need for the GUI watchdog killer and allow kernels of any length to run transparently without interrupting the OS windowing system. The feature is touted in detail in both Pascal whitepapers.

I use the future tense “will remove” instead of “removes” because this hardware ability is not yet activated by NVidia’s driver software.

I bring this up in this apropos thread mostly to remind developers that the watchdogs will hopefully be retired soon. Of course it’s been over 6 months since Pascal was launched and the software side hasn’t been released yet but we’re all hopeful.

Is this really a question of updating NVIDIA’s drivers?

Since the GUI watchdog timer is a feature of all three operating systems supported by CUDA, isn’t this rather a question of updating the operating system to query the graphics driver to see whether use of a watchdog timer is necessary, and disabling its watchdog timer when the driver reports that fine-grained pre-emption is supported for a given GPU?

I would speculate that none of the three operating systems currently define a way to report the preemption capability, and therefore OS watchdog timers remain in force even for GPUs that theoretically no longer need it.

That’s a very good question and of course I don’t know so I’m just making educated guesses.

In Linux there is no WDDM abstraction, so I would expect NVidia’s display driver to be in total control. The watchdog timeout killer can in fact be disabled in Linux by an xorg.conf setting, which leads me to believe the scheduler for both display and compute is inside the NVidia driver and not the OS.

These are good points, but the type of trading I’m doing is not very sensitive to latency. In fact, I specifically avoid trading when the market is moving too quickly. Network latencies are orders of magnitude larger than the difference in latency between GPU and CPU computing, and the amount of computation needed to update the trading state from a single new tick’s worth of trading information is infinitesimal.

Mainly that I would only have to write my code once. I would like not to have to maintain a CPU version and a GPU version. PyCUDA might be a way to do that. Everything I have now is in C#, so if I can find a CUDA binding for C# or Java that makes it easy to do the “Scripting: Interpreted, not Compiled” type of programming that PyCUDA has, the port would be easier.

From my testing it looks like my GPU is about 12x slower single threaded than my CPU, when I split the work into chunks that do not cause my display to become nonresponsive. It might be easier to just keep everything in C# and do large scale optimizations on EC2 or something similar.

I can see a couple of reasons for using CUDA, all of them performance-related, but ease of coding is not amongst them. :)
It is not particularly difficult either, but it will never be simpler than coding on the host.

CUDA wins hands down in a number of scenarios in the category “coding effort required to reach a certain performance”, but if you don’t care about performance at all the host will always win. For one, all input/output has to be transferred from the host to the GPU and back. Similarly, any operating system calls have to be made on the host.

The CUDA C way of using the same code for both host (CPU) and device (GPU) is to put it into functions or methods declared as __host__ __device__.
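
A minimal sketch of that pattern (function and variable names are made up for illustration):

#include <math.h>

// compiled for both the CPU and the GPU
__host__ __device__ float addAndDamp(float a, float b)
{
    float c = a + b;
    for (int x = 0; x < 100; x++)
        c = sqrtf(c);
    return c;
}

// callable from device code ...
__global__ void gpuKernel(const float *A, const float *B, float *C, int numElements)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < numElements)
        C[i] = addAndDamp(A[i], B[i]);
}

// ... and from ordinary host code
void cpuLoop(const float *A, const float *B, float *C, int numElements)
{
    for (int i = 0; i < numElements; i++)
        C[i] = addAndDamp(A[i], B[i]);
}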

I prefer CUDA C to the newer language bindings that have been emerging recently because it is mature. Your mileage may vary, of course.