Hello, I am currently developing a GPU app. However, my GPU is slower than my CPU. What could be the problem? These are the specs of my comp and project environment:
It’s actually not that hard for a GPU to be a lot slower than a CPU.
A lot of what makes a GPU faster than a CPU depends on things like the size of the data you’re working on and how computationally intense the code is. Small data with few calculations is a poor fit for a GPU, for example. CPUs aren’t as slow as we’d like to think, so stuff like this does happen.
Hello, MutantJohn. Thanks for your reply
I’m working on a BVH CUDA ray tracing project. I use 200 blocks and 200 threads per block. The calculations include checking intersections and shadows using the BVH and calculating colors (reflections and refractions too). I think those are quite a lot of computations…
Any other opinions, please? The difference is 20 seconds for a teapot obj… (with and without CUDA)
First things first, make sure you’re compiling with the proper optimization flags. No -G or -O0 or anything like that. Balls to the wall -O2 or -O3.
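For example, the difference is just in the build line, something like this (app.cu is just a placeholder name):
// Debug build: -G disables device-code optimization, don't benchmark with this:
//   nvcc -G -O0 app.cu -o app_debug
// Optimized build, the one you should actually be timing:
//   nvcc -O3 app.cu -o app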
Next up is to profile your code itself. It’s near impossible to look at source code and go, “Oh hey, that’s a bottleneck!”
Okay, it is possible to do that! Some things are so obvious! But not all things are. So you’ll need to use nvprof or nvvp to find the slowest kernel invocations, and then you can figure out why those kernel invocations are slow.
Code where you need a big number of long-running threads, such as 1000 threads that all run for a long time, is a fiction. During my 30 years of experience I never had such a task. Real life is when you need a large number of short threads with frequent synchronization, and you need to sync not 100 times but 1,000,000 times. Those are the tasks I have always had, and I never had to code anything different.

I ran a test which showed me that CUDA is fake, or useless: it can’t do what I need, and it performs well on the tasks that I don’t need and never needed. The test is: make one vector of 1000 elements, make another vector of 200 elements, compute the cross-correlation (it can be done by 800 concurrent threads), then sync all threads, re-sort the 200-element vector, and repeat one million times. CUDA time: 24 seconds. A single-CPU application, no threading, all serial operations: 14 seconds. It is not just a failure, it is a miserable failure.

So CUDA solved only one problem, which is large matrix multiplication, like 10,000 by 10,000. I spent decades in data science and never faced such a task. I always need a concurrent/serial interleave. Stop doing fiction, start working on real problems.
// divisor_cross_single_launch.cu
// Compile: nvcc -O3 divisor_cross_single_launch.cu -o cross_test
// Run: ./cross_test
//
// WARNING: This kernel runs for maxEpochs iterations entirely on the device.
// On Windows laptop GPUs the OS driver may kill long-running kernels (TDR).
// For large maxEpochs, run on a headless Linux GPU or adjust TDR settings.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>
static inline void cudaCheck(cudaError_t e, const char* file, int line) {
if (e != cudaSuccess) {
fprintf(stderr, "CUDA error %s:%d: %s\n", file, line, cudaGetErrorString(e));
exit(1);
}
}
#define CUDA_CHECK(x) cudaCheck((x), __FILE__, __LINE__)
const int nLongDataSize = 1000;
const int nShortDataSize = 200;
const int maxEpochs = 1000000;
const int PRINT_EVERY = 100000;
// Derived
const int N_OFFSETS = nLongDataSize - nShortDataSize; // number of positions to evaluate
// Simple xorshift32 RNG for device (used by thread 0 to shuffle)
__device__ unsigned int xorshift32(unsigned int& state) {
unsigned int x = state;
x ^= x << 13;
x ^= x >> 17;
x ^= x << 5;
return state = x ? x : 123456789u;
}
// Single-block kernel: each thread handles one offset index
// Uses shared memory to store per-thread stepMax values and do the reduction
__global__ void persistent_cross_kernel(const int* d_long, int* d_short,
                                        int nLong, int nShort, int epochs,
                                        int* d_out_epochMax) {
// single block only
const int tid = threadIdx.x;
const int nOffsets = nLong - nShort;
extern __shared__ int s_counts[]; // size = nOffsets * sizeof(int), provided at launch
// local registers
int local_best = 0;
// simple PRNG state for thread 0
unsigned int rng_state = 123456789u;
// bounds check: if more threads provided than offsets, idle threads just sync
bool active = (tid < nOffsets);
for (int ep = 0; ep < epochs; ++ep) {
// 1) compute stepMax for this thread's offset
int stepMax = 0;
if (active) {
int base = tid;
// compute inner dot-product: sum long[base + j] * short[j]
// Unrolling not done here to keep clarity
for (int j = 0; j < nShort; ++j) {
stepMax += d_long[base + j] * d_short[j];
}
}
// write per-thread results to shared memory
if (active) s_counts[tid] = stepMax;
// ensure all writes visible
__syncthreads();
// 2) reduction: find max across s_counts -> let thread 0 do serial reduction (nOffsets up to ~1024 ok)
int crossMax = 0;
if (tid == 0) {
// serial scan of shared memory (fast in shared mem)
int tmp = 0;
for (int k = 0; k < nOffsets; ++k) {
int v = s_counts[k];
if (v > tmp) tmp = v;
}
crossMax = tmp;
// update epoch-level max (in register)
if (crossMax > local_best) local_best = crossMax;
}
// make sure thread 0 has computed local_best before continuing
__syncthreads();
// 3) serial shuffle of d_short performed by thread 0 (Fisher-Yates)
if (tid == 0) {
// simple Fisher-Yates shuffle using xorshift rng
// Note: this is serial O(nShort) on device, but executes on GPU avoiding host launch
for (int j = nShort - 1; j > 0; --j) {
unsigned int r = xorshift32(rng_state);
int idx = (int)(r % (unsigned int)(j + 1));
// swap d_short[j] and d_short[idx]
int tmp = d_short[j];
d_short[j] = d_short[idx];
d_short[idx] = tmp;
}
}
// ensure shuffle and local_best updated before next epoch
__syncthreads();
}
// After all epochs, write result from thread 0 to global memory
if (tid == 0) {
d_out_epochMax[0] = local_best;
}
}
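For completeness, the host side wasn’t included above; a minimal launcher for the kernel as posted could look roughly like this (the main function, the rand()-based initialization, and the h_/d_ variable names are mine, not from the original test):
int main() {
    // Host-side test data: small random integers (the original test does not
    // say how the vectors are filled, so this is just a placeholder)
    int h_long[nLongDataSize], h_short[nShortDataSize];
    for (int i = 0; i < nLongDataSize; ++i)  h_long[i]  = rand() % 10;
    for (int i = 0; i < nShortDataSize; ++i) h_short[i] = rand() % 10;

    int *d_long, *d_short, *d_out;
    CUDA_CHECK(cudaMalloc(&d_long,  nLongDataSize  * sizeof(int)));
    CUDA_CHECK(cudaMalloc(&d_short, nShortDataSize * sizeof(int)));
    CUDA_CHECK(cudaMalloc(&d_out,   sizeof(int)));
    CUDA_CHECK(cudaMemcpy(d_long,  h_long,  nLongDataSize  * sizeof(int), cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(d_short, h_short, nShortDataSize * sizeof(int), cudaMemcpyHostToDevice));

    // Single block of N_OFFSETS threads, dynamic shared memory for per-offset results.
    // (PRINT_EVERY is unused in this single-launch version.)
    persistent_cross_kernel<<<1, N_OFFSETS, N_OFFSETS * sizeof(int)>>>(
        d_long, d_short, nLongDataSize, nShortDataSize, maxEpochs, d_out);
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaDeviceSynchronize());

    int result = 0;
    CUDA_CHECK(cudaMemcpy(&result, d_out, sizeof(int), cudaMemcpyDeviceToHost));
    printf("largest cross-correlation seen over %d epochs: %d\n", maxEpochs, result);

    cudaFree(d_long); cudaFree(d_short); cudaFree(d_out);
    return 0;
}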
That is certainly a very small problem size for a GPU. Even over a decade ago when I was teaching CUDA, I would give a general rule-of-thumb number of 10,000 threads minimum, for an interesting problem. Nowadays, many/most modern GPUs will prefer larger thread complements than that.
A serial scan in shared memory is generally not fast compared to parallel approaches. A parallel max-finding reduction is not difficult to code. It may or may not impact overall performance much; I haven’t studied your code in any detail.
So 2 of the 3 steps in your code are coded in an entirely serial fashion. A single GPU thread will not be faster than equivalent CPU computation.
As already mentioned, using only a single thread block is not how to get performance out of a GPU.
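To expand on the reduction point above: the serial thread-0 scan could be replaced with a tree-style max reduction in shared memory, roughly like the sketch below. This is a sketch, not drop-in code for your kernel; blockMax and SMEM_PADDED are names I’m introducing here, and it assumes the shared array is padded up to a power of two (e.g. 1024 entries for 800 offsets) and that the values are non-negative, as in your test.
// Sketch: block-wide max reduction to replace the serial scan done by thread 0.
// Assumes the dynamic shared array holds SMEM_PADDED ints (next power of two
// >= nOffsets) and that all values are >= 0, as in the posted test.
#define SMEM_PADDED 1024

__device__ int blockMax(int* s_vals, int tid, int nOffsets, int myVal)
{
    // each active thread deposits its value; idle tail entries get 0
    if (tid < nOffsets) s_vals[tid] = myVal;
    for (int i = nOffsets + tid; i < SMEM_PADDED; i += blockDim.x)
        s_vals[i] = 0;
    __syncthreads();

    // classic tree reduction: halve the active range each step
    for (int s = SMEM_PADDED / 2; s > 0; s >>= 1) {
        if (tid < s) s_vals[tid] = max(s_vals[tid], s_vals[tid + s]);
        __syncthreads();
    }
    return s_vals[0];  // valid for every thread after the final sync
}
In your kernel you would launch with SMEM_PADDED * sizeof(int) bytes of dynamic shared memory and call something like crossMax = blockMax(s_counts, tid, nOffsets, stepMax); every thread then has the epoch maximum rather than only thread 0. Whether that moves the needle here I can’t say without profiling, but it shows the pattern.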
I have 2 suggestions.
If you want to code this yourself, then there may be some CUDA basics that could be useful. This online tutorial series covers a number of topics for a basic understanding of CUDA programming (including how to code parallel reductions, as mentioned above).
Since part of your inquiry here seems to be questioning the usefulness/applicability of GPUs in general, perhaps you might want to start with an already-coded library implementation of a cross-correlation. Nowadays, with the beauty of AI-driven search, you can get a pretty good start with a google search, e.g. on “cupy cross correlation”. CuPy is a GPU-accelerated library that provides a variety of signal-processing functionality (in cupyx.scipy.signal), and the google search will also point out that a corresponding CPU baseline could be done using scipy.signal.
Even if you did suggestion 2 above, you will probably find that scipy is faster below some problem size. With a side-by-side comparison, you could then explore if/when the GPU variant becomes faster (at which problem size). You might then have more confidence that certain problems are solvable using GPUs in an interesting way.
None of this directly addresses your domain of problems. If all the problems you are working on are small enough to be uninteresting on a GPU, then that is certainly a good reason not to invest in GPU acceleration. Assessing that is really the starting point for any GPU journey.
And there is no doubt that fast matrix-matrix multiply is an important factor in modern GPUs.
Here is an example of what I had in mind (suggestion 2 above). It took me less than 1 hour to create in colab, using AI-provided snippets like I suggested earlier:
import cupy as cp
import numpy as np
import scipy.signal as sp
arr_size1 = 100000
arr_size2 = 200
# Create two 1D CuPy arrays
a = cp.ones(arr_size1,np.float32)
v = cp.ones(arr_size2,np.float32)
# Perform cross-correlation with 'full' mode
result_full = cp.correlate(a, v, mode='full')
print("Full mode result:")
print(result_full)
a_sp = np.ones(arr_size1,dtype=np.float32)
v_sp = np.ones(arr_size2,dtype=np.float32)
sresult_full = sp.correlate(a_sp,v_sp, mode='full')
print(sresult_full)
%%time
result_full = cp.correlate(a, v, mode='full')
cp.cuda.runtime.deviceSynchronize()
%%time
sresult_full = sp.correlate(a_sp,v_sp, mode='full')
When I run that in colab with a T4 GPU, and an arr_size1 of 1000, I do indeed observe that the CPU is faster than the GPU. The cupy variant time is 1.11ms and the scipy variant time is ~550us. When I change arr_size1 to 100000 (100x larger), the reported cupy variant time is 1.7ms whereas the reported scipy variant time is 5ms.
You should be able to duplicate that on colab if you wish. Be sure to select the available T4 GPU on the free instance type in the Runtime menu.
The above example does not time the cost of data transfer to/from the CPU (for the cupy variant). Including that would certainly make the comparison worse for the GPU, although there may still be a size breakpoint above which the GPU is faster, or maybe not. It didn’t appear to me from your overall problem description that it would be necessary to do that; you appear to be doing repeated steps on data that is mostly already resident on the GPU, but of course I may have interpreted things incorrectly.