My GPU is always slower than the CPU

Hello everybody!

In my initial triangle-triangle collision tester the GPU always performed better than the CPU, usually several times faster (when using more than 1000000 triangles). Then I decided to optimize the algorithm by testing only those triangles whose bounding spheres intersect. This made the calculation ~10x faster. However, now the CPU always performs faster than the GPU. For the sake of an example, I'm including parts of my code below.

The following code is executed on the GPU:

__global__ void kernel(const triangle_t *ref, const triangle_t *triArr, int arrSz,
						uint8_t *resArr) {
	// If the config is set to compute all intersections, the if block always executes;
	// otherwise the computation is skipped once an intersection has been found
	if((devFlags & COMP_ALL_ISECTS) || !devIsect) {
		// Calculate the index of the triangle this thread tests for intersection
		int i = blockIdx.x * blockDim.x + threadIdx.x;

		// Control that we're not out of the array
		if(i < arrSz) {
			resArr[i] = 0; // Default value = no intersection

			if(DoSpheresIntersect(*ref, triArr[i])) // If the spheres are colliding...
				// ... the triangles might be colliding too
				if(DoTrisIntersect(*ref, triArr[i])) {
					resArr[i] = 1;
					devIsect = 1; // Signal the other threads that an intersection was found
				}
		}
	}
}

Almost the same function on the CPU:

int8_t TriTriIsectDetCpu(triangle_t ref, triangle_t triArr[], int arrSz,
	uint8_t resArr[]) {
	uint8_t x = 0; // To remember an intersection occurred, default = 0

	for(int i = 0; i < arrSz; i++) {
		resArr[i] = 0; // Default value = no intersection

		if(DoSpheresIntersect(ref, triArr[i])) // If the spheres are colliding...
			// ... the triangles might be colliding too
			if(DoTrisIntersect(ref, triArr[i])) {
				resArr[i] = 1;
				x = 1;

				if(!(flags & COMP_ALL_ISECTS))
					break; // Stop at the first intersection unless all were requested
			}
	}

	return x;
}

__host__ __device__ uint8_t DoSpheresIntersect(triangle_t T0, triangle_t T1) {
	float radSum = T0.rad + T1.rad; // Spheres intersect if center distance < sum of radii
	float centerDist;

	centerDist = sqrtf( ((T0.cx - T1.cx) * (T0.cx - T1.cx)) +
				((T0.cy - T1.cy) * (T0.cy - T1.cy)) +
				((T0.cz - T1.cz) * (T0.cz - T1.cz)) );

	if(centerDist < radSum) return 1;
	else return 0;
}

I can include more code if you wish. I’m urgently asking for your help.


Doesn't this spell a race condition on the GPU side?

If so, the CPU already has the advantage that it can start faster; it would equally have the advantage that it can stop sooner, particularly when a true condition occurs early on, since it is up to chance whether the GPU would stop on the same condition.

Out of curiosity, is your timer measuring just the specific code blocks or the entire program? Host-to-device transfers might explain the discrepancy. Otherwise, I'd say that you should run your code through a profiler. I think nvvp comes with CUDA automatically, right?

The timer is part of the program. Maybe the problem is that after the optimization the GPU only checks the possibly colliding triangles, which are a lot fewer than 1000000, whereas the GPU algorithm had already turned out to be ineffective for fewer than 1000000 triangles.

I think before putting this algorithm onto the GPU, you may want to consider algorithmic improvements.

Typically, smart intersection algorithms first subdivide your huge triangle list into smaller subsets that have a chance of intersecting each other (using a broad-phase collision/intersection algorithm); those subsets are then checked more thoroughly for intersections (narrow phase).

The main idea is to lower complexity by reducing the number of triangle pairs that need to be checked against each other.
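To make the broad-phase idea concrete, here is a minimal host-side sketch (plain C++; all names are made up, and uniform-grid binning is only one of several broad-phase choices). Bounding spheres are bucketed by the grid cell containing their center, so only triangles sharing a cell become candidate pairs for the expensive narrow-phase test:

```cpp
#include <cassert>
#include <cmath>
#include <map>
#include <vector>

struct sphere_t { float cx, cy, cz, rad; }; // hypothetical bounding-sphere data

// Hash of the grid cell containing a point, for a given cell size
static long long CellKey(float x, float y, float z, float cell) {
	long long ix = (long long)std::floor(x / cell);
	long long iy = (long long)std::floor(y / cell);
	long long iz = (long long)std::floor(z / cell);
	return (ix * 73856093LL) ^ (iy * 19349663LL) ^ (iz * 83492791LL); // simple spatial hash
}

// Broad phase: bucket sphere indices by grid cell. Only spheres in the same
// bucket need the narrow-phase test, cutting the number of pairs checked.
static std::map<long long, std::vector<int>> BuildGrid(const std::vector<sphere_t> &s,
                                                       float cell) {
	std::map<long long, std::vector<int>> grid;
	for (int i = 0; i < (int)s.size(); i++)
		grid[CellKey(s[i].cx, s[i].cy, s[i].cz, cell)].push_back(i);
	return grid;
}
```

Note this sketch only inserts each sphere into the cell containing its center; a real implementation would insert it into every cell its radius overlaps, so that pairs straddling a cell boundary are not missed.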

I recommend this good book on the general topic: