Why does this simple program take more time to run on more threads? (nvprof)

Hello everyone
I’m following this tutorial.

Running the program:

#include<iostream>
#include<cmath>

__global__
void add(int n, float *x, float *y){
	for (int i = 0; i < n; i++)
	y[i] = x[i] + y[i];
}

int main(){
	int N = 1 << 20;

	float *x, *y;

	cudaMallocManaged(&x, N * sizeof(float));
	cudaMallocManaged(&y, N * sizeof(float));

	for (int i = 0; i < N; i++){
		x[i] = 1.0f;
		y[i] = 2.0f;
	}

	add<<<1,1>>>(N, x, y);

	cudaDeviceSynchronize();

	float maxError = 0.0f;
	for (int i = 0; i < N; i++){
		maxError = fmax(maxError, fabs(y[i] - 3.0f));
	}

	std::cout << "Max error: " << maxError << std::endl;

	cudaFree(x);
	cudaFree(y);

	return 0;
}

Compiling the code with nvcc hello.cu and profiling with nvprof ./a.out, I got:

GPU activities: 100.00% 186.86ms 1 186.86ms 186.86ms 186.86ms add(int, float*, float*)

Now changing <<<1,1>>> to <<<1,256>>>:

#include<iostream>
#include<cmath>

__global__
void add(int n, float *x, float *y){
	for (int i = 0; i < n; i++)
	y[i] = x[i] + y[i];
}

int main(){
	int N = 1 << 20;

	float *x, *y;

	cudaMallocManaged(&x, N * sizeof(float));
	cudaMallocManaged(&y, N * sizeof(float));

	for (int i = 0; i < N; i++){
		x[i] = 1.0f;
		y[i] = 2.0f;
	}

	add<<<1,256>>>(N, x, y);

	cudaDeviceSynchronize();

	float maxError = 0.0f;
	for (int i = 0; i < N; i++){
		maxError = fmax(maxError, fabs(y[i] - 3.0f));
	}

	std::cout << "Max error: " << maxError << std::endl;

	cudaFree(x);
	cudaFree(y);

	return 0;
}

I got

GPU activities: 100.00% 228.32ms 1 228.32ms 228.32ms 228.32ms add(int, float*, float*)

Isn’t running more threads supposed to make the program run faster?
This is the exact opposite of what the tutorial says.
Why is that? Am I doing something wrong here?
Thanks in advance

Just continue with the tutorial.
The given kernel does not distribute the workload among the threads. The kernel code is executed by every thread, so each thread loops over all N elements. That is why it must only be launched with one block of one thread (<<<1,1>>>); otherwise it would not work correctly.
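For comparison, the tutorial's later steps distribute the work with an index/stride pattern. A minimal sketch (not the exact tutorial code) looks like this:

__global__
void add(int n, float *x, float *y){
	// each thread starts at its own global index and advances by the
	// total number of threads, so the N elements are split across threads
	int index = blockIdx.x * blockDim.x + threadIdx.x;
	int stride = blockDim.x * gridDim.x;
	for (int i = index; i < n; i += stride)
		y[i] = x[i] + y[i];
}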


As striker159 said above, the kernel function does not distribute the workload across threads. Whether you use 1 or 256 threads, the workload per thread stays the same, so the more threads you launch, the more total work the GPU has to perform.

And with any form of parallel processing (whether on CPU or GPU), spawning new threads always comes with runtime overhead. You will typically need to run a large workload before you notice a significant improvement in runtime performance compared to just computing everything in series.

Why don’t you change your kernel function as follows:

__global__
void add (int n, float *x, float *y) {
    // each thread handles its own contiguous batch of n elements
    int base = n * threadIdx.x;
    for (int i = base; i < base + n; i++) {
        y[i] = x[i] + y[i];
    }
}

and pass your batch size as n, rather than your array size, i.e. array size / thread count.
This would distribute the workload across the threads and might scale better.
It would be interesting to see what happens here.
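The launch could then look roughly like this (just a sketch; it assumes the array size is an exact multiple of the thread count, otherwise the leftover tail elements would need separate handling):

int threads = 256;        // one block of 256 threads
int batch = N / threads;  // elements handled by each thread

add<<<1, threads>>>(batch, x, y);
cudaDeviceSynchronize();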