Persistent kernel runs slower with more threads

Hi, I’m new to CUDA. I’m trying to use a persistent kernel to reduce the launch-scheduling and synchronization overhead of a high-frequency operation, but the kernel seems to run slower as I increase the total number of threads, while the one-time kernel’s runtime barely changes. What is the reason for this, and how can I fix it? Thank you.

#include <cuda_runtime.h>
#include <stdio.h>
#include <time.h>

// Empty kernel, launched once to absorb one-time startup/launch overhead.
__global__ void warmup(int* data){
    int id = threadIdx.x + blockIdx.x * blockDim.x;
}

// Baseline: one regular kernel launch per operation.
__global__ void onetime_kernel(int* data){
    int id = threadIdx.x + blockIdx.x * blockDim.x;
    data[id] = id;
}

// Persistent kernel: spins until the host clears exit_flag.
// Each time the host raises check_flag, every thread writes its element,
// and thread 0 clears check_flag to signal the host that the round is done.
__global__ void persistent_kernel(volatile int* exit_flag, volatile int* check_flag, int* data){
    int id = threadIdx.x + blockIdx.x * blockDim.x;

    while(*exit_flag == 1){
        if(*check_flag == 1){
            data[id] = id;
            __threadfence_system();          // make data[] visible to the host before clearing the flag
            if(id == 0) *check_flag = 0;     // signal the host
        }
    }
}


int main(int argc, char **argv){

    int block;
    int thread_per_block;

    if(argc == 3){
        sscanf(argv[1], "%d", &block);
        sscanf(argv[2], "%d", &thread_per_block);
    }
    else{
        block = 1;
        thread_per_block = 20;
    }

    int threads = block * thread_per_block;
    printf("block: %d; thread_per_block: %d; total_threads: %d\n", block, thread_per_block, threads);

    int *data_;
    cudaMallocManaged(&data_, sizeof(int) * threads);
    for(int i = 0; i < threads; ++i) data_[i] = 0;

    cudaStream_t *stream_ = new cudaStream_t[1];
    cudaStreamCreate(&stream_[0]);

    // Zero-copy (mapped, pinned) host flags shared between host and device.
    cudaSetDeviceFlags(cudaDeviceMapHost);
    volatile int *exit_flag_d, *exit_flag_h;
    cudaHostAlloc((void **)&exit_flag_h, sizeof(int), cudaHostAllocMapped);
    cudaHostGetDevicePointer((int **)&exit_flag_d, (int *)exit_flag_h, 0);

    volatile int *check_flag_d, *check_flag_h;
    cudaHostAlloc((void **)&check_flag_h, sizeof(int), cudaHostAllocMapped);
    cudaHostGetDevicePointer((int **)&check_flag_d, (int *)check_flag_h, 0);

    warmup<<< 32, 64, 0, stream_[0] >>>(data_);
    cudaDeviceSynchronize();

    onetime_kernel<<< block, thread_per_block, 0, stream_[0] >>>(data_);
    cudaDeviceSynchronize();
    for(int i = 0; i < threads; ++i) data_[i] = 0;

    *exit_flag_h = 1;
    *check_flag_h = 0;

    persistent_kernel<<< block, thread_per_block, 0, stream_[0] >>>(exit_flag_d, check_flag_d, data_);

    clock_t time_1 = clock();

    // Request one round of work and busy-wait until the kernel reports completion.
    *check_flag_h = 1;
    while(*check_flag_h == 1);

    clock_t time_2 = clock();
    float timer = (float)(time_2 - time_1) / CLOCKS_PER_SEC;

    printf("timer: %f(ms)\n", timer * 1000);

    // Tell the persistent kernel to exit and wait for it to finish.
    *exit_flag_h = 0;
    cudaDeviceSynchronize();
}

And I got the following results:

$ nsys profile --stats=true ./test
block: 1; thread_per_block: 20; total_threads: 20
timer: 0.006000(ms)
...
Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)     GridXYZ         BlockXYZ                               Name                          
 --------  ---------------  ---------  --------  --------  --------  --------  -----------  --------------  --------------  --------------------------------------------------------
     80.6           33,952          1  33,952.0  33,952.0    33,952    33,952          0.0     1    1    1    20    1    1  persistent_kernel(volatile int *, volatile int *, int *)
     11.0            4,640          1   4,640.0   4,640.0     4,640     4,640          0.0     1    1    1    20    1    1  onetime_kernel(int *)                                   
      8.4            3,552          1   3,552.0   3,552.0     3,552     3,552          0.0    32    1    1    64    1    1  warmup(int *)                                           
$ nsys profile --stats=true ./test 8 32
block: 8; thread_per_block: 32; total_threads: 256
timer: 0.011000(ms)
...
Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)     GridXYZ         BlockXYZ                               Name                          
 --------  ---------------  ---------  --------  --------  --------  --------  -----------  --------------  --------------  --------------------------------------------------------
     83.5           43,072          1  43,072.0  43,072.0    43,072    43,072          0.0     8    1    1    32    1    1  persistent_kernel(volatile int *, volatile int *, int *)
      9.3            4,800          1   4,800.0   4,800.0     4,800     4,800          0.0     8    1    1    32    1    1  onetime_kernel(int *)                                   
      7.2            3,712          1   3,712.0   3,712.0     3,712     3,712          0.0    32    1    1    64    1    1  warmup(int *)                                           
$ nsys profile --stats=true ./test 16 32
block: 16; thread_per_block: 32; total_threads: 512
timer: 0.022000(ms)
...
Time (%)  Total Time (ns)  Instances  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)     GridXYZ         BlockXYZ                               Name                          
 --------  ---------------  ---------  --------  --------  --------  --------  -----------  --------------  --------------  --------------------------------------------------------
     88.0           60,960          1  60,960.0  60,960.0    60,960    60,960          0.0    16    1    1    32    1    1  persistent_kernel(volatile int *, volatile int *, int *)
      6.8            4,736          1   4,736.0   4,736.0     4,736     4,736          0.0    16    1    1    32    1    1  onetime_kernel(int *)                                   
      5.2            3,584          1   3,584.0   3,584.0     3,584     3,584          0.0    32    1    1    64    1    1  warmup(int *)                                           

$ nsys profile --stats=true ./test 16 128
block: 16; thread_per_block: 128; total_threads: 2048
timer: 0.090000(ms)
...
Time (%)  Total Time (ns)  Instances  Avg (ns)   Med (ns)   Min (ns)  Max (ns)  StdDev (ns)     GridXYZ         BlockXYZ                               Name                          
 --------  ---------------  ---------  ---------  ---------  --------  --------  -----------  --------------  --------------  --------------------------------------------------------
     95.4          169,600          1  169,600.0  169,600.0   169,600   169,600          0.0    16    1    1   128    1    1  persistent_kernel(volatile int *, volatile int *, int *)
      2.7            4,864          1    4,864.0    4,864.0     4,864     4,864          0.0    16    1    1   128    1    1  onetime_kernel(int *)                                   
      1.9            3,360          1    3,360.0    3,360.0     3,360     3,360          0.0    32    1    1    64    1    1  warmup(int *)                                           
$ nsys profile --stats=true ./test 16 512
block: 16; thread_per_block: 512; total_threads: 8192
timer: 0.320000(ms)
...
Time (%)  Total Time (ns)  Instances  Avg (ns)   Med (ns)   Min (ns)  Max (ns)  StdDev (ns)     GridXYZ         BlockXYZ                               Name                          
 --------  ---------------  ---------  ---------  ---------  --------  --------  -----------  --------------  --------------  --------------------------------------------------------
     98.4          546,208          1  546,208.0  546,208.0   546,208   546,208          0.0    16    1    1   512    1    1  persistent_kernel(volatile int *, volatile int *, int *)
      0.9            5,120          1    5,120.0    5,120.0     5,120     5,120          0.0    16    1    1   512    1    1  onetime_kernel(int *)                                   
      0.6            3,488          1    3,488.0    3,488.0     3,488     3,488          0.0    32    1    1    64    1    1  warmup(int *)                                           

FYI, I’m working on a Jetson Orin NX 16GB with JetPack 5.1.1 and CUDA 11.4.

This would appear to happen because of the use of __threadfence_system(). It ensures that preceding writes by one thread are made visible to all other threads in the system. That is expensive in general, and more so as the number of threads increases. In my mental model I would class this as synchronization overhead whose cost increases with the number of entities that need to be synchronized, and such an increase may well be more than linear.

In contrast, warmup is essentially a null kernel (the id computation is optimized away) that measures basic kernel overhead, and onetime_kernel does a minimum of “fire and forget” memory activity using a smallish number of threads, meaning that while the execution time increases linearly with the number of threads, it is still dominated by basic kernel overhead.

I am before my first coffee of the day so my insights may be limited. Best to wait for other forum participants who may provide more cogent comments.

Hi there, thank you for responding. I actually reran the code with __threadfence_system() commented out just to see whether it was the reason, but the results and the behavior are pretty similar; it didn’t change the numbers by much. I know that __threadfence_system() can be time consuming, but I need some synchronization method to make the data ready for the CPU to access. I also tried __syncthreads() plus letting the first thread of each block atomicAdd to a flag for the host, and that does not change the behavior by much either.

I don’t have a Jetson GPU to work on, so you may get better answers to Jetson questions by asking on a Jetson forum. Since Jetson uses a single physical memory for host and device, it’s possible that the relative effects differ somewhat when performing host-device communication via (pinned) memory, as you are doing.

But when I run your code on my L4 GPU I see roughly similar behavior for the persistent kernel. In a nutshell, I would describe your code as waiting for thread 0 to set a memory location (check_flag) to zero, after which your kernel is signalled to exit. So we can break this (perhaps) into three pieces:

  1. The time it takes your kernel code to set the check_flag to zero.
  2. The time it takes for host code to recognize this, and set exit_flag to zero.
  3. The time it takes for your kernel to exit, after it has been signalled to exit.

Item 2 is probably a relatively fixed duration, I would guess.

I think for item 1, it stands to reason that as we give the warp schedulers more options, it may take longer for them to get around to scheduling the warp that contains the thread whose id is zero, all the way to the point of setting the check flag to zero. Indeed, if I simply remove the if (id == 0) test from that case, the persistent kernel duration drops by half. So I would say a big part of the increase is the time it takes to schedule the warp that includes id zero, and it seems reasonable to me that this time would tend toward larger values as the grid grows.
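To be concrete, removing that test just means the loop body becomes:

while (*exit_flag == 1) {
    if (*check_flag == 1) {
        data[id] = id;
        __threadfence_system();
        *check_flag = 0;   // every thread clears the flag, so nothing waits on the particular warp that holds id == 0
    }
}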

For item 3, it also seems sensible to me that as I add more threads to the grid, it is going to take more time to get all of those threads to the outer while statement that reads exit_flag, so I would expect this portion to generally increase as the grid is made larger. If we had two warps per warp scheduler instead of one, and we imagined that a warp is already ready to issue, then I would expect that warp scheduler to take roughly twice as long to get twice as many warps to the same point.

I guess part of your question then might be “why don’t the other kernels behave the same?” The simplest explanation is that those other kernels exhibit no dependencies of any kind. Nothing prevents immediate issue and retirement of each thread.

Thank you, that makes a lot of sense. Do you have any recommendation for addressing item 1, perhaps by changing how a flag is used to signal the host? For my application I need to repeat the process between time_1 and time_2 hundreds of times before setting exit_flag to 0, and the host needs to know when all the threads have finished their operation so it can read the data after every check_flag round. If the if (id == 0) test is removed, will there be some kind of write conflict between threads?

I don’t think there is anything to fix. If you want to wait until all threads do some piece of work before the signal, then you need a synchronization of some sort (the if (id == 0) is not sufficient for that), and certainly as you have more threads doing the work, my expectation is that it will take longer.
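That said, if the requirement is that the host signal means “every thread has written its element for this round”, one pattern you could experiment with is to count arrivals with an atomic and let the last arriving thread signal the host. The following is only a sketch, not something I have tested on Jetson; round and done_round would be mapped host flags set up like your existing ones, done_count an ordinary cudaMalloc’ed counter initialized to zero, and, like your current kernel, it assumes the whole grid is resident on the GPU at once:

__global__ void persistent_kernel_counted(volatile int* exit_flag,
                                          volatile unsigned int* round,      // host increments this to request a round of work
                                          volatile unsigned int* done_round, // kernel reports the last completed round here
                                          unsigned int* done_count,          // device-side arrival counter, starts at 0
                                          int* data)
{
    int id    = threadIdx.x + blockIdx.x * blockDim.x;
    int total = gridDim.x * blockDim.x;
    unsigned int my_round = 0;

    while (*exit_flag == 1) {
        unsigned int r = *round;
        if (r != my_round) {                               // the host has requested a new round
            my_round = r;
            data[id] = id;                                 // per-thread work for this round
            __threadfence_system();                        // publish this thread's write before it is counted
            if (atomicAdd(done_count, 1u) + 1u == (unsigned int)total) {
                atomicExch(done_count, 0u);                // last thread to finish re-arms the counter
                __threadfence_system();                    // order the counter reset before the completion signal
                *done_round = my_round;                    // tell the host the whole round is done
            }
        }
    }
}

On the host, each iteration would then increment a local round number, write it to *round_h, and spin on while (*done_round_h != round_number); before reading data_. Whether that ends up cheaper than your current scheme on Orin NX is something you would have to measure.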

If you believe that you can increase the amount of work a GPU does with no increase in kernel execution time, then, if it were me, I would reset that expectation.

But the one-time kernel does not take much longer with more threads. Is that because the persistent kernel has a different mechanism?

Before we talk about a specific case, I would like to repeat what I consider to be a general principle: If you ask the GPU to do more work, it seems quite reasonable to me that the general trend would be increasing execution time.

My guess on the outliers you have presented is that they are behaving that way because:

  • there is no kernel code that prevents each thread from being immediately issued and retired (what I already stated) and
  • you have not modified grid sizes enough to see the general trend that I referenced first in this response

Your persistent kernel contains code that waits on host activity (the while loop(s)). Such code cannot be immediately issued and retired; it depends on other conditions. That description does not apply to your other two kernels.

I’m fairly confident that if you continue to increase the grid size of your other two kernels, you will eventually see the duration of those kernels increase as well, consistent with my initial claim in this response: as a general trend, asking the GPU to do more work in a kernel is likely to make that kernel take longer.
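(For example, comparing the onetime_kernel row of the nsys output for runs like ./test 64 1024 or ./test 256 1024 against the small configurations above should show that trend without any code changes; those launch configurations are just suggestions, not something I have profiled here.)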