CPU multi-thread code causes my kernel function to run very slowly

My program has run into a very strange bug. When I launch the kernel directly from the main function, the program runs perfectly; however, when I add CPU multithreading code, the kernel becomes very slow to execute, even though the other CPU thread does not interact with the kernel at all and uses very few resources.

My code looks like this:

#include <thread>
#include <atomic>
#include <windows.h>   // for Sleep(); std::this_thread::sleep_for is the portable equivalent

std::atomic<int> count{0};   // shared between both threads, so it must be atomic

void buffer_thread(){
    // Polls until the main thread has finished its loop.
    while (count < 640) {
        Sleep(20);
    }
}

int main(){
    std::thread buffer_t(buffer_thread);
    while (count < 300) {
        // some code that implements kernel function
        count++;
    }
    // Note: as written, count never reaches 640, so buffer_thread never
    // exits and this join() blocks forever; the real code presumably
    // increments count further.
    buffer_t.join();
    return 0;
}

As you can see, buffer_t has no relationship to the kernel. But when I remove the code related to buffer_t, my kernel runs much faster.

Putting a thread to sleep and waking it will result in slower CUDA activity in that thread for a period of time after the wake. If you are doing this repeatedly, that could result in an observation of fairly continuous “slowness”.
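
One way to probe whether the runtime's wait policy is a factor (a minimal sketch, not a guaranteed fix — it assumes the default yield/sleep synchronization behavior is involved) is to ask the runtime to spin-wait instead, before any other CUDA call in the process:

#include <cuda_runtime.h>
#include <cstdio>

int main(){
  // Request spin-waiting instead of yielding/sleeping while the host
  // waits on GPU work. Call this before any other CUDA runtime call,
  // so it takes effect before the context is created.
  cudaError_t err = cudaSetDeviceFlags(cudaDeviceScheduleSpin);
  if (err != cudaSuccess)
    printf("cudaSetDeviceFlags: %s\n", cudaGetErrorString(err));
  // ... create threads and launch kernels as before ...
  return 0;
}

If the slowdown disappears with the spin policy, that would point at host-thread scheduling/wake latency rather than the kernel itself.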

Thanks for your reply, but I am executing the kernel function in the main function, while Sleep(20) is in another CPU thread.

And placing the kernel function in another CPU thread, instead of executing it directly in the main function, also results in very slow execution:

#include <thread>

int count = 0;   // only accessed from buffer_thread here, so a plain int is fine

void buffer_thread(){
    while (count < 640) {
        // Some code that implements the kernel function
        count++;
    }
}

int main(){
    std::thread buffer_t(buffer_thread);
    buffer_t.join();   // main does nothing but wait for the worker
    return 0;
}

However, executing my kernel function directly in the main function runs perfectly for me.

I wasn’t able to see any issue in my test case:

# cat t218.cu
#include <thread>
#include <iostream>
#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL

unsigned long long dtime_usec(unsigned long long start=0){
  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

template <typename T>
__global__ void k(const T *a, const T *b, T *c, int s){
  int idx = threadIdx.x+blockDim.x*blockIdx.x;
  int idy = threadIdx.y+blockDim.y*blockIdx.y;
  if ((idx < s) && (idy < s)){
    T val = 0;
    for (int i = 0; i < s; i++)
      val += a[idy*s+i] * b[i*s+idx];
    c[idy*s+idx] = val;
  }
}


int count=0;
void buffer_thread(){
    const int s = 1024;
    float *a, *b, *c;
    cudaMalloc(&a, s*s*sizeof(float));
    cudaMalloc(&b, s*s*sizeof(float));
    cudaMalloc(&c, s*s*sizeof(float));
    dim3 block(32,32);
    dim3 grid((s+block.x-1)/block.x, (s+block.y-1)/block.y);
    unsigned long long dt = dtime_usec(0);
    while (count < 640) {
      // Some code that implements the kernel function
      k<<<grid,block>>>(a, b, c, s);
      count++;
    }
    cudaDeviceSynchronize();
    dt = dtime_usec(dt);
    std::cout << "elapsed: " << dt/(float)USECPSEC << "s" << std::endl;
}

int main(){
#ifdef USE_THREAD
     std::thread buffer_t(buffer_thread);
     buffer_t.join();
#else
     buffer_thread();
#endif
     return 0;
}
# nvcc -o t218 t218.cu
# ./t218
elapsed: 0.80986s
# nvcc -o t218 t218.cu -DUSE_THREAD
# ./t218
elapsed: 0.812373s
#

CUDA 12.2, L4 GPU, Linux
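
If you still see the slowdown with a comparable test on your setup, a profiler run would show where the time is actually going. Assuming Nsight Systems from the CUDA toolkit is installed, something like:

# nsys profile --stats=true ./t218

would break down kernel and API times in both configurations.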

Thanks a lot. My latest test results differ from my previous ones; it seems the slowdown of my kernel function is indeed unrelated to CPU multithreading. Thanks for your help.