cudaMemset too slow on Xavier

Hi there,
I’m a beginner in CUDA programming and am currently working on a Jetson Xavier. I’d like to clear a device memory buffer (around 800 MB) using cudaMemset, but the call takes about 10 ms. I tried replacing cudaMemset with a kernel I wrote myself, but the time taken is roughly the same. Is there a reason this operation is so slow? Any advice? Thanks a lot!


Have you maximized the device performance first?

$ sudo nvpmodel -m 0
$ sudo jetson_clocks

If the issue persists, please share a simple reproducible source so that we can check it.

Yes, I have used the commands above.

My code is like this:

#include <chrono>
#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Defined below in this post.
__global__ void memset_kernel(const int height, const int width, int *array);

int main(int argc, char **argv) {
  cudaError_t cerr = cudaSuccess;

  std::chrono::high_resolution_clock::time_point start, end;
  double cudamemset_time_ms = 0.0f;
  double memset_kernel_time_ms = 0.0f;

  int height = 320 * 1040;
  int width = 640;
  int N = height * width;

  // Fill a host buffer with random values and copy it to the device.
  int *h_array = (int *)malloc(sizeof(int) * N);
  for (int i = 0; i < N; ++i) {
    h_array[i] = rand();
  }

  int *d_array;
  cerr = cudaMalloc(&d_array, sizeof(int) * N);
  cerr = cudaMemcpy(d_array, h_array, sizeof(int) * N, cudaMemcpyHostToDevice);

  // Time cudaMemset with a host-side clock.
  start = std::chrono::high_resolution_clock::now();

  cerr = cudaMemset(d_array, 0, sizeof(int) * N);

  end = std::chrono::high_resolution_clock::now();
  auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
  cudamemset_time_ms = static_cast<float>(duration) / 1000.0f;

  // Time the custom kernel (host-side clock, no explicit synchronization).
  start = std::chrono::high_resolution_clock::now();

  size_t bs_mk_x = 16, bs_mk_y = 16;
  size_t gs_mk_x = std::ceil(1.0f * width / bs_mk_x);
  size_t gs_mk_y = std::ceil(1.0f * height / bs_mk_y);
  dim3 bs_mk(bs_mk_x, bs_mk_y);
  dim3 gs_mk(gs_mk_x, gs_mk_y);
  memset_kernel<<<gs_mk, bs_mk>>>(height, width, d_array);

  end = std::chrono::high_resolution_clock::now();
  duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start).count();
  memset_kernel_time_ms = static_cast<float>(duration) / 1000.0f;

  printf("\n=============== PROFILING INFO (ms) ===============\n");
  printf("[cudamemset]               %f\n", cudamemset_time_ms);
  printf("[memset_kernel]            %f\n", memset_kernel_time_ms);

  return 0;
}

The kernel code is as follows:

__global__ void memset_kernel(const int height, const int width, int *array) {
  // One thread per element of the height x width array.
  const int ix = blockIdx.x * blockDim.x + threadIdx.x;
  const int iy = blockIdx.y * blockDim.y + threadIdx.y;

  // Guard against threads that fall outside the array bounds.
  if (ix < width && iy < height) {
    const int tid = ix + iy * width;
    array[tid] = 0;
  }
}
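
For reference, the grid size in the code above is rounded up with a floating-point std::ceil. A minimal host-side sketch of the same arithmetic in integer math follows; the helper names ceil_div and threads_launched are hypothetical and not part of the posted code. Integer math sidesteps any float rounding concerns for very large dimensions, although for the sizes in this post both approaches should give the same result.

```cpp
#include <cstddef>

// Hypothetical helper: integer ceiling division, i.e. ceil(n / d) for d > 0.
constexpr std::size_t ceil_div(std::size_t n, std::size_t d) {
  return (n + d - 1) / d;
}

// Total number of threads launched by a 2D grid sized with ceil_div,
// given the array dimensions and the block dimensions (bx, by).
constexpr std::size_t threads_launched(std::size_t width, std::size_t height,
                                       std::size_t bx, std::size_t by) {
  return ceil_div(width, bx) * bx * ceil_div(height, by) * by;
}
```

With width = 640, height = 320 * 1040 and 16 x 16 blocks, ceil_div gives a 40 x 20800 grid, and since both dimensions divide evenly by 16, every launched thread maps to exactly one array element.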

After compiling and running (with the performance commands above already applied), the output is:

=============== PROFILING INFO (ms) ===============
[cudamemset] 10.029000
[memset_kernel] 13.189000

The time taken is ~10ms. Is there any solution for this?
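
To put the numbers above in context, here is a rough sketch of the write throughput they imply. It uses only the buffer size and timings from this post; the helper names are hypothetical, and 1 GB is taken as 1e9 bytes. It also assumes the host-side timings reflect the full device-side work, which the posted code does not guarantee.

```cpp
#include <cstddef>
#include <cstdio>

// Bytes written by the memset in the post: height * width elements.
constexpr std::size_t buffer_bytes(std::size_t height, std::size_t width,
                                   std::size_t elem_size) {
  return height * width * elem_size;
}

// Effective throughput in GB/s (1 GB = 1e9 bytes) for `bytes` moved in `ms`.
double effective_gb_per_s(std::size_t bytes, double ms) {
  return static_cast<double>(bytes) / (ms * 1e-3) / 1e9;
}

// Prints the implied write bandwidth for both timings reported in the post.
void print_implied_bandwidth() {
  const std::size_t bytes = buffer_bytes(320 * 1040, 640, sizeof(int));
  std::printf("buffer size:   %zu bytes\n", bytes);
  std::printf("cudaMemset:    %.1f GB/s\n", effective_gb_per_s(bytes, 10.029));
  std::printf("memset_kernel: %.1f GB/s\n", effective_gb_per_s(bytes, 13.189));
}
```

The buffer is 851,968,000 bytes, so 10.029 ms works out to about 85 GB/s and 13.189 ms to about 65 GB/s. If the board is an AGX Xavier (an assumption, since the post only says "Xavier"), NVIDIA's published peak memory bandwidth is roughly 137 GB/s, so the memset may already be running at a sizable fraction of what the DRAM can deliver.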

Any update on this issue?


Sorry for the late update.

We have some examples that demonstrate the usage of cudaMemset.
Would you mind giving them a try first?



I could not find any example for cudaMemset, only one for cudaMemcpy. Could you please test my demo code above?