NVIDIA Pascal GPU is very slow at running CUDA kernels when using cudaMallocManaged

I was testing the new CUDA 8 along with the Pascal Titan X GPU, expecting a speedup for my code, but for some reason it ends up being slower. I am on Ubuntu 16.04.

Here is the minimal code that reproduces the result:

CUDASample.cuh

#include <vector>

class CUDASample{
 public:
  void AddOneToVector(std::vector<int> &in);
};

CUDASample.cu

#include "CUDASample.cuh"

__global__ static void CUDAKernelAddOneToVector(int *data)
{
  const int x  = blockIdx.x * blockDim.x + threadIdx.x;
  const int y  = blockIdx.y * blockDim.y + threadIdx.y;
  const int mx = gridDim.x * blockDim.x;

  data[y * mx + x] = data[y * mx + x] + 1.0f;
}

void CUDASample::AddOneToVector(std::vector<int> &in){
  int *data;
  cudaMallocManaged(reinterpret_cast<void **>(&data),
                    in.size() * sizeof(int),
                    cudaMemAttachGlobal);

  for (std::size_t i = 0; i < in.size(); i++){
    data[i] = in.at(i);
  }

  dim3 blks(in.size()/(16*32),1);
  dim3 threads(32, 16);

  CUDAKernelAddOneToVector<<<blks, threads>>>(data);

  cudaDeviceSynchronize();

  for (std::size_t i = 0; i < in.size(); i++){
    in.at(i) = data[i];
  }

  cudaFree(data);
}

Main.cpp

#include "CUDASample.cuh"

#include <vector>

int main(){
  std::vector<int> v;

  for (int i = 0; i < 8192000; i++){
    v.push_back(i);
  }

  CUDASample cudasample;

  cudasample.AddOneToVector(v);

  return 0;
}

The only difference is the NVCC flag, which for the Pascal Titan X is

-gencode arch=compute_61,code=sm_61 -std=c++11

and for the old Maxwell Titan X is

-gencode arch=compute_52,code=sm_52 -std=c++11

Here are the results from running the NVIDIA Visual Profiler.

For the old Maxwell Titan, the time for memory transfer is around 205 ms, and the kernel launch is around 268 us.

For the Pascal Titan, the time for memory transfer is around 202 ms, and the kernel launch is around an insanely long 8343 us, which makes me believe something is wrong.

I further isolated the problem by replacing cudaMallocManaged with good old cudaMalloc, did some profiling, and observed some interesting results.

CUDASample.cu

#include "CUDASample.cuh"

__global__ static void CUDAKernelAddOneToVector(int *data)
{
  const int x  = blockIdx.x * blockDim.x + threadIdx.x;
  const int y  = blockIdx.y * blockDim.y + threadIdx.y;
  const int mx = gridDim.x * blockDim.x;

  data[y * mx + x] = data[y * mx + x] + 1.0f;
}

void CUDASample::AddOneToVector(std::vector<int> &in){
  int *data;
  cudaMalloc(reinterpret_cast<void **>(&data), in.size() * sizeof(int));
  cudaMemcpy(reinterpret_cast<void*>(data),reinterpret_cast<void*>(in.data()), 
             in.size() * sizeof(int), cudaMemcpyHostToDevice);

  dim3 blks(in.size()/(16*32),1);
  dim3 threads(32, 16);

  CUDAKernelAddOneToVector<<<blks, threads>>>(data);

  cudaDeviceSynchronize();

  cudaMemcpy(reinterpret_cast<void*>(in.data()),reinterpret_cast<void*>(data), 
             in.size() * sizeof(int), cudaMemcpyDeviceToHost);

  cudaFree(data);
}

For the old Maxwell Titan, the time for memory transfer is around 5 ms both ways, and the kernel launch is around 264 us.

For the Pascal Titan, the time for memory transfer is around 5 ms both ways, and the kernel launch is around 194 us, which actually results in the performance increase I am hoping to see…
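For anyone who wants to cross-check the kernel time outside of the Visual Profiler, here is a minimal sketch that times the launch with CUDA events. It is meant to be dropped into AddOneToVector around the kernel launch (data, blks, and threads as defined above) and uses only standard CUDA runtime event calls; printf requires <cstdio>.

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
CUDAKernelAddOneToVector<<<blks, threads>>>(data);
cudaEventRecord(stop);

cudaEventSynchronize(stop);
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("kernel time: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);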

Why is the Pascal GPU so slow at running CUDA kernels when cudaMallocManaged is used? It would be a travesty if I have to revert all my existing code that uses cudaMallocManaged to cudaMalloc. This experiment also shows that the memory transfer time using cudaMallocManaged is a lot slower than using cudaMalloc, which also feels like something is wrong. If using it results in a slow run time even though the code is easier, that is unacceptable, because the whole purpose of using CUDA instead of plain C++ is to speed things up. What am I doing wrong, and why am I observing this kind of result?

Just to clarify: By “new CUDA 8” you are referring to the final release of CUDA 8.0 that shipped on Wednesday 9/28/2016? And the comparison between the Maxwell Titan X and the Pascal Titan X was a controlled experiment (the exact same software and the exact same host computer were used; only the GPU was swapped)?

Yes, the final release of CUDA 8 that shipped on Wednesday. Exact same computer. Exact same code. Only the GPU swapped.

Does the relative performance (Maxwell/Pascal) change much when you make the vector a lot longer, say 100M elements? The present length should be big enough to avoid performance artifacts, but it would be useful to double-check that assumption.

If you are not currently running with the latest non-experimental driver available for your platform, I would suggest updating the driver. The nature of the issue suggests it could be driver related.

I have no hypothesis as to what could be going on here, and I don’t have the hardware to reproduce the observations. I do not see any obvious issues with your experiments. I would suggest waiting a couple days for feedback from forum participants, and if no good explanation emerges, filing a bug with NVIDIA (the bug reporting form is linked directly from the CUDA registered developer website).

There will be memory capacity issues if I allocate a memory chunk for 100M elements. I tried the largest sizes my computer can handle, and they show similar results.

The driver I am using is the latest for Linux, which is 367.44.

I suspect it has something to do with the new Unified Memory framework as described in https://devblogs.nvidia.com/parallelforall/cuda-8-features-revealed/. It doesn’t copy the whole buffer from CPU to GPU before the kernel launch; instead it tries to “cleverly” copy the data on demand, which in my experiment fails miserably.
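If that is what is happening, CUDA 8 also added cudaMemPrefetchAsync, which migrates a managed allocation to a given device in bulk before the kernel touches it. Here is a minimal, untested sketch of how it could be used in the managed-memory version of AddOneToVector above (assuming device 0 and the default stream):

int device = 0;
cudaGetDevice(&device);

// Migrate the managed buffer to the GPU up front instead of letting the
// kernel fault the pages in on demand.
cudaMemPrefetchAsync(data, in.size() * sizeof(int), device);

CUDAKernelAddOneToVector<<<blks, threads>>>(data);
cudaDeviceSynchronize();

// Bring the data back to the host before reading the results on the CPU.
cudaMemPrefetchAsync(data, in.size() * sizeof(int), cudaCpuDeviceId);
cudaDeviceSynchronize();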

I am confused that you state a vector of 100M raises “memory capacity issues”. Unless I misread the code, the vector used is a vector of 32-bit ‘int’ elements, so 100M elements would require 400 MB, which should fit easily on both the host and the device. What am I missing?

I figured out why a vector of 100M fails: I have to change the plus 1.0f in the kernel to plus 1.
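For reference, the fixed kernel just uses an integer literal. Presumably the + 1.0f version promotes the value to float, which cannot represent integers above 2^24 exactly, so the 8M-element test still works but a 100M-element run comes back with wrong values.

__global__ static void CUDAKernelAddOneToVector(int *data)
{
  const int x  = blockIdx.x * blockDim.x + threadIdx.x;
  const int y  = blockIdx.y * blockDim.y + threadIdx.y;
  const int mx = gridDim.x * blockDim.x;

  // Integer increment; the earlier + 1.0f routed the value through a float.
  data[y * mx + x] = data[y * mx + x] + 1;
}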

When testing with a 100M-element vector, I am getting a similar slowdown when using cudaMallocManaged with the Pascal GPU. It is still unclear why this is happening.