NVIDIA Pascal GPU is very slow at running CUDA kernels when using cudaMallocManaged

I was testing the new CUDA 8 along with the Pascal Titan X GPU, expecting a speedup for my code, but for some reason it ends up being slower. I am on Ubuntu 16.04.

Here is the minimal code that reproduces the result:

CUDASample.cuh

#include <vector>

class CUDASample{
 public:
  void AddOneToVector(std::vector<int> &in);
};

CUDASample.cu

#include "CUDASample.cuh"

__global__ static void CUDAKernelAddOneToVector(int *data)
{
  const int x  = blockIdx.x * blockDim.x + threadIdx.x;
  const int y  = blockIdx.y * blockDim.y + threadIdx.y;
  const int mx = gridDim.x * blockDim.x;

  data[y * mx + x] = data[y * mx + x] + 1.0f;
}

void CUDASample::AddOneToVector(std::vector<int> &in){
  int *data;
  cudaMallocManaged(reinterpret_cast<void **>(&data),
                    in.size() * sizeof(int),
                    cudaMemAttachGlobal);

  for (std::size_t i = 0; i < in.size(); i++){
    data[i] = in.at(i);
  }

  dim3 blks(in.size()/(16*32),1);
  dim3 threads(32, 16);

  CUDAKernelAddOneToVector<<<blks, threads>>>(data);

  cudaDeviceSynchronize();

  for (std::size_t i = 0; i < in.size(); i++){
    in.at(i) = data[i];
  }

  cudaFree(data);
}

Main.cpp

#include "CUDASample.cuh"

#include <vector>

int main(){
  std::vector<int> v;

  for (int i = 0; i < 8192000; i++){
    v.push_back(i);
  }

  CUDASample cudasample;

  cudasample.AddOneToVector(v);

  return 0;
}

The only difference is the NVCC flag, which for the Pascal Titan X is

-gencode arch=compute_61,code=sm_61 -std=c++11

and for the old Maxwell Titan X is

-gencode arch=compute_52,code=sm_52 -std=c++11

Here are the results from running the NVIDIA Visual Profiler.

For the old Maxwell Titan, the time for memory transfer is around 205 ms, and the kernel launch is around 268 us.

For the Pascal Titan, the time for memory transfer is around 202 ms, and the kernel launch is around an insanely long 8343 us, which makes me believe something is wrong.

I further isolated the problem by replacing cudaMallocManaged with good old cudaMalloc, did some profiling, and observed some interesting results.

CUDASample.cu

#include "CUDASample.cuh"

__global__ static void CUDAKernelAddOneToVector(int *data)
{
  const int x  = blockIdx.x * blockDim.x + threadIdx.x;
  const int y  = blockIdx.y * blockDim.y + threadIdx.y;
  const int mx = gridDim.x * blockDim.x;

  data[y * mx + x] = data[y * mx + x] + 1.0f;
}

void CUDASample::AddOneToVector(std::vector<int> &in){
  int *data;
  cudaMalloc(reinterpret_cast<void **>(&data), in.size() * sizeof(int));
  cudaMemcpy(reinterpret_cast<void*>(data),reinterpret_cast<void*>(in.data()), 
             in.size() * sizeof(int), cudaMemcpyHostToDevice);

  dim3 blks(in.size()/(16*32),1);
  dim3 threads(32, 16);

  CUDAKernelAddOneToVector<<<blks, threads>>>(data);

  cudaDeviceSynchronize();

  cudaMemcpy(reinterpret_cast<void*>(in.data()),reinterpret_cast<void*>(data), 
             in.size() * sizeof(int), cudaMemcpyDeviceToHost);

  cudaFree(data);
}

For the old Maxwell Titan, the time for memory transfer is around 5 ms both ways, and the kernel launch is around 264 us.

For the Pascal Titan, the time for memory transfer is around 5 ms both ways, and the kernel launch is around 194 us, which actually results in the performance increase I am hoping to see…
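For anyone who wants to cross-check the kernel time outside of the Visual Profiler, here is a minimal sketch that times the launch with CUDA events. It is meant to be dropped into AddOneToVector around the kernel launch (data, blks, and threads as defined above) and uses only standard CUDA runtime event calls; printf requires <cstdio>.

cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
CUDAKernelAddOneToVector<<<blks, threads>>>(data);
cudaEventRecord(stop);

cudaEventSynchronize(stop);
float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);
printf("kernel time: %.3f ms\n", ms);

cudaEventDestroy(start);
cudaEventDestroy(stop);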

Why is the Pascal GPU so slow at running CUDA kernels when cudaMallocManaged is used? It would be a travesty if I have to revert all my existing code that uses cudaMallocManaged to cudaMalloc. This experiment also shows that the memory transfer time using cudaMallocManaged is a lot slower than using cudaMalloc, which also feels like something is wrong. If using it results in a slow run time even though the code is easier, that is unacceptable, because the whole purpose of using CUDA instead of plain C++ is to speed things up. What am I doing wrong, and why am I observing this kind of result?

Just to clarify: By “new CUDA 8” you are referring to the final release of CUDA 8.0 that shipped on Wednesday 9/28/2016? And the comparison between the Maxwell Titan X and the Pascal Titan X was a controlled experiment (the exact same software and the exact same host computer were used; only the GPU was swapped)?

Yes, the final release of CUDA 8 that shipped on Wednesday. Exact same computer. Exact same code. Only the GPU swapped.

Does the relative performance (Maxwell/Pascal) change much when you make the vector a lot longer, say 100M elements? The present length should be big enough to avoid performance artifacts, but it would be useful to double-check that assumption.

If you are not currently running with the latest non-experimental driver available for your platform, I would suggest updating the driver. The nature of the issue suggests it could be driver related.

I have no hypothesis as to what could be going on here, and I don’t have the hardware to reproduce the observations. I do not see any obvious issues with your experiments. I would suggest waiting a couple days for feedback from forum participants, and if no good explanation emerges, filing a bug with NVIDIA (the bug reporting form is linked directly from the CUDA registered developer website).

There will be memory capacity issues if I allocate a memory chunk for 100M elements. I tried the largest sizes my computer can handle, and they show similar results.

The driver I am using is the latest for Linux, which is 367.44.

I suspect it has something to do with the new Unified Memory framework as described in https://devblogs.nvidia.com/parallelforall/cuda-8-features-revealed/. It doesn’t copy the whole buffer from CPU to GPU before the kernel launch; instead it tries to “cleverly” copy the data on demand, which in my experiment fails miserably.
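If that is what is happening, CUDA 8 also added cudaMemPrefetchAsync, which migrates a managed allocation to a given device in bulk before the kernel touches it. Here is a minimal, untested sketch of how it could be used in the managed-memory version of AddOneToVector above (assuming device 0 and the default stream):

int device = 0;
cudaGetDevice(&device);

// Migrate the managed buffer to the GPU up front instead of letting the
// kernel fault the pages in on demand.
cudaMemPrefetchAsync(data, in.size() * sizeof(int), device);

CUDAKernelAddOneToVector<<<blks, threads>>>(data);
cudaDeviceSynchronize();

// Bring the data back to the host before reading the results on the CPU.
cudaMemPrefetchAsync(data, in.size() * sizeof(int), cudaCpuDeviceId);
cudaDeviceSynchronize();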

I am confused that you state a vector of 100M raises “memory capacity issues”. Unless I misread the code, the vector used is a vector of 32-bit ‘int’ elements, so 100M elements would require 400 MB, which should fit easily on both the host and the device. What am I missing?

I figured out why a vector of 100M fails: I have to change the plus 1.0f in the kernel to plus 1.
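For reference, the fixed kernel just uses an integer literal. Presumably the + 1.0f version promotes the value to float, which cannot represent integers above 2^24 exactly, so the 8M-element test still works but a 100M-element run comes back with wrong values.

__global__ static void CUDAKernelAddOneToVector(int *data)
{
  const int x  = blockIdx.x * blockDim.x + threadIdx.x;
  const int y  = blockIdx.y * blockDim.y + threadIdx.y;
  const int mx = gridDim.x * blockDim.x;

  // Integer increment; the earlier + 1.0f routed the value through a float.
  data[y * mx + x] = data[y * mx + x] + 1;
}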

When testing with a 100M-element vector, I am getting a similar slowdown when using cudaMallocManaged with the Pascal GPU. It is still unclear why this is happening.