CUDA Graph: Limit on number of kernels that can be captured in CUDA Graph

Hi,

Starting from https://forums.developer.nvidia.com/t/cuda-graph-in-cuda-fortran/136290/1, I have increased the number of kernels to be recorded to a huge number, but all I am getting is a segmentation fault. What is the limit on the number of kernels that I can record?

cuda_graph_m.f90:

 module cuda_graph
  use iso_c_binding

  ! enums

  enum, bind(C) ! cudaStreamCaptureMode
     enumerator :: cudaStreamCaptureModeGlobal=0
     enumerator :: cudaStreamCaptureModeThreadLocal=1
     enumerator :: cudaStreamCaptureModeRelaxed=2
  end enum

  ! types

  type cudaGraph
    type(c_ptr) :: graph
  end type cudaGraph

  type cudaGraphExec
    type(c_ptr) :: graph_exec
  end type cudaGraphExec

  type cudaGraphNode
    type(c_ptr) :: graph_node
  end type cudaGraphNode

  ! ---------
  ! functions
  ! ---------

  !----------------------------------
  ! Additional cudaStream functions
  !----------------------------------
  interface
     integer(c_int) function cudaStreamBeginCapture(stream, mode) &
          bind(C,name='cudaStreamBeginCapture')
       use cudafor
       integer(cuda_stream_kind), value ::  stream
       integer(c_int), value :: mode
     end function cudaStreamBeginCapture
  end interface

  interface
     integer(c_int) function cudaStreamEndCapture(stream, pGraph) &
          bind(C,name='cudaStreamEndCapture')
       use cudafor
       import cudaGraph
       integer(cuda_stream_kind), value ::  stream
       type(cudaGraph) :: pGraph
     end function cudaStreamEndCapture
  end interface


  !----------------------------------
  ! new cudaGraph functions
  !----------------------------------
  interface
     integer(c_int) function cudaGraphCreate(pGraph, flags) &
          bind(C,name='cudaGraphCreate')
       use iso_c_binding
       import cudaGraph
       type(cudaGraph) :: pGraph
       integer(c_int), value :: flags
     end function cudaGraphCreate
  end interface

  interface
     integer(c_int) function cudaGraphInstantiate(pGraphExec, graph, pErrorNode, pLogBuffer, bufferSize) &
          bind(C,name='cudaGraphInstantiate')
       use cudafor
       import cudaGraph
       import cudaGraphExec
       import cudaGraphNode
       type(cudaGraphExec) :: pGraphExec
       type(cudaGraph), value :: graph
       type(cudaGraphNode) :: pErrorNode
       character(kind=C_CHAR, len=*) :: pLogBuffer
       integer(c_size_t), value :: bufferSize
     end function cudaGraphInstantiate
  end interface

  interface
     integer(c_int) function cudaGraphLaunch(graphExec, stream) &
          bind(C,name='cudaGraphLaunch')
       use cudafor
       import cudaGraphExec
       type(cudaGraphExec), value :: graphExec
       integer(cuda_stream_kind), value ::  stream
     end function cudaGraphLaunch
  end interface

end module

main_graph.f90:

program main
  use cudafor
  use cuda_graph
  use iso_c_binding
  implicit none

  integer :: i, istat
  integer(8) :: n   ! large-kind loop counter so the huge capture count below does not overflow
  integer(kind=cuda_stream_kind) :: stream
  integer, device, allocatable :: a(:)
  type(cudaGraph) :: graph
  type(cudaGraphExec) :: graph_exec
  type(cudaGraphNode) :: error_node
  character(c_char) :: buffer
  integer(c_size_t) :: buffer_len

  istat = cudaStreamCreateWithFlags(stream, cudaStreamNonBlocking)
  print*, istat
  !istat = cudaGraphCreate(graph, 0)

  allocate(a(1024))

  istat = cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal)
  print*, istat
  do n = 1, 10000000000_8   ! try to capture a huge number of kernel launches
    !$cuf kernel do <<<*,*,stream=stream>>>
    do i = 1, 1024
      a(i) = 1234
    end do
  end do
  istat = cudaStreamEndCapture(stream, graph)
  print*, istat

  buffer_len = 0
  istat = cudaGraphInstantiate(graph_exec, graph, error_node, buffer, buffer_len)
  print*, istat

  istat = cudaGraphLaunch(graph_exec, stream)
  print*, istat

  istat = cudaStreamSynchronize(stream)


end program

Hi pcheechoung89936,

Sorry, but I can’t find anything in the CUDA Graphs docs about a limit. What appears to be happening here is that each kernel launch captured into the graph uses more system memory, and eventually the OS kills the process once it exceeds a memory limit. So the limit looks to be more a factor of how much system memory is available and the OS memory limits than a limit imposed by Graphs itself.
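
One way to see how large the captured graph actually gets is to query the node count after cudaStreamEndCapture. Here is an untested sketch, written in the same style as the interfaces in your cuda_graph module; cudaGraphGetNodes is the CUDA runtime call that reports the number of nodes in a graph, and passing a null pointer for the nodes array makes it return only the count:

  interface
     integer(c_int) function cudaGraphGetNodes(graph, nodes, numNodes) &
          bind(C,name='cudaGraphGetNodes')
       use iso_c_binding
       import cudaGraph
       type(cudaGraph), value :: graph   ! graph returned by cudaStreamEndCapture
       type(c_ptr), value :: nodes       ! pass c_null_ptr to query only the count
       integer(c_size_t) :: numNodes     ! receives the number of nodes in the graph
     end function cudaGraphGetNodes
  end interface

Then, in the main program (with an extra integer(c_size_t) :: num_nodes declaration and a smaller capture loop):

  istat = cudaGraphGetNodes(graph, c_null_ptr, num_nodes)
  print*, 'captured graph nodes:', num_nodes

That would at least tell you how many kernel launches the capture gets through before memory becomes a problem.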

-Mat

Hi Mat,

I disagree. My workstation has 1 TB of memory, so I find it hard to believe that CUDA Graphs alone managed to overrun my system memory.

It seems to me that CUDA Graphs are not designed for a large number of small kernels. This defeats their initial purpose of amortizing the launch overhead of many kernels.

Regards,
Phoon

Hi Phoon,

> I disagree. My workstation has 1 TB of memory, so I find it hard to believe that CUDA Graphs alone managed to overrun my system memory.

Ok, though I’m just stating what I observed when running your program on my system. Watching “top”, the program keeps increasing memory usage until the OS kills the program with a signal 9.

> It seems to me that CUDA Graphs are not designed for a large number of small kernels. This defeats their initial purpose of amortizing the launch overhead of many kernels.

You’re speculating, as am I.
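
For what it’s worth, the pattern I’ve generally seen for amortizing launch overhead is to capture a fixed, modest batch of kernels once and then relaunch the instantiated graph as many times as needed, rather than capturing every launch into one enormous graph. A rough sketch along those lines, based on your main program (untested, and assuming an extra integer :: k loop counter):

  ! capture a moderate batch of kernel launches once
  istat = cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal)
  do k = 1, 100                       ! 100 kernels per graph instead of 1e10
    !$cuf kernel do <<<*,*,stream=stream>>>
    do i = 1, 1024
      a(i) = 1234
    end do
  end do
  istat = cudaStreamEndCapture(stream, graph)

  buffer_len = 0
  istat = cudaGraphInstantiate(graph_exec, graph, error_node, buffer, buffer_len)

  ! relaunch the instantiated graph as often as needed; each launch replays
  ! all 100 captured kernels with a single call, and only one graph is ever built
  do n = 1, 100000
    istat = cudaGraphLaunch(graph_exec, stream)
  end do
  istat = cudaStreamSynchronize(stream)

Host memory stays bounded this way because the graph is built once, while the launch overhead is still paid only once per batch of captured kernels.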

I’ll see if I can reach out to the developers of CUDA Graphs, but feel free to ask your question on the NVIDIA DevTalk forums as well (https://devtalk.nvidia.com/). CUDA Graphs has mostly been adopted by ML developers; as far as I know, you’re one of the few folks using it from Fortran.

-Mat