deviceId/contextId fields in CUpti_ActivityMemcpyPtoP4 struct for device to device transfers

Heya,

I’m in the process of rewriting the CUDA parts of our application for newer CUDA versions to support features added since CUDA 7.0 ( which basically means: A lot ). Right now, I’m looking into how memory transfers are handled by CUPTI and how tools can work with this.

Here’s what I’ve noticed:

If I try to transfer data between two devices via CUDA and don’t have Peer-2-Peer transfers enabled, I’ll get activity records with the kind CUPTI_ACTIVITY_KIND_MEMCPY, indicating D2H->H2D. With this, we can recreate the data transfer completely.

However, recreating actual P2P transfers seem to be a bit more complicated. The struct being passed is shown below.

/**
 * \brief The activity record for peer-to-peer memory copies.
 *
 * This activity record represents a peer-to-peer memory copy
 * (CUPTI_ACTIVITY_KIND_MEMCPY2).
 */
typedef struct PACKED_ALIGNMENT {
  /**
   * The activity record kind, must be CUPTI_ACTIVITY_KIND_MEMCPY2.
   */
  CUpti_ActivityKind kind;

  /**
   * The kind of the memory copy, stored as a byte to reduce record
   * size.  \see CUpti_ActivityMemcpyKind
   */
  uint8_t copyKind;

  /**
   * The source memory kind read by the memory copy, stored as a byte
   * to reduce record size.  \see CUpti_ActivityMemoryKind
   */
  uint8_t srcKind;

  /**
   * The destination memory kind read by the memory copy, stored as a
   * byte to reduce record size.  \see CUpti_ActivityMemoryKind
   */
  uint8_t dstKind;

  /**
   * The flags associated with the memory copy. \see
   * CUpti_ActivityFlag
   */
  uint8_t flags;

  /**
   * The number of bytes transferred by the memory copy.
   */
  uint64_t bytes;

  /**
   * The start timestamp for the memory copy, in ns. A value of 0 for
   * both the start and end timestamps indicates that timestamp
   * information could not be collected for the memory copy.
   */
  uint64_t start;

  /**
   * The end timestamp for the memory copy, in ns. A value of 0 for
   * both the start and end timestamps indicates that timestamp
   * information could not be collected for the memory copy.
   */
  uint64_t end;

  /**
  * The ID of the device where the memory copy is occurring.
  */
  uint32_t deviceId;

  /**
   * The ID of the context where the memory copy is occurring.
   */
  uint32_t contextId;

  /**
   * The ID of the stream where the memory copy is occurring.
   */
  uint32_t streamId;

  /**
   * The ID of the device where memory is being copied from.
   */
  uint32_t srcDeviceId;

  /**
   * The ID of the context owning the memory being copied from.
   */
  uint32_t srcContextId;

  /**
   * The ID of the device where memory is being copied to.
   */
  uint32_t dstDeviceId;

  /**
   * The ID of the context owning the memory being copied to.
   */
  uint32_t dstContextId;

  /**
   * The correlation ID of the memory copy. Each memory copy is
   * assigned a unique correlation ID that is identical to the
   * correlation ID in the driver and runtime API activity record that
   * launched the memory copy.
   */
  uint32_t correlationId;

#ifndef CUPTILP64
  /**
   * Undefined. Reserved for internal use.
   */
  uint32_t pad;
#endif

  /**
   * Undefined. Reserved for internal use.
   */
  void *reserved0;

  /**
   * The unique ID of the graph node that executed the memcpy through graph launch.
   * This field will be 0 if memcpy is not done using graph launch.
   */
  uint64_t graphNodeId;

  /**
   * The unique ID of the graph that executed this memcpy through graph launch.
   * This field will be 0 if the memcpy is not done through graph launch.
   */
  uint32_t graphId;

  /**
   * The ID of the HW channel on which the memory copy is occurring.
   */
  uint32_t channelID;

  /**
   * The type of the channel
   */
  CUpti_ChannelType channelType;
} CUpti_ActivityMemcpyPtoP4;

There’s a lot of information stored in this struct. Interestingly, there are several fields for devices and contexts:

  • deviceId / srcDeviceId / dstDeviceId
  • contextId / srcContextId / dstContextId

while there is only a single field for the stream

  • streamId

Looking at this struct, I’m asking myself if it is possible to have deviceId != srcDeviceId && deviceId != dstDeviceId (the same for contexts), i.e. some other device triggers a transfer between two devices. If that would be the case, I would have to figure out some way to handle this in our application, since we write everything to locations (in this case represented by CUDA streams) and cannot simply write to any stream, since all events per stream need to be written in ascending timestamp order during program execution.

Thanks a lot,
Jan