Need Assistance Optimizing CUDA Execution for SiamRPN Tracker Code

Hello NVIDIA Community,

I’m currently working on reducing the execution time of a SiamRPN tracker by using CUDA and GPU acceleration on a Jetson device. I’ve rewritten sections of the code that previously ran on the CPU with NumPy so that they now run on CUDA, aiming to harness the power of GPU processing.

However, after these modifications, I haven’t observed any significant improvement in execution time. The changes seem to affect only the execution time of individual parts of the code; the total time remains unchanged.
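
One possible explanation, assuming the model runs through PyTorch: CUDA work is queued asynchronously, so a plain time.time() pair around a GPU call can attribute its cost to whichever later line first blocks (for example a .cpu() copy). That would make individual stages look faster or slower while the total stays put. The sketch below is a small timing helper that waits for pending work before each clock read; the `timed` name, the `results` dictionary, and the `sync` argument are illustrative, not part of the tracker code.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results, sync=None):
    """Store the wall-clock seconds of the enclosed block in results[label].

    sync is an optional zero-argument callable invoked before each clock
    read. Passing torch.cuda.synchronize makes the timer wait until queued
    GPU work has finished, so the time lands on the stage that caused it.
    """
    if sync is not None:
        sync()
    start = time.perf_counter()
    try:
        yield
    finally:
        if sync is not None:
            sync()
        results[label] = time.perf_counter() - start


# CPU-only demonstration with stand-in workloads.
timings = {}
with timed("total", timings):
    with timed("stage_a", timings):
        sum(range(100_000))
    with timed("stage_b", timings):
        sorted(range(100_000), reverse=True)

for name, seconds in timings.items():
    print(f"{name}: {seconds * 1000:.3f} ms")
```

In the tracker this would wrap the crop, the self.model.track(x_crop) call, and the post-processing separately, with sync=torch.cuda.synchronize, to see where the per-frame time actually goes.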

Below is a snippet of the code in question:

def _convert_bbox(self, delta, anchor):
    delta = delta.permute(1, 2, 3, 0).contiguous().view(4, -1)
    delta = delta.detach().cpu().numpy()

    delta[0, :] = delta[0, :] * anchor[:, 2] + anchor[:, 0]
    delta[1, :] = delta[1, :] * anchor[:, 3] + anchor[:, 1]
    delta[2, :] = np.exp(delta[2, :]) * anchor[:, 2]
    delta[3, :] = np.exp(delta[3, :]) * anchor[:, 3]
    return delta

def _convert_score(self, score):
    score = score.permute(1, 2, 3, 0).contiguous().view(2, -1).permute(1, 0)
    score = F.softmax(score, dim=1)[:, 1].detach().cpu().numpy()
    return score

def _bbox_clip(self, cx, cy, width, height, boundary):
    cx = max(0, min(cx, boundary[1]))
    cy = max(0, min(cy, boundary[0]))
    width = max(10, min(width, boundary[1]))
    height = max(10, min(height, boundary[0]))
    return cx, cy, width, height
    
def track(self, img):
    """
    args:
        img(np.ndarray): BGR image
    return:
        bbox(list):[x, y, width, height]
    """
    w_z = self.size[0] + cfg.TRACK.CONTEXT_AMOUNT * np.sum(self.size)
    h_z = self.size[1] + cfg.TRACK.CONTEXT_AMOUNT * np.sum(self.size)
    s_z = np.sqrt(w_z * h_z)
    scale_z = cfg.TRACK.EXEMPLAR_SIZE / s_z
    s_x = s_z * (cfg.TRACK.INSTANCE_SIZE / cfg.TRACK.EXEMPLAR_SIZE)
    
    #print("Tracking...")
    
    # x_crop = self.get_subwindow(img, self.center_pos,
    #                             cfg.TRACK.INSTANCE_SIZE,
    #                             round(s_x), self.channel_average)
    a = time.time()
    x_crop = get_subwindow_tracking_(img, self.center_pos,
                                cfg.TRACK.INSTANCE_SIZE,
                                round(s_x), self.channel_average).cuda().unsqueeze(0)
    b = time.time()
    print("get_subwindow_tracking time: " + str(b - a))
    outputs = self.model.track(x_crop)

    score = self._convert_score(outputs['cls'])
    pred_bbox = self._convert_bbox(outputs['loc'], self.anchors)

    def change(r):
        return np.maximum(r, 1. / r)

    def sz(w, h):
        pad = (w + h) * 0.5
        return np.sqrt((w + pad) * (h + pad))

    # scale penalty
    s_c = change(sz(pred_bbox[2, :], pred_bbox[3, :]) /
                 (sz(self.size[0]*scale_z, self.size[1]*scale_z)))

    # aspect ratio penalty
    r_c = change((self.size[0]/self.size[1]) /
                 (pred_bbox[2, :]/pred_bbox[3, :]))
    penalty = np.exp(-(r_c * s_c - 1) * cfg.TRACK.PENALTY_K)
    pscore = penalty * score

    # window penalty
    pscore = pscore * (1 - cfg.TRACK.WINDOW_INFLUENCE) + \
        self.window * cfg.TRACK.WINDOW_INFLUENCE
    best_idx = np.argmax(pscore)

    bbox = pred_bbox[:, best_idx] / scale_z
    lr = penalty[best_idx] * score[best_idx] * cfg.TRACK.LR

    cx = bbox[0] + self.center_pos[0]
    cy = bbox[1] + self.center_pos[1]

    # smooth bbox
    width = self.size[0] * (1 - lr) + bbox[2] * lr
    height = self.size[1] * (1 - lr) + bbox[3] * lr

    # clip boundary
    cx, cy, width, height = self._bbox_clip(cx, cy, width,
                                            height, img.shape[:2])

    # update state
    self.center_pos = np.array([cx, cy])
    self.size = np.array([width, height])

    bbox = [cx - width / 2,
            cy - height / 2,
            width,
            height]
    best_score = score[best_idx]
    return {
            'bbox': bbox,
            'best_score': best_score
           }

My goal is to run these specific sections of the code on CUDA for faster execution. Could you please guide me on how to effectively utilize CUDA for these computations?
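
As a rough sketch of what keeping this post-processing on the GPU could look like, the functions below mirror _convert_bbox and _convert_score using only torch operations, so nothing is copied to the host until the best box is selected. The names convert_bbox_torch and convert_score_torch, the demo sizes, and the assumption that the anchors are already a tensor on the same device are all illustrative and not taken from the tracker code.

```python
import torch
import torch.nn.functional as F


def convert_bbox_torch(delta, anchor):
    """Decode box offsets without leaving the device.

    delta:  (1, 4*K, H, W) regression output of the network.
    anchor: (K*H*W, 4) tensor of [cx, cy, w, h] on the same device as delta
            (assumption: the anchors were converted to a tensor once, up front).
    """
    delta = delta.permute(1, 2, 3, 0).contiguous().view(4, -1)
    cx = delta[0] * anchor[:, 2] + anchor[:, 0]
    cy = delta[1] * anchor[:, 3] + anchor[:, 1]
    w = torch.exp(delta[2]) * anchor[:, 2]
    h = torch.exp(delta[3]) * anchor[:, 3]
    return torch.stack([cx, cy, w, h])  # shape (4, K*H*W)


def convert_score_torch(score):
    """Return the foreground probability per anchor as a device tensor."""
    score = score.permute(1, 2, 3, 0).contiguous().view(2, -1).permute(1, 0)
    return F.softmax(score, dim=1)[:, 1]


# Demo with hypothetical sizes; falls back to the CPU when no GPU is present.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
K, H, W = 5, 3, 3
with torch.no_grad():
    loc = torch.zeros(1, 4 * K, H, W, device=device)
    cls = torch.zeros(1, 2 * K, H, W, device=device)
    anchors = torch.rand(K * H * W, 4, device=device) + 1.0
    pred_bbox = convert_bbox_torch(loc, anchors)
    score = convert_score_torch(cls)
    best_idx = int(torch.argmax(score))                # one scalar goes to the host
    best_box = pred_bbox[:, best_idx].cpu().tolist()   # four floats go to the host
```

The penalty and window terms in track() would need the same treatment (torch.maximum, torch.sqrt, torch.exp, and self.window stored as a device tensor). With only a few thousand anchors this arithmetic is small, so the gain may be modest compared with the network forward pass and the host-to-device copy of the crop.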

Thank you in advance for your assistance.

Best regards,

Hi,

Could you share more about how you modified the code for the GPU implementation?
Do you use CuPy or other GPU-based libraries?

Thanks.
