Warp library: what can be done with big input arrays to keep differentiability

Hello, I work on a computer vision task and I would like to ask about the statement from the documentation regarding preserving the possibility of auto-differentiation with the tape:

Kernels should not overwrite any previously used array values except to perform simple linear add/subtract operations
  1. Generally, in order to get any output from a kernel one needs to mutate array entries, and I have seen in the examples that this mutation can also include dot product operations. So what is an example of an operation one should not attempt, for instance in the context of simulating 3D particle-grid interactions?

  2. In case the array is too big to fit in GPU memory, is there some form of batching supported? I am aware that it will reduce performance, but sometimes it cannot be avoided.

OK, regarding what can and cannot be done, the tests in your repository basically answer that.

Regarding batching, I suppose the answer is no, but I can probably use PyTorch for this.

Hi @jakub.mitura14! I reached out to the devs to see if they have anything more to add.

Hi Jakub,

For auto-diff to work we generally need to preserve inputs so that the backwards pass can use them. There are some cases where we can overwrite previous values, but only when the subsequent operations are linear, e.g. doing array[i] = log(array[i]) will not work (overwriting a local variable inside the kernel, e.g. y = log(y), is of course fine).
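
For example, a minimal sketch of the two patterns (the kernel and variable names here are just placeholders):

```python
import warp as wp

wp.init()

# Differentiable pattern: read from `x`, write to a separate output array `y`.
@wp.kernel
def log_kernel(x: wp.array(dtype=float), y: wp.array(dtype=float)):
    tid = wp.tid()
    v = x[tid]
    v = wp.log(v)   # overwriting a local variable inside the kernel is fine
    y[tid] = v      # the result goes to a different array

# Non-differentiable pattern (avoid): a non-linear in-place overwrite such as
#   x[tid] = wp.log(x[tid])
# destroys values that the backward pass still needs.

n = 1024
x = wp.array([float(i + 1) for i in range(n)], dtype=float, requires_grad=True)
y = wp.zeros(n, dtype=float, requires_grad=True)

tape = wp.Tape()
with tape:
    wp.launch(log_kernel, dim=n, inputs=[x, y])
# tape.backward(...) can then propagate adjoints from y back to x
```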

We are working on some documentation improvements to make these points more clear.

Regarding batching, you may have seen we recently added multidimensional array support. This can be useful for implementing batching in general; let me know if you need something more specific.
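
For instance, a batched layout could look something like this (a minimal sketch assuming a recent Warp version with multi-dimensional arrays; the kernel and shapes are illustrative only):

```python
import warp as wp

wp.init()

# A 2D array indexed as (batch, element) lets one launch process several
# batches while keeping each batch's data in its own row.
@wp.kernel
def scale_batched(x: wp.array2d(dtype=float), s: float, y: wp.array2d(dtype=float)):
    b, i = wp.tid()
    y[b, i] = x[b, i] * s

batches, n = 4, 1 << 20
x = wp.zeros(shape=(batches, n), dtype=float, requires_grad=True)
y = wp.zeros(shape=(batches, n), dtype=float, requires_grad=True)

wp.launch(scale_batched, dim=(batches, n), inputs=[x, 2.0, y])
```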

Cheers,
Miles

Fantastic, thank you @milesmacklin for taking the time to respond! OK, regarding batching: yes, that solves the problem.
Still, something about the gradient flow is not clear to me.

Now, in order to check whether I understand it correctly, I will first add some context: I am creating a model that processes neighborhoods of a 3D MRI image.
layer1: given a neighborhood and multiple matrices that represent the parameters of the model, it returns a decision whether to mark the neighborhood
implicit GPU sync
layer2: deterministic layer; on the basis of the decision output from layer one it adjusts the labels

layer1 and layer2 would be invoked in a loop; the gradients of layer one should be calculated with respect to layer two's output. I divide the logic into two layers only to allow GPU sync.

loss function outputting a float

Now, the parameters for each voxel (= thread) are the same and jointly trained (there may be millions of voxels).

Basically all the logic would be hidden inside the kernels.

Now, when I get the gradients from the tape I would like to optimize the parameters with respect to the loss function, which takes the output of the last layer and returns a float.
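
To make the setup concrete, here is a rough sketch of what I mean (the kernels, names and shapes below are purely illustrative, not my real code):

```python
import warp as wp

wp.init()

# layer1: score each voxel from its neighborhood using shared parameters.
@wp.kernel
def layer1(image: wp.array(dtype=float), params: wp.array(dtype=float),
           scores: wp.array(dtype=float)):
    tid = wp.tid()
    scores[tid] = image[tid] * params[0] + params[1]

# layer2: deterministic rule turning scores into labels.
@wp.kernel
def layer2(scores: wp.array(dtype=float), labels: wp.array(dtype=float)):
    tid = wp.tid()
    labels[tid] = wp.max(scores[tid], 0.0)

n = 1_000_000                               # one thread per voxel
image = wp.zeros(n, dtype=float)
params = wp.array([1.0, 0.0], dtype=float, requires_grad=True)  # shared, jointly trained
scores = wp.zeros(n, dtype=float, requires_grad=True)
labels = wp.zeros(n, dtype=float, requires_grad=True)

tape = wp.Tape()
with tape:
    # in the real model this pair of launches runs in a loop
    wp.launch(layer1, dim=n, inputs=[image, params, scores])
    wp.launch(layer2, dim=n, inputs=[scores, labels])
# ... compute a scalar loss over `labels`, then tape.backward(...) for params.grad
```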

Now, this model is quite typical, but I still do not understand exactly how the gradient flows, for example:

  1. When I call the tape backward in this case, where all threads have different data and outputs but share the same parameters, will I get some accumulated (averaged) gradient that I can use for gradient descent on my parameters?
  2. I have not seen any reduction utilities in the library, and I need them in the loss function. I want to avoid large-scale atomic operations to preserve performance, so I planned to convert the wp array constituting the output of the last layer into a PyTorch tensor and then use operations like filter, sum, etc. Would that break the backpropagation flow?

Thank you!

Hi Jakub,

When you call tape backward it will replay the kernels in reverse order, and yes, it will accumulate gradients onto the model parameters for each launch. This is the same as in all backpropagation libraries. The main thing to make sure of is that you don't overwrite previous results with new ones during the tape capture, i.e. each layer should output to a new array of values before passing it to the next layer.
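
For example, something along these lines (a deliberately tiny kernel, just to show the pattern of fresh output arrays and accumulated parameter gradients):

```python
import warp as wp

wp.init()

@wp.kernel
def scale(x: wp.array(dtype=float), params: wp.array(dtype=float),
          y: wp.array(dtype=float)):
    tid = wp.tid()
    y[tid] = params[0] * x[tid]

n = 1024
x = wp.array([1.0] * n, dtype=float, requires_grad=True)
params = wp.array([2.0], dtype=float, requires_grad=True)   # shared by every thread

tape = wp.Tape()
with tape:
    for it in range(3):
        y = wp.zeros(n, dtype=float, requires_grad=True)   # fresh output array per launch
        wp.launch(scale, dim=n, inputs=[x, params, y])
        x = y                                               # next launch reads the new array

# Seed the adjoint of the final output and backpropagate; each thread's
# contribution is accumulated onto the shared params.grad.
tape.backward(grads={y: wp.array([1.0] * n, dtype=float)})
print(params.grad.numpy())
```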

For your second question, yes, we typically rely on atomic operations for reductions, which are quite performant for reasonable sizes (maybe 10^5 threads). You can definitely use other libraries to perform the parallel reduction if you like. It requires 'stitching' the gradients together; take a look at example_sim_fk_grad_torch.py for an example of how to interop gradients between frameworks.
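
A minimal sketch of that kind of reduction (a hypothetical sum-of-squares loss; the PyTorch gradient stitching itself is best taken from the referenced example):

```python
import warp as wp

wp.init()

# Reduce a per-thread quantity into a single-element loss array with atomic adds.
@wp.kernel
def loss_kernel(values: wp.array(dtype=float), loss: wp.array(dtype=float)):
    tid = wp.tid()
    wp.atomic_add(loss, 0, values[tid] * values[tid])

n = 100_000
values = wp.zeros(n, dtype=float, requires_grad=True)
loss = wp.zeros(1, dtype=float, requires_grad=True)

tape = wp.Tape()
with tape:
    wp.launch(loss_kernel, dim=n, inputs=[values, loss])
tape.backward(loss=loss)   # gradients land in values.grad

# If the reduction is done in PyTorch instead, wp.to_torch(values) gives a
# tensor sharing the same memory, but the gradients then have to be stitched
# together as in example_sim_fk_grad_torch.py.
```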

Cheers,
Miles

Thanks @milesmacklin! Regarding keeping the outputs not overwritten, in this use case that is hard to achieve. Could one instead wrap the layer in a class, store the gradients as a class variable, and then use those gradient values in backpropagation?

  1. calculate gradients during forward pass
  2. store gradients as class variable
  3. overwrite outputs
  4. use stored gradients in backpropagation

Reconsidering, I suppose this is impossible, as gradients are relative.
However, I have a case where I invoke the kernel multiple times, and both the inputs and outputs are booleans.
Hence I could compress the data using bit operations:

encode the voxel value from iteration 1 in bit 1 of an int32, from iteration two in the second bit, … and in backpropagation
I could recreate those values, so no information is lost.
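
A sketch of the bit-packing part of this idea (integer kernels only; whether the tape can then use the recreated boolean values is exactly my question):

```python
import warp as wp

wp.init()

# Pack the 0/1 decision of iteration `it` into bit `it` of a per-voxel int32.
@wp.kernel
def pack_bit(flags: wp.array(dtype=wp.int32), it: int,
             packed: wp.array(dtype=wp.int32)):
    tid = wp.tid()
    packed[tid] = packed[tid] | (flags[tid] << it)

# Recover the 0/1 decision of iteration `it` from the packed representation.
@wp.kernel
def unpack_bit(packed: wp.array(dtype=wp.int32), it: int,
               flags: wp.array(dtype=wp.int32)):
    tid = wp.tid()
    flags[tid] = (packed[tid] >> it) & 1

n = 1024
flags = wp.zeros(n, dtype=wp.int32)
packed = wp.zeros(n, dtype=wp.int32)

for it in range(32):                       # up to 32 iterations fit in one int32
    # ... kernel filling `flags` with this iteration's 0/1 decisions ...
    wp.launch(pack_bit, dim=n, inputs=[flags, it, packed])
```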

Is it possible to achieve this, @milesmacklin?