Warp library: what can be done with big input arrays to keep differentiability

Hello, I work on a computer vision task and I would like to ask about the statement from the documentation regarding preserving the possibility of auto-differentiation with the tape:

Kernels should not overwrite any previously used array values except to perform simple linear add/subtract operations
  1. Generally, in order to get any output from a kernel one needs to mutate array entries, and I have seen in the examples that this mutation can also include dot product operations. So what is an example of an operation one should not attempt, for instance in the context of simulating 3D particle-grid interactions?

  2. In case the array is too big to fit in GPU memory, is there some form of batching supported? I am aware that it will reduce performance, but sometimes it cannot be avoided.

OK, regarding what can and cannot be done, the tests in your repository basically answer that.

Regarding batching, I suppose the answer is no, but I can probably use PyTorch for this.

Hi @jakub.mitura14! I reached out to the devs to see if they have anything more to add.

Hi Jakub,

For auto-diff to work we generally need to preserve inputs so that the backwards pass can use them. There are some cases where we can overwrite previous values, but only when the subsequent operations are linear, e.g. doing array[i] = log(array[i]) will not work (overwriting a local variable inside the kernel, e.g. y = log(y), is of course fine).
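
For example, a minimal sketch of the two patterns (the kernel and variable names here are just placeholders):

```python
import warp as wp

wp.init()

# Differentiable pattern: read from `x`, write to a separate output array `y`.
@wp.kernel
def log_kernel(x: wp.array(dtype=float), y: wp.array(dtype=float)):
    tid = wp.tid()
    v = x[tid]
    v = wp.log(v)   # overwriting a local variable inside the kernel is fine
    y[tid] = v      # the result goes to a different array

# Non-differentiable pattern (avoid): a non-linear in-place overwrite such as
#   x[tid] = wp.log(x[tid])
# destroys values that the backward pass still needs.

n = 1024
x = wp.array([float(i + 1) for i in range(n)], dtype=float, requires_grad=True)
y = wp.zeros(n, dtype=float, requires_grad=True)

tape = wp.Tape()
with tape:
    wp.launch(log_kernel, dim=n, inputs=[x, y])
# tape.backward(...) can then propagate adjoints from y back to x
```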

We are working on some documentation improvements to make these points more clear.

Regarding batching, you may have seen we recently added multidimensional array support. This can be useful for implementing batching in general; let me know if you need something more specific.
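
For instance, a batched layout could look something like this (a minimal sketch assuming a recent Warp version with multi-dimensional arrays; the kernel and shapes are illustrative only):

```python
import warp as wp

wp.init()

# A 2D array indexed as (batch, element) lets one launch process several
# batches while keeping each batch's data in its own row.
@wp.kernel
def scale_batched(x: wp.array2d(dtype=float), s: float, y: wp.array2d(dtype=float)):
    b, i = wp.tid()
    y[b, i] = x[b, i] * s

batches, n = 4, 1 << 20
x = wp.zeros(shape=(batches, n), dtype=float, requires_grad=True)
y = wp.zeros(shape=(batches, n), dtype=float, requires_grad=True)

wp.launch(scale_batched, dim=(batches, n), inputs=[x, 2.0, y])
```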

Cheers,
Miles

Fantastic, thank you @milesmacklin for taking the time to respond! OK, regarding batching: yes, that solves the problem.
Still, something about the gradient flow is not clear to me.

Now, in order to check whether I understand it correctly, I will first add some context: I am creating a model that processes neighborhoods of a 3D MRI image.
layer1: given a neighborhood and multiple matrices that represent the parameters of the model, it returns a decision whether to mark the neighborhood
implicit GPU sync
layer2: deterministic layer; on the basis of the decision output from layer one it adjusts the labels

layer1 and layer2 would be invoked in a loop; the gradients of layer one should be calculated with respect to layer two's output. I divide the logic into two layers only to allow GPU sync.

loss function outputting a float

Now, the parameters for each voxel (= thread) are the same and jointly trained (there may be millions of voxels).

Basically all the logic would be hidden inside the kernels.

Now, when I get the gradients from the tape I would like to optimize the parameters with respect to the loss function, which takes the output of the last layer and returns a float.
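
To make the setup concrete, here is a rough sketch of what I mean (the kernels, names and shapes below are purely illustrative, not my real code):

```python
import warp as wp

wp.init()

# layer1: score each voxel from its neighborhood using shared parameters.
@wp.kernel
def layer1(image: wp.array(dtype=float), params: wp.array(dtype=float),
           scores: wp.array(dtype=float)):
    tid = wp.tid()
    scores[tid] = image[tid] * params[0] + params[1]

# layer2: deterministic rule turning scores into labels.
@wp.kernel
def layer2(scores: wp.array(dtype=float), labels: wp.array(dtype=float)):
    tid = wp.tid()
    labels[tid] = wp.max(scores[tid], 0.0)

n = 1_000_000                               # one thread per voxel
image = wp.zeros(n, dtype=float)
params = wp.array([1.0, 0.0], dtype=float, requires_grad=True)  # shared, jointly trained
scores = wp.zeros(n, dtype=float, requires_grad=True)
labels = wp.zeros(n, dtype=float, requires_grad=True)

tape = wp.Tape()
with tape:
    # in the real model this pair of launches runs in a loop
    wp.launch(layer1, dim=n, inputs=[image, params, scores])
    wp.launch(layer2, dim=n, inputs=[scores, labels])
# ... compute a scalar loss over `labels`, then tape.backward(...) for params.grad
```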

Now, this model is quite typical, but I still do not understand exactly how the gradient flows, for example:

  1. When I call the tape backward in this case, where all threads have different data and outputs but share the same parameters, will I get some accumulated (averaged) gradient that I can use for gradient descent on my parameters?
  2. I have not seen any reduction utilities in the library, and I need them in the loss function. I want to avoid large-scale atomic operations to preserve performance, so I planned to convert the wp array constituting the output of the last layer into a PyTorch tensor and then use operations like filter, sum, etc. Would that break the backpropagation flow?

Thank you!

Hi Jakub,

When you call tape backward it will replay the kernels in reverse order, and yes, it will accumulate gradients onto the model parameters for each launch. This is the same as in all backpropagation libraries. The main thing to make sure of is that you don't overwrite previous results with new ones during the tape capture, i.e. each layer should output to a new array of values before passing it to the next layer.
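
For example, something along these lines (a deliberately tiny kernel, just to show the pattern of fresh output arrays and accumulated parameter gradients):

```python
import warp as wp

wp.init()

@wp.kernel
def scale(x: wp.array(dtype=float), params: wp.array(dtype=float),
          y: wp.array(dtype=float)):
    tid = wp.tid()
    y[tid] = params[0] * x[tid]

n = 1024
x = wp.array([1.0] * n, dtype=float, requires_grad=True)
params = wp.array([2.0], dtype=float, requires_grad=True)   # shared by every thread

tape = wp.Tape()
with tape:
    for it in range(3):
        y = wp.zeros(n, dtype=float, requires_grad=True)   # fresh output array per launch
        wp.launch(scale, dim=n, inputs=[x, params, y])
        x = y                                               # next launch reads the new array

# Seed the adjoint of the final output and backpropagate; each thread's
# contribution is accumulated onto the shared params.grad.
tape.backward(grads={y: wp.array([1.0] * n, dtype=float)})
print(params.grad.numpy())
```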

For your second question, yes, we typically rely on atomic operations for reductions, which are quite performant for reasonable sizes (maybe 10^5 threads). You can definitely use other libraries to perform the parallel reduction if you like. It requires 'stitching' the gradients together; take a look at example_sim_fk_grad_torch.py for an example of how to interop gradients between frameworks.
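
A minimal sketch of that kind of reduction (a hypothetical sum-of-squares loss; the PyTorch gradient stitching itself is best taken from the referenced example):

```python
import warp as wp

wp.init()

# Reduce a per-thread quantity into a single-element loss array with atomic adds.
@wp.kernel
def loss_kernel(values: wp.array(dtype=float), loss: wp.array(dtype=float)):
    tid = wp.tid()
    wp.atomic_add(loss, 0, values[tid] * values[tid])

n = 100_000
values = wp.zeros(n, dtype=float, requires_grad=True)
loss = wp.zeros(1, dtype=float, requires_grad=True)

tape = wp.Tape()
with tape:
    wp.launch(loss_kernel, dim=n, inputs=[values, loss])
tape.backward(loss=loss)   # gradients land in values.grad

# If the reduction is done in PyTorch instead, wp.to_torch(values) gives a
# tensor sharing the same memory, but the gradients then have to be stitched
# together as in example_sim_fk_grad_torch.py.
```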

Cheers,
Miles

Thanks @milesmacklin! Regarding keeping the outputs not overwritten, in this use case that is hard to achieve. Could one instead wrap the layer in a class, store the gradients as a class variable, and then use those gradient values in backpropagation?

  1. calculate gradients during forward pass
  2. store gradients as class variable
  3. overwrite outputs
  4. use stored gradients in backpropagation

Reconsidering, I suppose this is impossible, as gradients are relative.
However, I have a case where I invoke the kernel multiple times, and both the inputs and outputs are booleans.
Hence I could compress the data using bit operations:

encode the voxel value from iteration 1 in bit 1 of an int32, from iteration two in the second bit, … and in backpropagation
I could recreate those values, so no information is lost.
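
A sketch of the bit-packing part of this idea (integer kernels only; whether the tape can then use the recreated boolean values is exactly my question):

```python
import warp as wp

wp.init()

# Pack the 0/1 decision of iteration `it` into bit `it` of a per-voxel int32.
@wp.kernel
def pack_bit(flags: wp.array(dtype=wp.int32), it: int,
             packed: wp.array(dtype=wp.int32)):
    tid = wp.tid()
    packed[tid] = packed[tid] | (flags[tid] << it)

# Recover the 0/1 decision of iteration `it` from the packed representation.
@wp.kernel
def unpack_bit(packed: wp.array(dtype=wp.int32), it: int,
               flags: wp.array(dtype=wp.int32)):
    tid = wp.tid()
    flags[tid] = (packed[tid] >> it) & 1

n = 1024
flags = wp.zeros(n, dtype=wp.int32)
packed = wp.zeros(n, dtype=wp.int32)

for it in range(32):                       # up to 32 iterations fit in one int32
    # ... kernel filling `flags` with this iteration's 0/1 decisions ...
    wp.launch(pack_bit, dim=n, inputs=[flags, it, packed])
```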

Is it possible to achieve this, @milesmacklin?