Back-propagation Demystified [Part 2]

This is a continuation of the back-propagation demystified series. In this post I’ll cover computational graphs in PyTorch.

All the deep learning frameworks rely on the creation of computation graphs for the calculation of gradient values required for the gradient descent optimization. Generally you have to build the forward propagation graph and the framework takes care of the backward differentiation for you.

But before starting with computational graphs in PyTorch, I want to discuss static and dynamic computational graphs.

Static computational graphs:

These typically involve two phases as follows.

  • Phase 1: Define an architecture (maybe with some primitive flow control like loops and conditionals)
  • Phase 2: Run a bunch of data through it to train the model and/or make predictions

One of the advantages of static graphs is that they allow for powerful offline optimization and scheduling of the graph. The disadvantage is that handling structured or variable-sized data is awkward.

Dynamic computational graphs:

The graph is defined implicitly (e.g., using operator overloading) as the forward computation is executed.

Dynamic graphs have the advantage of being more flexible. The library is less invasive and allows interleaved construction and evaluation of the graph. The forward computation is written in your favorite programming language with all its features and algorithms. The downside is that there is little time for graph optimization, and if the graph does not change between iterations, the effort of rebuilding it is wasted.

PyTorch uses dynamic computational graphs, while TensorFlow uses static computational graphs. However, Eager Execution in TensorFlow allows for something similar to dynamic graphs: it is an imperative programming environment that evaluates operations immediately, without building graphs, so operations return concrete values instead of constructing a computational graph to run later.
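For instance, here is a minimal sketch of eager execution (assuming TensorFlow 2.x, where it is enabled by default); the operation returns a concrete value as soon as it runs:

import tensorflow as tf

a = tf.constant([[1.0, 2.0],
                 [3.0, 4.0]])
b = tf.matmul(a, a)  # evaluated immediately, no session or separate graph-building step
print(b)             # prints a concrete tf.Tensor with its values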

Computational Graphs in PyTorch

Let us begin with computational graphs in PyTorch.

At its core PyTorch provides two features:

  1. An n-dimensional Tensor, similar to a numpy array but able to run on GPUs.
  2. Automatic differentiation for building and training neural networks.

Deep learning architectures and their training involve a lot of matrix operations. A Tensor is nothing but an n-dimensional array. For people coming from a Python background, numpy should ring a bell: it is an extremely powerful and optimized library for matrix operations. However, for deep learning purposes the matrices are huge and demand enormous computational power.

A PyTorch Tensor is essentially an n-dimensional array, and the framework provides a lot of functions for operating on these Tensors. To accelerate the numerical computations, PyTorch allows Tensors to be placed on GPUs, which can provide speedups of 50x or greater. PyTorch Tensors can also keep track of a computational graph and gradients.
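As a quick sketch (the sizes here are arbitrary, and it assumes a machine that may or may not have a CUDA-capable GPU), moving a Tensor to the GPU is a one-line change:

import torch

x = torch.randn(1000, 1000)
device = "cuda" if torch.cuda.is_available() else "cpu"  # fall back to CPU if no GPU
x = x.to(device)
y = x @ x        # the matrix multiplication runs on the selected device
print(y.device)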

In PyTorch the autograd package provides automatic differentiation to automate the computation of the backward passes in neural networks. The forward pass of your network defines the computational graph; nodes in the graph are Tensors and edges are functions that produced the output Tensors from input Tensors. Back-propagation through this graph then gives the gradients.

Every Tensor in PyTorch has a flag, requires_grad, that allows for fine-grained exclusion of subgraphs from gradient computation and can increase efficiency. If x is a Tensor with x.requires_grad=True, then x.grad is another Tensor holding the gradient of x with respect to some scalar value.

import torch

x = torch.randn(3, 3)  # requires_grad=False by default
y = torch.randn(3, 3)  # requires_grad=False by default
z = torch.randn((3, 3), requires_grad=True)

a = x + y  # neither x nor y requires gradients, so a doesn't either
print(a.requires_grad)  # output: False

b = a + z  # z requires gradients, so b does too
print(b.requires_grad)  # output: True

As seen from the above example, if even a single input to an operation requires gradient, its output will also require gradient. Conversely, the output does not require gradient only when none of the inputs do.
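Related to the fine-grained exclusion of subgraphs mentioned above, here is a short sketch using torch.no_grad() and detach(), two standard ways of keeping operations out of the graph:

import torch

x = torch.randn(3, 3, requires_grad=True)

# Operations run inside torch.no_grad() are not recorded in the graph.
with torch.no_grad():
    y = x * 2
print(y.requires_grad)  # output: False

# detach() returns a Tensor that shares the data but is cut off from the graph.
z = x.detach() + 1
print(z.requires_grad)  # output: False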

Autograd under the hood:

Conceptually, autograd keeps a record of all the operations that created the data as you execute them, giving you a directed acyclic graph whose leaves are the input tensors and whose roots are the output tensors. By tracing this graph from roots to leaves, you can automatically compute the gradients using the chain rule (back-propagation).

Internally, autograd represents this graph as a graph of Function objects, which can be apply()-ed to compute the result of evaluating the graph. When computing the forward pass, autograd simultaneously performs the requested computations and builds up a graph representing the function that computes the gradients (the .grad_fn attribute of each torch.Tensor is an entry point into this graph). When the forward pass has completed, this graph is evaluated in the backward pass to compute the gradients.

As discussed earlier the computational graphs in PyTorch are dynamic and thus are recreated from scratch at every iteration, and this is exactly what allows for using arbitrary Python control flow statements that can change the overall shape and size of the graph at every iteration. You don’t have to encode all possible paths before you launch the training – what you run is what you differentiate.
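As a small sketch of this idea (the random loop depth is purely for illustration), ordinary Python control flow decides the shape of the graph on each run, and backward() differentiates exactly the path that was executed:

import torch

w = torch.randn(1, requires_grad=True)
x = torch.randn(1)

h = x * w
depth = torch.randint(1, 4, (1,)).item()  # graph depth chosen at run time
for _ in range(depth):
    h = torch.tanh(h * w)

h.backward()   # differentiates the graph that was actually built this run
print(w.grad)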

Every primitive autograd operator is really a pair of functions that operate on Tensors. The forward function computes output Tensors from input Tensors. The backward function receives the gradient of the output Tensors with respect to some scalar and computes the gradient of the input Tensors with respect to that same scalar.
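As a rough sketch of such a pair, here is a toy operator that squares its input, written as a subclass of torch.autograd.Function (the class name Square is just an illustrative choice):

import torch

class Square(torch.autograd.Function):
    # Forward computes x**2; backward applies the chain rule.

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)    # stash what backward will need
        return x * x

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        return grad_output * 2 * x  # d(x^2)/dx = 2x, scaled by the incoming gradient

x = torch.tensor([3.0], requires_grad=True)
y = Square.apply(x)  # custom operators are invoked through apply()
y.backward()
print(x.grad)        # output: tensor([6.])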

To summarize, Tensor and Function are interconnected and build up an acyclic graph that encodes a complete history of the computation. Each Tensor has a .grad_fn attribute that references the Function that created it (except for Tensors created by the user, whose grad_fn is None). If you want to compute the derivatives, you can call .backward() on a Tensor. After the call to backward(), the gradient values are stored as tensors in the .grad attribute.

These concepts can be represented by the following diagram.

For example, if you create two Tensors a and b followed by c = a/b, the grad_fn of c will be DivBackward, the backward function for the / operator. As discussed earlier, a collection of these grad_fn nodes makes up the backward graph. The forward and backward functions are members of torch.autograd.Function, and you can define your own autograd operator by defining a subclass of torch.autograd.Function.
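A quick way to see this is to print the grad_fn attribute directly (the exact name shown, e.g. DivBackward0, can vary slightly between PyTorch versions):

import torch

a = torch.tensor([6.0], requires_grad=True)
b = torch.tensor([2.0], requires_grad=True)
c = a / b

print(c.grad_fn)                 # e.g. <DivBackward0 object at 0x...>
print(c.grad_fn.next_functions)  # links back to the graph nodes for a and b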

is_leaf: All Tensors that have requires_grad set to False are leaf Tensors by convention. Tensors that have requires_grad set to True are leaf Tensors only if they were created by the user, meaning they are not the result of an operation and so their grad_fn is None. Only leaf Tensors have their grad populated during a call to backward(). To get grad populated for non-leaf Tensors, you can use retain_grad().
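A small sketch illustrating is_leaf and retain_grad():

import torch

a = torch.tensor([2.0], requires_grad=True)  # created by the user -> leaf
b = a * 3                                    # result of an operation -> non-leaf
b.retain_grad()                              # ask autograd to keep b.grad as well

print(a.is_leaf, b.is_leaf)  # output: True False
b.backward()
print(a.grad)  # populated because a is a leaf: tensor([3.])
print(b.grad)  # populated only because retain_grad() was called: tensor([1.])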

Let us construct the computational graph example used in part 1 of the post and use PyTorch to compute the gradients.

import torch
from IPython.display import display, Math
# Define the graph: a, b, c, d are leaf nodes and e is the root node.
# The graph is constructed line by line, since computational
# graphs in PyTorch are dynamic.

a = torch.tensor([2.0], requires_grad=True)
b = torch.tensor([3.0], requires_grad=True)
c = torch.tensor([5.0], requires_grad=True)
d = torch.tensor([10.0], requires_grad=True)
u = a * b
t = torch.log(d)
v = t * c
t.retain_grad()  # keep the gradient of the non-leaf node t after backward()
e = u + v

The above code constructs the computational graph in PyTorch. Let us look at a few properties of the nodes in the graph.

The leaf nodes do not have a grad_fn but will receive gradients; non-leaf nodes have a grad_fn but do not get their grad populated by default. Before backward() is called, no grad values exist.
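Continuing directly from the graph built above, a few of these properties can be inspected as follows (the exact repr of grad_fn may differ between PyTorch versions):

# Before backward() is called
print(a.is_leaf, a.grad_fn, a.grad)  # True  None  None
print(u.is_leaf, u.grad_fn)          # False <MulBackward0 object at 0x...>
print(e.is_leaf, e.grad_fn)          # False <AddBackward0 object at 0x...>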

The gradients that we derived analytically in the previous post are now computed with PyTorch and shown below.

e.backward()
display(Math(fr'\frac{{\partial e}}{{\partial a}} = {a.grad.item()}'))
print()
display(Math(fr'\frac{{\partial e}}{{\partial b}} = {b.grad.item()}'))
print()
display(Math(fr'\frac{{\partial e}}{{\partial c}} = {c.grad.item()}'))
print()
display(Math(fr'\frac{{\partial e}}{{\partial d}} = {d.grad.item()}'))

The properties of the nodes after the backward() call are shown below.
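Equivalently, the raw gradient values can be printed straight from the grad attributes (continuing the same session; t.grad is available only because retain_grad() was called on it):

print(a.grad)  # tensor([3.]),     de/da = b
print(b.grad)  # tensor([2.]),     de/db = a
print(c.grad)  # tensor([2.3026]), de/dc = t = log(d)
print(d.grad)  # tensor([0.5000]), de/dd = c/d
print(t.grad)  # tensor([5.]),     de/dt = c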

As you can see, once the graph is built, calculating the gradients in PyTorch is a piece of cake; the framework takes care of the differentiation for you. The Jupyter notebook for this tutorial can be found at: https://github.com/msminhas93/ComputationalGraphs

This completes the discussion of computational graphs in PyTorch. In the next post I’ll be discussing computational graphs in TensorFlow. If you liked the post, please do like, share and subscribe.
