5  Optimizer

We know how to create a Model, compute its loss, and calculate gradients, but we don’t yet know how to actually update the model’s parameters using those gradients. It doesn’t matter how many parameters our model has: if we never update its weights, it is just a random number generator.

We will build SGD and Adam.

Note

This is what our folder structure currently looks like. In this chapter we will work inside a new file, babygrad/optim.py.


project/
├─ .venv/                # virtual environment 
├─ babygrad/             # source code
│   ├─ __init__.py
│   ├─ ops.py
│   ├─ tensor.py
│   └─ nn.py             
├─ examples/            
│   └─ simple_mnist.py
└─ tests/                #tests

What does it mean to use gradients to update weights of our model?

In the simple_mnist example:

logits = model.forward(x_batch)
# ...
loss = softmax_loss(logits, y_one_hot)

# zero/None the grads
for p in model.parameters():
    p.grad = None

# calculate the grads
loss.backward()

# update the weights
for p in model.parameters():
    p.data = p.data - lr * p.grad

What are we changing? We are changing the weights of the model using the gradients calculated by loss.backward().

So the whole game is finding the weights that make our model produce good outputs.

for p in model.parameters():
    p.data = p.data - lr * p.grad

p.grad tells us the direction in which the error increases, so we move the weight in the opposite direction.
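
To make that concrete, here is a toy one-dimensional version of the update, using plain Python floats just to see the arithmetic:


w = 2.0      # current weight
grad = 3.0   # d(loss)/d(w) > 0: increasing w would increase the loss
lr = 0.1     # learning rate

w = w - lr * grad   # step against the gradient
print(w)            # 1.7 -> the loss should now be slightly lower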

Updating the weights like this works, but there are smarter update rules than a plain fixed-size step.

Instead of writing the weight-update loop by hand every time, we will use an Optimizer. Let’s first define a base Optimizer class.

Before writing it, let’s ask ourselves: what should the base class contain? At minimum it should hold the model’s parameters and declare a step() method for subclasses to implement.

Anything else?

In each training iteration we calculate gradients and then, based on those gradients, update the weights.

Should we keep the old gradients for a fresh training loop?

Not really. The old gradients were specific to the old weights; once the weight update is done, they have served their purpose. If we don’t clear them, they will accumulate with the new gradients and confuse our model. So we also give the base class a zero_grad() method that clears them.

FILE : babygrad/optim.py


class Optimizer:
    """
    Base class for all optimizers.
    Example of a subclass:
        class SGD(Optimizer):
            def __init__(self, params, lr=0.01):
                super().__init__(params)
                self.lr = lr
            def step(self):
                pass
    """
    def __init__(self, params):
        """
        Args:
            params (list[Parameter])
        """
        self.params = params

    def zero_grad(self):
        """
        Sets the gradients of all parameters to None.
        """
        for p in self.params:
            p.grad = None

    def step(self):
        """
        Performs a single optimization step (e.g., updating parameters).
        """
        raise NotImplementedError

5.1 SGD

Tip: What is a learning rate?
The learning rate is a number that controls how big a step the model takes when adjusting its weights.

SGD (stochastic gradient descent) is the most straightforward optimization algorithm. It follows a simple rule: subtract the gradient (scaled by the learning rate) from the parameters. \[ \text{updated\_weights} = \text{current\_weights} - \text{learning\_rate} \times \text{gradients} \]

FILE : babygrad/optim.py

Important: Exercise 5.1

Let’s write the SGD class.


class SGD(Optimizer):
    """
    This optimizer updates parameters by taking a step in the direction
     of the negative gradient, multiplied by the learning rate.
    """
    def __init__(self, params, lr=0.01):
        
        super().__init__(params)
        self.lr = lr

    def step(self):
        """
        Performs a single optimization step.
        """
        
        # Note: the update is performed on the .data attribute because we
        # don't want `step` to become part of the computation graph.
        # Check whether param.grad is None before using it.
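
If you want to compare with your own attempt, one possible body for step (a sketch, assuming each parameter exposes NumPy-style .data and .grad attributes, as in the manual loop above) is:


    def step(self):
        for p in self.params:
            if p.grad is None:        # parameter did not receive a gradient
                continue
            # update the raw data so this operation stays out of the graph
            p.data = p.data - self.lr * p.grad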

Note

# Example usage in the training loop:
optimizer.zero_grad()
loss.backward()
optimizer.step()
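
Putting it together, the training loop from the simple_mnist example might now look roughly like this (a sketch; model, softmax_loss, x_batch, and y_one_hot come from the earlier snippet, and the batches iterable is assumed for illustration):


optimizer = SGD(model.parameters(), lr=0.01)

for x_batch, y_one_hot in batches:
    logits = model.forward(x_batch)
    loss = softmax_loss(logits, y_one_hot)

    optimizer.zero_grad()   # clear gradients from the previous iteration
    loss.backward()         # compute fresh gradients
    optimizer.step()        # update the weights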

5.2 Adam

Now we have SGD, and it works. One thing to notice is that we use the same learning rate for all parameters, and the optimizer has no memory of past updates.

But is using the same learning rate for every parameter really necessary? Are all parameters equally important? Do they all need updates of the same size?

Some parameters might need big changes, others might need tiny adjustments.

So what if we could give each parameter its own effective learning rate?

But how do we decide what effective learning rate each parameter should get?

We look at the history of the gradients. If a parameter has been jumping around wildly, we should be cautious. If it has been moving steadily in one direction, we can be more aggressive.

So we will look at the history of gradients.

If a parameter’s gradient has been consistently large, maybe we should be careful and take smaller steps.

If a parameter’s gradient has been small and stable, maybe we can be bold and take larger steps.

So we need to track two things:

  • Where are we going?
  • How confident are we? (How steady is the path?)

This is exactly what Adam does!

Adam keeps track of:

  • First moment (mean): a moving average of past gradients
  • Second moment (variance): a moving average of past squared gradients

Why squared gradients? Because they tell us about the magnitude, regardless of direction.
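
A quick numeric illustration: an oscillating gradient averages out to zero, but its square does not, so the squared average still reveals how large the steps have been.


grads = [3.0, -3.0, 3.0, -3.0]                     # wildly oscillating gradients
mean = sum(grads) / len(grads)                     # 0.0 -> direction cancels out
mean_sq = sum(g * g for g in grads) / len(grads)   # 9.0 -> magnitude is still visible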

\[ \text{moving\_average\_gradient} = \beta_1 \times \text{moving\_average\_gradient} + (1 - \beta_1) \times \text{current\_gradient} \]

\[ \text{moving\_average\_squared\_gradient} = \beta_2 \times \text{moving\_average\_squared\_gradient} + (1 - \beta_2) \times \text{current\_gradient}^2 \]

\[ \text{corrected\_gradient} = \frac{\text{moving\_average\_gradient}}{1 - \beta_1^{\text{step}}} \]

\[ \text{corrected\_squared\_gradient} = \frac{\text{moving\_average\_squared\_gradient}}{1 - \beta_2^{\text{step}}} \]

\[ \text{updated\_weights} = \text{current\_weights} - \frac{\text{learning\_rate} \times \text{corrected\_gradient}}{\sqrt{\text{corrected\_squared\_gradient}} + \text{epsilon}} \]

  • \(\beta_1\) (usually 0.9): How much we care about the recent direction.
  • \(\beta_2\) (usually 0.999): How much we care about the magnitude.
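
Before writing the class, here is a tiny standalone sketch of these formulas for a single scalar weight (plain Python with made-up gradient values), mainly to show why the bias correction matters in the very first steps:


beta1, beta2, lr, eps = 0.9, 0.999, 0.001, 1e-8
m, v, w = 0.0, 0.0, 1.0          # both moments start at zero

for t, grad in enumerate([0.5, 0.4, 0.6], start=1):    # made-up gradients
    m = beta1 * m + (1 - beta1) * grad                  # moving average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2             # moving average of squared gradients
    m_hat = m / (1 - beta1 ** t)                        # bias correction: m starts at 0,
    v_hat = v / (1 - beta2 ** t)                        # so early averages are too small
    w = w - lr * m_hat / (v_hat ** 0.5 + eps)
    print(t, m, m_hat)   # at t=1: m = 0.05 but m_hat = 0.5, the actual gradient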

Let’s build it!

FILE : babygrad/optim.py

Important: Exercise 5.2

Let’s write the Adam class.


class Adam(Optimizer):
    """
    Implements the Adam optimization algorithm.
    """
    def __init__(
        self,
        params,
        lr=0.001,
        beta1=0.9,
        beta2=0.999,
        eps=1e-8,
    ):
        """

        Args:
            params (list[Tensor])
            lr (float, optional): learning rate. Defaults to 0.001.
            beta1 (float, optional): decay rate for the first moment. Defaults to 0.9.
            beta2 (float, optional): decay rate for the second moment. Defaults to 0.999.
            eps (float, optional): small constant to avoid division by zero. Defaults to 1e-8.
        """
        super().__init__(params)
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.eps = eps
        self.t = 0   # number of optimization steps taken so far

        self.m = {}  # first moment: moving average of gradients, per parameter
        self.v = {}  # second moment: moving average of squared gradients, per parameter

    def step(self):
        
        self.t += 1
        for param in self.params:
            if param.grad is None:
                continue
            grad = param.grad
            # moving average of the gradient (first moment)
            mt = self.m.get(param, 0) * self.beta1 + (1 - self.beta1) * grad
            self.m[param] = mt
            # moving average of the squared gradient (second moment)
            vt = self.v.get(param, 0) * self.beta2 + (1 - self.beta2) * (grad ** 2)
            self.v[param] = vt


            # Your solution
            # Do an in-place update: param.data -= ....
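
For reference, one possible completion of those lines (a sketch; it assumes param.data and param.grad behave like NumPy arrays) goes inside the for loop:


            # bias-corrected moment estimates (the moments start at 0, so the
            # early averages are too small without this correction)
            m_hat = mt / (1 - self.beta1 ** self.t)
            v_hat = vt / (1 - self.beta2 ** self.t)
            # in-place update on the raw data, keeping step() out of the graph
            param.data -= self.lr * m_hat / (v_hat ** 0.5 + self.eps)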