5 Optimizer
We know how to create a Model, compute its loss, and calculate gradients, but we don't yet know how to actually update the model's parameters using those gradients. It doesn't matter how many parameters our model has: if we never update its weights, it is just a random number generator.
We will build SGD and Adam.
This is how our folder structure currently looks. In this chapter we will work inside babygrad/optim.py.
project/
├─ .venv/ # virtual environment
├─ babygrad/ # source code
│ ├─ __init__.py
│ ├─ ops.py
│ ├─ tensor.py
│ └─ nn.py
├─ examples/
│ └─ simple_mnist.py
└─ tests/ # tests
What does it mean to use gradients to update the weights of our model?
In the simple_mnist example:
logits = model.forward(x_batch)
# compute the loss
loss = softmax_loss(logits, y_one_hot)
# zero/None the grads
for p in model.parameters():
    p.grad = None
# calculate the grads
loss.backward()
# update the weights
for p in model.parameters():
    p.data = p.data - lr * p.grad
What are we changing? We are changing the weights of the model, using the grad calculated by loss.backward().
So the whole game is finding the weights that make our model produce good outputs.
for p in model.parameters():
    p.data = p.data - lr * p.grad
p.grad tells us the direction in which the error increases, so we step in the opposite direction.
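To make this concrete, here is a tiny standalone sketch with a single made-up weight; the function f(w) = w**2 and all the numbers are just for illustration, not part of babygrad:
w = 3.0            # one weight; pretend the loss is f(w) = w**2
lr = 0.1
grad = 2 * w       # df/dw = 2w = 6.0; the loss increases in this direction
w = w - lr * grad  # step opposite the gradient: w becomes 2.4
# the loss f(w) dropped from 9.0 to 5.76 after a single step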
Changing the weights like this works, but there are smarter ways to decide how each weight should change.
Instead of writing the weight-update loop by hand every time, we will use an Optimizer. Let's first define a base Optimizer class.
Before writing it, let's ask ourselves: what should the base class contain?
- Initialization of parameters.
- Updating the parameters.
Anything else?
In each training iteration we calculate gradients and then update the weights based on those gradients.
Should we keep the old gradients for a fresh training loop?
Not really. The old gradients were specific to the old weights. Once the weight update is done, they have served their purpose. If we don’t clear them, they will accumulate with the new gradients, confusing our model. So we
- Zero the gradients at the start of every iteration.
FILE : babygrad/optim.py
class Optimizer:
"""
Base class for all optimizers.
Example of a subclass:
class SGD(Optimizer):
def __init__(self, params, lr=0.01):
super().__init__(params)
self.lr = lr
def step(self):
pass
"""
def __init__(self, params):
"""
Args:
params (list[Parameter])
"""
self.params = params
def zero_grad(self):
"""
Sets the gradients of all parameters to None.
"""
for p in self.params:
p.grad = None
def step(self):
"""
Performs a single optimization step (e.g., updating parameters).
"""
        raise NotImplementedError
5.1 SGD
The learning rate is a number that controls how big a step the model takes when adjusting its weights.
This is the most straightforward optimization algorithm. It follows the simple update: subtract the gradient (scaled by a learning rate) from the parameters. \[ \text{updated\_weights} = \text{current\_weights} - \text{learning\_rate} \times \text{gradients} \]
FILE : babygrad/optim.py
Lets write the SGD class.
class SGD(Optimizer):
"""
This optimizer updates parameters by taking a step in the direction
    of the negative gradient, multiplied by the learning rate.
"""
def __init__(self, params, lr=0.01):
super().__init__(params)
self.lr = lr
def step(self):
"""
Performs a single optimization step.
"""
        # Note: the update is performed on the .data attribute;
        # we don't want `step` to be part of the computation graph.
        # Check whether param.grad is None before using it.
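Try writing the body yourself first. For reference, here is one possible implementation, a minimal sketch that follows the update rule above:
    def step(self):
        for p in self.params:
            # Skip parameters that did not receive a gradient.
            if p.grad is None:
                continue
            # Update .data directly so `step` stays out of the graph.
            p.data = p.data - self.lr * p.grad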
Example usage in the training loop:
optimizer.zero_grad()
loss.backward()
optimizer.step()
5.2 Adam
Now we have SGD, and it works. One thing to observe: it uses the same learning rate for every parameter, and it has no memory of past updates.
What if using the same learning rate for all parameters is not really necessary? Are all parameters equally important? Do they all need the same size updates?
Some parameters might need big changes, others might need tiny adjustments.
So what if we could give each parameter its own effective learning rate?
But how do we decide what effective learning rate each parameter should get?
We look at the history of the gradients. If a parameter has been jumping around wildly, we should be cautious. If it has been moving steadily in one direction, we can be more aggressive.
If a parameter's gradient has been consistently large, maybe we should be careful and take smaller steps.
If a parameter's gradient has been small and stable, maybe we can be bold and take larger steps.
So we need to track two things:
- Where are we going?
- How confident are we? (How steady is the path?)
This is exactly what Adam does!
Adam keeps track of:
- First moment (mean): a running average of past gradients.
- Second moment (uncentered variance): a running average of past squared gradients.
Why squared gradients? Because they tell us about the magnitude, regardless of direction.
\[ \text{moving\_average\_gradient} = \beta_1 \times \text{moving\_average\_gradient} + (1 - \beta_1) \times \text{current\_gradient} \]
\[ \text{moving\_average\_squared\_gradient} = \beta_2 \times \text{moving\_average\_squared\_gradient} + (1 - \beta_2) \times \text{current\_gradient}^2 \]
\[ \text{corrected\_gradient} = \frac{\text{moving\_average\_gradient}}{1 - \beta_1^{\text{step}}} \]
\[ \text{corrected\_squared\_gradient} = \frac{\text{moving\_average\_squared\_gradient}}{1 - \beta_2^{\text{step}}} \]
\[ \text{updated\_weights} = \text{current\_weights} - \frac{\text{learning\_rate} \times \text{corrected\_gradient}}{\sqrt{\text{corrected\_squared\_gradient}} + \text{epsilon}} \]
- \(\beta_1\) (usually 0.9): How much we care about the recent direction.
- \(\beta_2\) (usually 0.999): How much we care about the magnitude.
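To see why the bias correction matters, plug the usual \(\beta_1 = 0.9\) into the formulas above at step 1. The moving average starts at 0, so:
\[ \text{moving\_average\_gradient} = 0.9 \times 0 + 0.1 \times \text{current\_gradient} = 0.1 \times \text{current\_gradient} \]
\[ \text{corrected\_gradient} = \frac{0.1 \times \text{current\_gradient}}{1 - 0.9^{1}} = \text{current\_gradient} \]
Without the correction, the first few updates would be roughly ten times smaller than they should be.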
Let’s build it!
FILE : babygrad/optim.py
Let's write the Adam class.
class Adam(Optimizer):
"""
Implements the Adam optimization algorithm.
"""
def __init__(
self,
params,
lr=0.001,
beta1=0.9,
beta2=0.999,
eps=1e-8,
):
"""
Args:
            params (list[Parameter])
            lr (float, optional): learning rate.
            beta1 (float, optional): decay rate for the first-moment average.
            beta2 (float, optional): decay rate for the second-moment average.
            eps (float, optional): small constant that avoids division by zero.
"""
super().__init__(params)
self.lr = lr
self.beta1 = beta1
self.beta2 = beta2
self.eps = eps
        self.t = 0   # step counter, used for bias correction
        self.m = {}  # first moment, per parameter
        self.v = {}  # second moment, per parameter
def step(self):
self.t += 1
        for param in self.params:
            # Skip parameters that did not receive a gradient.
            if param.grad is None:
                continue
            grad = param.grad
            # Moving average of the gradients (first moment).
            mt = self.beta1 * self.m.get(param, 0) + (1 - self.beta1) * grad
            self.m[param] = mt
            # Moving average of the squared gradients (second moment).
            vt = self.beta2 * self.v.get(param, 0) + (1 - self.beta2) * (grad ** 2)
            self.v[param] = vt
            # Your solution
            # Do an in-place update: param.data -= ...
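For reference, here is one possible completion, a sketch that assumes param.data and param.grad hold NumPy arrays (so the division and ** 0.5 are elementwise). It applies the bias correction and the update rule from the formulas above:
            # Bias-corrected moments (the early-step fix shown earlier).
            m_hat = mt / (1 - self.beta1 ** self.t)
            v_hat = vt / (1 - self.beta2 ** self.t)
            # Per-parameter step: where v_hat is large, the step shrinks.
            param.data -= self.lr * m_hat / (v_hat ** 0.5 + self.eps)
Swapping Adam into the training loop is then a one-line change, optimizer = Adam(model.parameters()), and the zero_grad / backward / step calls stay exactly the same.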