Demystifying PyTorch Learning Rate Schedulers
A walkthrough of PyTorch's learning rate schedulers, visualized with matplotlib.
- 1. LambdaLR
- 2. MultiplicativeLR
- 3. StepLR
- 4. MultiStepLR
- 5. ExponentialLR
- 6. CosineAnnealingLR
- 7. CyclicLR - triangular
- 7. CyclicLR - triangular2
- 7. CyclicLR - exp_range
- 8. OneCycleLR - cos
- 8. OneCycleLR - linear
- 9. CosineAnnealingWarmRestarts
import torch
import matplotlib.pyplot as plt
1. LambdaLR
Sets the learning rate of each parameter group to the initial lr multiplied by a user-defined function of the epoch.
model = torch.nn.Linear(2, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=100)
lambda1 = lambda epoch: 0.65 ** epoch
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lambda1)
lrs = []
for i in range(10):
    optimizer.step()
    lrs.append(optimizer.param_groups[0]["lr"])
    # print("Factor = ", round(0.65 ** i, 3), ", Learning Rate = ", round(optimizer.param_groups[0]["lr"], 3))
    scheduler.step()
plt.plot(range(10), lrs)
2. MultiplicativeLR
Multiplies the learning rate of each parameter group by the factor returned by a user-defined function at every epoch.
model = torch.nn.Linear(2, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=100)
lmbda = lambda epoch: 0.65 ** epoch
scheduler = torch.optim.lr_scheduler.MultiplicativeLR(optimizer, lr_lambda=lmbda)
lrs = []
for i in range(10):
    optimizer.step()
    lrs.append(optimizer.param_groups[0]["lr"])
    # print("Factor = ", 0.65 ** i, ", Learning Rate = ", optimizer.param_groups[0]["lr"])
    scheduler.step()
plt.plot(range(10), lrs)
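The two cells above use the same factor function but behave differently: LambdaLR sets the lr to the initial lr times lambda(epoch), while MultiplicativeLR multiplies the current lr by the factor at every step, so the decay compounds. A minimal sketch of the two closed forms, assuming the initial lr of 100 and the 0.65 factor used above:
initial_lr = 100
for epoch in range(5):
    lambdalr = initial_lr * 0.65 ** epoch                            # LambdaLR: lr0 * lambda(epoch)
    multiplicative = initial_lr * 0.65 ** (epoch * (epoch + 1) / 2)  # MultiplicativeLR: lr0 * product of factors
    print(epoch, round(lambdalr, 3), round(multiplicative, 3))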
3. StepLR
Decays the learning rate of each parameter group by gamma every step_size epochs. Notice that such decay can happen simultaneously with other changes to the learning rate from outside this scheduler. When last_epoch=-1, sets initial lr as lr.
$$ lr_{\text{epoch}} = \begin{cases} \gamma \cdot lr_{\text{epoch}-1}, & \text{if } \text{epoch} \bmod \text{step\_size} = 0 \\ lr_{\text{epoch}-1}, & \text{otherwise} \end{cases} $$
model = torch.nn.Linear(2, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=100)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=2, gamma=0.1)
lrs = []
for i in range(10):
    optimizer.step()
    lrs.append(optimizer.param_groups[0]["lr"])
    # print("Factor = ", 0.1 if i != 0 and i % 2 == 0 else 1, ", Learning Rate = ", optimizer.param_groups[0]["lr"])
    scheduler.step()
plt.plot(range(10), lrs)
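The recursion above has a simple closed form: the lr at epoch t is the initial lr times gamma raised to the number of completed step_size intervals. A minimal check, assuming the lr=100, step_size=2, gamma=0.1 configuration from the cell above:
# Closed form of StepLR: lr_t = initial_lr * gamma ** (t // step_size)
initial_lr, gamma, step_size = 100, 0.1, 2
for epoch in range(10):
    print(epoch, initial_lr * gamma ** (epoch // step_size))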
4. MultiStepLR
Decays the learning rate of each parameter group by gamma once the number of epoch reaches one of the milestones. Notice that such decay can happen simultaneously with other changes to the learning rate from outside this scheduler. When last_epoch=-1, sets initial lr as lr.
$$ lr_{\text{epoch}} = \begin{cases} \gamma \cdot lr_{\text{epoch}-1}, & \text{if } \text{epoch} \in \text{milestones} \\ lr_{\text{epoch}-1}, & \text{otherwise} \end{cases} $$
model = torch.nn.Linear(2, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=100)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[6, 8, 9], gamma=0.1)
lrs = []
for i in range(10):
    optimizer.step()
    lrs.append(optimizer.param_groups[0]["lr"])
    # print("Factor = ", 0.1 if i in [6, 8, 9] else 1, ", Learning Rate = ", optimizer.param_groups[0]["lr"])
    scheduler.step()
plt.plot(range(10), lrs)
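MultiStepLR also has a closed form: the lr at epoch t is the initial lr times gamma raised to the number of milestones already reached. A minimal check, assuming the milestones=[6, 8, 9] and gamma=0.1 configuration from the cell above:
from bisect import bisect_right
# Closed form of MultiStepLR: lr_t = initial_lr * gamma ** (number of milestones <= t)
initial_lr, gamma, milestones = 100, 0.1, [6, 8, 9]
for epoch in range(10):
    print(epoch, initial_lr * gamma ** bisect_right(milestones, epoch))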
5. ExponentialLR
Decays the learning rate of each parameter group by gamma every epoch.
model = torch.nn.Linear(2, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=100)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.1)
lrs = []
for i in range(10):
    optimizer.step()
    lrs.append(optimizer.param_groups[0]["lr"])
    # print("Factor = ", 0.1, ", Learning Rate = ", optimizer.param_groups[0]["lr"])
    scheduler.step()
plt.plot(lrs)
6. CosineAnnealingLR
Set the learning rate of each parameter group using a cosine annealing schedule. When last_epoch=-1, sets initial lr as lr. Notice that because the schedule is defined recursively, the learning rate can be simultaneously modified outside this scheduler by other operators. If the learning rate is set solely by this scheduler, the learning rate at each step becomes:
$$ \eta_{t} = \eta_{\min} + \frac{1}{2}\left(\eta_{\max} - \eta_{\min}\right)\left(1 + \cos\left(\frac{T_{cur}}{T_{\max}} \pi\right)\right) $$
It has been proposed in SGDR: Stochastic Gradient Descent with Warm Restarts (https://arxiv.org/abs/1608.03983). Note that this only implements the cosine annealing part of SGDR, and not the restarts.
model = torch.nn.Linear(2, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=100)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10, eta_min=0)
lrs = []
for i in range(100):
    optimizer.step()
    lrs.append(optimizer.param_groups[0]["lr"])
    # print("Epoch = ", i, ", Learning Rate = ", optimizer.param_groups[0]["lr"])
    scheduler.step()
plt.plot(lrs)
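The curve plotted above can be reproduced directly from the annealing formula, which is what the scheduler follows when it alone controls the lr. A minimal sketch, assuming eta_max=100 (the initial lr), eta_min=0 and T_max=10 from the cell above:
import numpy as np
# Evaluate eta_t = eta_min + 0.5 * (eta_max - eta_min) * (1 + cos(pi * t / T_max)) directly.
eta_max, eta_min, T_max = 100, 0, 10
t = np.arange(100)
eta_t = eta_min + 0.5 * (eta_max - eta_min) * (1 + np.cos(np.pi * t / T_max))
plt.plot(t, eta_t)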
7. CyclicLR - triangular
Sets the learning rate of each parameter group according to the cyclical learning rate policy (CLR), which cycles the lr between base_lr and max_lr at a constant frequency. In "triangular" mode the cycle is a plain triangular wave with no amplitude scaling.
model = torch.nn.Linear(2, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.CyclicLR(optimizer, base_lr=0.001, max_lr=0.1, step_size_up=5, mode="triangular")
lrs = []
for i in range(100):
    optimizer.step()
    lrs.append(optimizer.param_groups[0]["lr"])
    # print("Step = ", i, ", Learning Rate = ", optimizer.param_groups[0]["lr"])
    scheduler.step()
plt.plot(lrs)
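The triangular policy from the CLR paper can be written down directly: the lr ramps linearly from base_lr to max_lr over step_size_up iterations and back down again. A minimal sketch of that triangular wave, assuming the base_lr=0.001, max_lr=0.1, step_size_up=5 settings above (with step_size_down left at its default, equal to step_size_up):
import numpy as np
# Triangular CLR policy (Smith, 2017):
# lr = base_lr + (max_lr - base_lr) * max(0, 1 - |it / step_size - 2 * cycle + 1|)
base_lr, max_lr, step_size = 0.001, 0.1, 5
it = np.arange(100)
cycle = np.floor(1 + it / (2 * step_size))
x = np.abs(it / step_size - 2 * cycle + 1)
lr = base_lr + (max_lr - base_lr) * np.maximum(0, 1 - x)
plt.plot(it, lr)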
7. CyclicLR - triangular2
The same triangular cycle, but the amplitude is halved at the end of each cycle.
model = torch.nn.Linear(2, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.CyclicLR(optimizer, base_lr=0.001, max_lr=0.1, step_size_up=5, mode="triangular2")
lrs = []
for i in range(100):
    optimizer.step()
    lrs.append(optimizer.param_groups[0]["lr"])
    # print("Step = ", i, ", Learning Rate = ", optimizer.param_groups[0]["lr"])
    scheduler.step()
plt.plot(lrs)
7. CyclicLR - exp_range
The same cycle again, but the amplitude is scaled by gamma ** (cycle iterations), so the peaks decay exponentially.
model = torch.nn.Linear(2, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.CyclicLR(optimizer, base_lr=0.001, max_lr=0.1, step_size_up=5, mode="exp_range", gamma=0.85)
lrs = []
for i in range(100):
    optimizer.step()
    lrs.append(optimizer.param_groups[0]["lr"])
    # print("Step = ", i, ", Learning Rate = ", optimizer.param_groups[0]["lr"])
    scheduler.step()
plt.plot(lrs)
8. OneCycleLR - cos
Sets the learning rate of each parameter group according to the 1cycle learning rate policy. The 1cycle policy anneals the learning rate from an initial learning rate to some maximum learning rate and then from that maximum learning rate to some minimum learning rate much lower than the initial learning rate. This policy was initially described in the paper Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates.
The 1cycle learning rate policy changes the learning rate after every batch. step should be called after a batch has been used for training.
This scheduler is not chainable.
model = torch.nn.Linear(2, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr=0.1, steps_per_epoch=10, epochs=10)
lrs = []
for i in range(100):
    optimizer.step()
    lrs.append(optimizer.param_groups[0]["lr"])
    # print("Step = ", i, ", Learning Rate = ", optimizer.param_groups[0]["lr"])
    scheduler.step()
plt.plot(lrs)
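Since OneCycleLR is a per-batch schedule, in a real training loop the scheduler is stepped once after every batch rather than once per epoch. A minimal sketch of that pattern on synthetic batches (the random data and MSE loss are just placeholders, not part of the original notebook):
model = torch.nn.Linear(2, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
epochs, steps_per_epoch = 10, 10
scheduler = torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr=0.1, steps_per_epoch=steps_per_epoch, epochs=epochs)
loss_fn = torch.nn.MSELoss()
for epoch in range(epochs):
    for _ in range(steps_per_epoch):            # one iteration per (synthetic) batch
        x, y = torch.randn(8, 2), torch.randn(8, 1)
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
        scheduler.step()                        # step after every batch, not every epoch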
8. OneCycleLR - linear
The same one-cycle policy, but with anneal_strategy='linear' instead of the default cosine annealing.
model = torch.nn.Linear(2, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.OneCycleLR(optimizer, max_lr=0.1, steps_per_epoch=10, epochs=10, anneal_strategy='linear')
lrs = []
for i in range(100):
    optimizer.step()
    lrs.append(optimizer.param_groups[0]["lr"])
    # print("Step = ", i, ", Learning Rate = ", optimizer.param_groups[0]["lr"])
    scheduler.step()
plt.plot(lrs)
9. CosineAnnealingWarmRestarts
Sets the learning rate of each parameter group using a cosine annealing schedule that restarts every T_i epochs, where T_0 is the length of the first cycle and T_mult scales each subsequent cycle. This is the full SGDR schedule (https://arxiv.org/abs/1608.03983), restarts included.
model = torch.nn.Linear(2, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=10, T_mult=1, eta_min=0.001, last_epoch=-1)
lrs = []
for i in range(100):
    optimizer.step()
    lrs.append(optimizer.param_groups[0]["lr"])
    scheduler.step()
plt.plot(lrs)
The same schedule with T_mult=2 and eta_min=0.01, so every restart period is twice as long as the previous one.
model = torch.nn.Linear(2, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=10, T_mult=2, eta_min=0.01, last_epoch=-1)
lrs = []
for i in range(300):
    optimizer.step()
    lrs.append(optimizer.param_groups[0]["lr"])
    scheduler.step()
plt.plot(lrs)
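With T_mult=2 the restarts land at epochs 10, 30, 70, 150, ..., since cycle i lasts T_0 * T_mult ** i epochs. A minimal sketch computing those boundaries, assuming the T_0=10, T_mult=2 settings above:
# Restart boundaries: cycle i lasts T_0 * T_mult ** i epochs.
T_0, T_mult = 10, 2
boundary, T_i = 0, T_0
for _ in range(4):
    boundary += T_i
    print("restart at epoch", boundary)
    T_i *= T_mult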