📌 According to the authors of the paper, a suitable learning rate for Lion is typically 3-10 times smaller than the one used with Adam(w). Since the effective weight decay is lr * λ, the decoupled weight decay λ for Lion should in turn be 3-10 times larger than the one used with Adam(w) to maintain a similar strength.
📌 The initial, peak and final values of the learning rate schedule should all be scaled together, by the same factor relative to their Adam(w) counterparts.
📌 Learning rate scheduling: The authors use the same learning rate schedule for Lion as for Adam(w) in the paper. However, they observed a higher gain when using a cosine decay schedule to train ViT, compared to a square-root decay schedule.
📌 β1 and β2: The default values for β1 and β2 in Adam(w) are 0.9 and 0.999, respectively, with an ε of 1e-8. In Lion, the default values for β1 and β2 were found through the program discovery process and are 0.9 and 0.99, respectively.
📌 Just as practitioners reduce β2 to 0.99 or lower and increase ε to 1e-6 in Adam(w) to improve stability, the authors suggest that using β1=0.95, β2=0.98 in Lion may also help mitigate instability during training. This was corroborated by a researcher.
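The rules of thumb above can be collected into a small helper. The following is only a minimal sketch in plain Python: `OptimConfig`, `adamw_to_lion` and `cosine_lr` are hypothetical names made up for illustration and do not belong to any library, and the warmup-plus-cosine shape of the schedule is an assumption. The helper divides the three learning rate values by one shared factor, multiplies λ by that same factor so that lr * λ stays unchanged, and swaps in Lion's β values.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class OptimConfig:
    """Hypothetical container for optimizer hyperparameters."""
    init_lr: float
    peak_lr: float
    final_lr: float
    weight_decay: float  # decoupled weight decay λ
    beta1: float
    beta2: float


def adamw_to_lion(cfg: OptimConfig, factor: float = 3.0,
                  stable: bool = False) -> OptimConfig:
    """Derive a Lion starting point from an Adam(w) config.

    factor: assumed scaling, taken from the suggested 3-10 range.
    stable: if True, use β1=0.95, β2=0.98 instead of the defaults.
    """
    if not 3.0 <= factor <= 10.0:
        raise ValueError("factor should lie in the suggested 3-10 range")
    beta1, beta2 = (0.95, 0.98) if stable else (0.9, 0.99)
    return OptimConfig(
        # All three schedule values shrink by the same factor.
        init_lr=cfg.init_lr / factor,
        peak_lr=cfg.peak_lr / factor,
        final_lr=cfg.final_lr / factor,
        # λ grows by the same factor, so lr * λ is preserved.
        weight_decay=cfg.weight_decay * factor,
        beta1=beta1,
        beta2=beta2,
    )


def cosine_lr(step: int, total_steps: int, warmup_steps: int,
              cfg: OptimConfig) -> float:
    """Linear warmup from init_lr to peak_lr, then cosine decay to final_lr."""
    if step < warmup_steps:
        return cfg.init_lr + (cfg.peak_lr - cfg.init_lr) * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return cfg.final_lr + (cfg.peak_lr - cfg.final_lr) * cosine


# Example values are illustrative only, not recommendations from the paper.
adamw = OptimConfig(init_lr=0.0, peak_lr=1e-3, final_lr=1e-5,
                    weight_decay=0.1, beta1=0.9, beta2=0.999)
lion = adamw_to_lion(adamw, factor=4.0)
```

The factor is kept as an explicit argument because the 3-10 range is only a starting point; the actual value would still need to be tuned per task.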