Convergence Analysis of Gradient Descent

Prerequisites

Lipschitz Continuity

A function $f: \mathbb{R}^n \rightarrow \mathbb{R}^m$ is said to be Lipschitz continuous ($L$-continuous) if and only if there exists $L > 0$ such that for all $x, y \in \mathbb{R}^n$, we have

$$\|f(x) - f(y)\| \le L \|x - y\|$$

Lipschitz Smoothness

Similarly, $f: \mathbb{R}^n \rightarrow \mathbb{R}$ is said to be Lipschitz smooth ($L$-smooth) if and only if there exists $L > 0$ such that for all $x, y \in \mathbb{R}^n$, we have

$$\|\nabla f(x) - \nabla f(y)\| \le L \|x - y\|$$

Convex Function

A function $f: \mathbb{R}^n \rightarrow \mathbb{R}$ is a convex function if and only if for all $x, y \in \mathbb{R}^n$ and all $\lambda \in [0, 1]$, the following holds:

$$
\begin{align*}
& f(\lambda x + (1 - \lambda) y) \le \lambda f(x) + (1 - \lambda) f(y) \\
\Leftrightarrow\ & f(y) \ge f(x) + \nabla f(x)^T (y - x) \\
\Leftrightarrow\ & \nabla^2 f(x) \succcurlyeq 0
\end{align*}
$$

(The second and third characterizations assume $f$ is differentiable and twice differentiable, respectively.)
Note

For a symmetric matrix $A \in \mathbb{R}^{n\times n}$:

  • If $x^T A x \ge 0$ for all $x \in \mathbb{R}^n$, then $A$ is called a positive semi-definite matrix (p.s.d. matrix), denoted as $A \succcurlyeq 0$.
  • If $x^T A x > 0$ for all nonzero $x \in \mathbb{R}^n$, then $A$ is called a positive definite matrix (p.d. matrix), denoted as $A \succ 0$.

From the perspective of eigenvalues, with $\lambda_1, \cdots, \lambda_n$ the eigenvalues of $A$:

  • $A \succcurlyeq 0$ is equivalent to $\lambda_i \ge 0$ for all $i$.
  • $A \succ 0$ is equivalent to $\lambda_i > 0$ for all $i$.
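These conditions are easy to check numerically via eigenvalues. A minimal sketch using NumPy (the matrix below is an illustrative choice, not from the text):

```python
import numpy as np

# An illustrative symmetric matrix.
A = np.array([[2.0, -1.0],
              [-1.0, 2.0]])

# For symmetric matrices, eigvalsh returns the (real) eigenvalues.
eigvals = np.linalg.eigvalsh(A)

print("eigenvalues:", eigvals)          # [1. 3.]
print("p.s.d.:", np.all(eigvals >= 0))  # True
print("p.d.: ", np.all(eigvals > 0))    # True
```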

Strongly Convex Function

A function $f: \mathbb{R}^n \rightarrow \mathbb{R}$ is strongly convex ($\mu$-strongly convex) if and only if there exists $\mu > 0$ such that for all $x, y \in \mathbb{R}^n$ and all $\lambda \in [0, 1]$, the following holds:

$$
\begin{align*}
& f(\lambda x + (1 - \lambda) y) \le \lambda f(x) + (1 - \lambda) f(y) - \frac{\mu}{2} \lambda (1 - \lambda) \|x - y\|^2 \\
\Leftrightarrow\ & f(y) \ge f(x) + \nabla f(x)^T (y - x) + \frac{\mu}{2} \|x - y\|^2 \\
\Leftrightarrow\ & \nabla^2 f(x) \succcurlyeq \mu I
\end{align*}
$$

Gradient Descent

Gradient descent is an iterative algorithm. Here we only consider its simplest form, which requires the function $f$ to be at least Lipschitz smooth. $L$-smoothness gives the quadratic upper bound (the descent lemma)

$$f(x_{k+1}) \le f(x_k) + \nabla f(x_k)^T (x_{k+1} - x_k) + \frac{L}{2} \|x_{k+1} - x_k\|^2$$

Thus, even though $f$ might not be a quadratic function, each iteration of gradient descent can be viewed as operating on the quadratic function $f(x_k) + \nabla f(x_k)^T (x - x_k) + \frac{L}{2} \|x - x_k\|^2$. Therefore, the objective function being minimized is

$$q_{x_k}(x) = f(x_k) + \nabla f(x_k)^T (x - x_k) + \frac{L}{2} \|x - x_k\|^2 \approx f(x)$$

To minimize the quadratic function $q_{x_k}(x)$, we simply set its gradient to zero:

$$
\begin{align*}
\nabla q_{x_k}(x) = \nabla f(x_k) + L (x - x_k) & \xlongequal{!} 0 \\
\Rightarrow x_{k+1} = x & = x_k - \frac{1}{L} \nabla f(x_k)
\end{align*}
$$

Here, $-\nabla f(x_k)$ is the descent direction, and $\frac{1}{L}$ is the step size, denoted as $\alpha$. From this, we can also derive

$$
\begin{align*}
f(x_{k+1}) \le & f(x_k) + \nabla f(x_k)^T (x_{k+1} - x_k) + \frac{L}{2} \|x_{k+1} - x_k\|^2 \\
= & f(x_k) - \frac{1}{L} \|\nabla f(x_k)\|^2 + \frac{1}{2L} \|\nabla f(x_k)\|^2 \\
= & f(x_k) - \frac{1}{2L} \|\nabla f(x_k)\|^2
\end{align*}
$$
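The update rule above translates directly into code. Here is a minimal sketch of gradient descent with the fixed step size $\alpha = \frac{1}{L}$; the quadratic test function and its smoothness constant are illustrative assumptions:

```python
import numpy as np

def gradient_descent(grad_f, x0, L, num_iters):
    """Gradient descent with fixed step size alpha = 1/L."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        x = x - grad_f(x) / L  # x_{k+1} = x_k - (1/L) * grad f(x_k)
    return x

# Illustrative example: f(x) = 0.5 * x^T A x with A = diag(1, 10),
# which is L-smooth with L = lambda_max(A) = 10.
A = np.diag([1.0, 10.0])
x_T = gradient_descent(lambda x: A @ x, x0=[5.0, 5.0], L=10.0, num_iters=200)
print(x_T)  # close to the minimizer [0, 0]
```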
Newton's Method

Besides gradient descent, we can also use Newton's method for optimization. To solve $h(x)=0$ (for minimization, take $h = \nabla f$), we can define $g(x)=x-\alpha h(x)$, where $\alpha>0$ is a scaling factor. Then we can iterate as follows:

$$x_{k+1} = g(x_k)$$

When the iteration converges, we have

$$
\begin{align*}
x_* = g(x_*) & = x_* - \alpha h(x_*) \\
\Rightarrow h(x_*) & = 0
\end{align*}
$$
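A minimal sketch of this fixed-point iteration; the target equation $h(x) = x^2 - 2 = 0$ and the value of $\alpha$ are illustrative choices:

```python
def fixed_point_solve(h, x0, alpha=0.1, tol=1e-10, max_iters=10_000):
    """Iterate x_{k+1} = g(x_k) = x_k - alpha * h(x_k) until the steps stall."""
    x = float(x0)
    for _ in range(max_iters):
        x_next = x - alpha * h(x)
        if abs(x_next - x) < tol:  # converged: h(x) ~ 0
            return x_next
        x = x_next
    return x

# Illustrative example: solve x^2 - 2 = 0, i.e. compute sqrt(2).
root = fixed_point_solve(lambda x: x**2 - 2.0, x0=1.0)
print(root)  # ~ 1.4142135...
```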

Convergence Criteria

Assume that $x_*\in\Omega$ is the minimizer of $f$ over its domain $\Omega$, with the corresponding function value $f(x_*) = f_*$. Then for any $\varepsilon > 0$:

  1. If $f(x)-f_* \le\varepsilon$, then $x$ is called an $\varepsilon$-optimal point.
  2. If $\|\nabla f(x)\| \le\varepsilon$, then $x$ is called an $\varepsilon$-critical point.

Convergence Analysis

In the previous derivation, we obtained the step size $\alpha = \frac{1}{L}$ for gradient descent applied to Lipschitz smooth functions, so the following analysis assumes this step size. It should be noted that, both in practical applications and in theoretical analysis, functions that are not only smooth but also convex admit better step-size choices that lead to faster convergence. However, their derivation is more complex, so here we only consider the simplest case.

Lipschitz Smooth Functions

For a Lipschitz smooth function $f$, we have

$$
\begin{align*}
f(x+\alpha d) \le & f(x) + \alpha\nabla f(x)^T d + \frac{L}{2}\alpha^2\|d\|^2 \\
\Rightarrow f(x_{k+1}) \le & f(x_k) - \frac{1}{2L}\|\nabla f(x_k)\|^2 \\
\Rightarrow f(x_T) \le & f(x_0) - \frac{1}{2L}\sum_{k=0}^{T-1}\|\nabla f(x_k)\|^2 \\
\Rightarrow \sum_{k=0}^{T-1}\|\nabla f(x_k)\|^2 \le & 2L\left(f(x_0) - f(x_T)\right) \\
\le & 2L\left(f(x_0) - f_*\right) \\
\Rightarrow \min_{0\le k\le T-1}\|\nabla f(x_k)\|^2 \le & \frac{2L}{T}\left(f(x_0) - f_*\right) \\
\Rightarrow \min_{0\le k\le T-1}\|\nabla f(x_k)\| \le & \sqrt{\frac{2L}{T}\left(f(x_0) - f_*\right)}
\end{align*}
$$

Thus, to achieve an $\varepsilon$-critical point, the number of iterations required must satisfy

$$
\begin{align*}
\sqrt{\frac{2L}{T}\left(f(x_0) - f_*\right)} \le & \varepsilon \\
\Rightarrow T \ge & \frac{2L}{\varepsilon^2}\left(f(x_0) - f_*\right)
\end{align*}
$$

Therefore, the required number of iterations is $O\left(\frac{1}{\varepsilon^2}\right)$, which corresponds to sublinear convergence.
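This bound can be checked numerically. A minimal sketch, where the smooth test function, its constant $L$, and the horizon $T$ are illustrative assumptions:

```python
import numpy as np

# Illustrative L-smooth test function: f(x) = 0.5 * x^T A x, L = lambda_max(A).
A = np.diag([1.0, 10.0])
L = 10.0
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

x = np.array([5.0, 5.0])
f0, f_star, T = f(x), 0.0, 100

min_grad_norm = np.inf
for _ in range(T):
    min_grad_norm = min(min_grad_norm, np.linalg.norm(grad(x)))
    x = x - grad(x) / L  # step size 1/L

bound = np.sqrt(2 * L * (f0 - f_star) / T)
print(min_grad_norm, "<=", bound, ":", min_grad_norm <= bound)  # True
```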

Rate of Convergence

For a sequence $\{a_n\}$ of positive terms, if $\lim_{n\to\infty}a_n = 0$ and we have

$$\lim_{n\to\infty}\frac{a_{n+1}}{a_n} = \rho$$

  1. If $\rho=0$, then $\{a_n\}$ is said to converge superlinearly.
  2. If $\rho\in(0,1)$, then $\{a_n\}$ is said to converge linearly.
  3. If $\rho=1$, then $\{a_n\}$ is said to converge sublinearly.
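For instance, $a_n = \frac{1}{n!}$ converges superlinearly, $a_n = 2^{-n}$ linearly, and $a_n = \frac{1}{n}$ sublinearly; a quick numerical look at the ratios (the example sequences are illustrative choices):

```python
import math

# Ratios a_{n+1} / a_n at a moderately large n approximate the limit rho.
n = 50
print(math.factorial(n) / math.factorial(n + 1))  # a_n = 1/n!: rho -> 0 (superlinear)
print(2.0 ** -(n + 1) / 2.0 ** -n)                # a_n = 2^-n: rho = 1/2 (linear)
print((1.0 / (n + 1)) / (1.0 / n))                # a_n = 1/n:  rho -> 1 (sublinear)
```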

Lipschitz Smooth + Convex Function

For a convex function $f$, we have

$$
\begin{align*}
f(y) \ge & f(x) + \nabla f(x)^T(y - x) \\
\text{Let } \begin{cases} y=x_*\\ x=x_k \end{cases} \Rightarrow f_* \ge & f(x_k) + \nabla f(x_k)^T(x_* - x_k) \\
\Rightarrow f(x_k) \le & f_* + \nabla f(x_k)^T(x_k - x_*)
\end{align*}
$$

Combining this with the inequality for Lipschitz smooth functions, $f(x_{k+1}) \le f(x_k)-\frac{1}{2L}\left\|\nabla f(x_k)\right\|^2$, and substituting, we get

$$f(x_{k+1}) \le f_* + \nabla f(x_k)^T(x_k - x_*) - \frac{1}{2L}\left\|\nabla f(x_k)\right\|^2$$

Furthermore, since the update rule gives $\nabla f(x_k) = L(x_k - x_{k+1})$,

$$
\begin{align*}
\left\|\nabla f(x_k)\right\|^2 = & L^2\left\|x_{k+1} - x_k\right\|^2 \\
= & L^2 \left(\left\|x_{k+1} - x_*\right\|^2 + \left\|x_k - x_*\right\|^2 - 2 \left\langle x_{k+1}-x_*,\, x_k-x_*\right\rangle\right) \\
= & L^2 \left(\left\|x_{k+1} - x_*\right\|^2 + \left\|x_k - x_*\right\|^2 - 2 \left\langle x_k-\tfrac{1}{L}\nabla f(x_k)-x_*,\, x_k-x_*\right\rangle\right) \\
= & L^2 \left(\left\|x_{k+1} - x_*\right\|^2 - \left\|x_k - x_*\right\|^2 + \tfrac{2}{L} \left\langle\nabla f(x_k),\, x_k-x_*\right\rangle\right) \\
= & L^2 \left(\left\|x_{k+1} - x_*\right\|^2 - \left\|x_k - x_*\right\|^2 + \tfrac{2}{L} \nabla f(x_k)^T(x_k-x_*)\right)
\end{align*}
$$

Substituting into the inequality above yields

$$
\begin{align*}
f(x_{k+1}) \le & f_* + \nabla f(x_k)^T(x_k - x_*) \\
& - \frac{L}{2}\left(\left\|x_{k+1} - x_*\right\|^2 - \left\|x_k - x_*\right\|^2 + \frac{2}{L} \nabla f(x_k)^T(x_k-x_*)\right) \\
= & f_* - \frac{L}{2}\left(\left\|x_{k+1} - x_*\right\|^2 - \left\|x_k - x_*\right\|^2\right) \\
\Rightarrow \sum_{k=0}^{T-1}f(x_{k+1}) \le & Tf_* - \frac{L}{2}\left(\left\|x_T - x_*\right\|^2 - \left\|x_0 - x_*\right\|^2\right) \\
\le & Tf_* + \frac{L}{2}\left\|x_0 - x_*\right\|^2
\end{align*}
$$

Since each step satisfies $f(x_{k+1}) \le f(x_k)-\frac{1}{2L}\left\|\nabla f(x_k)\right\|^2$, the function values are non-increasing, so $f(x_T) \le f(x_{k+1})$ for all $k=0,\cdots,T-1$. Therefore,

$$
\begin{align*}
T f(x_T) \le & T f_* + \frac{L}{2}\left\|x_0 - x_*\right\|^2 \\
\Rightarrow f(x_T) - f_* \le & \frac{L}{2T}\left\|x_0 - x_*\right\|^2
\end{align*}
$$

To achieve an $\varepsilon$-optimal point, the number of iterations required must satisfy

$$
\begin{align*}
\frac{L}{2T}\left\|x_0 - x_*\right\|^2 \le & \varepsilon \\
\Rightarrow T \ge & \frac{L}{2\varepsilon}\left\|x_0 - x_*\right\|^2
\end{align*}
$$

Thus, the required number of iterations is $O\left(\frac{1}{\varepsilon}\right)$, which is still sublinear convergence.
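Again, this $O\left(\frac{1}{T}\right)$ bound can be sanity-checked numerically. The convex (but not strongly convex) test function below, $f(x) = \log(e^x + e^{-x})$ with $L = 1$, is an illustrative assumption:

```python
import numpy as np

# Illustrative convex, 1-smooth test function: f(x) = log(e^x + e^{-x}),
# with minimizer x_* = 0, f_* = log(2), and f''(x) = 1 - tanh(x)^2 <= 1 = L.
f = lambda x: np.logaddexp(x, -x)
grad = np.tanh
L, x_star, f_star = 1.0, 0.0, np.log(2.0)

x0 = 3.0
x, T = x0, 100
for _ in range(T):
    x = x - grad(x) / L  # step size 1/L

bound = L * (x0 - x_star) ** 2 / (2 * T)
print(f(x) - f_star, "<=", bound, ":", f(x) - f_star <= bound)  # True
```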

Lipschitz Smooth + Strongly Convex Function

For a strongly convex function $f$, we have

$$
\begin{align*}
f(y) \ge & f(x) + \nabla f(x)^T(y - x) + \frac{\mu}{2}\|y - x\|^2 \\
\overset{y=x_*}{\Rightarrow} f_* \ge & \min_y\left\{f(x) + \nabla f(x)^T(y - x) + \frac{\mu}{2}\|y - x\|^2\right\} \qquad \left(\text{minimum at } y-x=-\tfrac{\nabla f(x)}{\mu}\right) \\
= & f(x) - \frac{1}{\mu}\left\|\nabla f(x)\right\|^2 + \frac{\mu}{2}\cdot\frac{1}{\mu^2}\left\|\nabla f(x)\right\|^2 \\
= & f(x) - \frac{1}{2\mu}\left\|\nabla f(x)\right\|^2 \\
\Rightarrow f(x)-f_* \le & \frac{1}{2\mu}\left\|\nabla f(x)\right\|^2
\end{align*}
$$

Combining this with the inequality for Lipschitz smooth functions, $f(x_{k+1}) \le f(x_k)-\frac{1}{2L}\left\|\nabla f(x_k)\right\|^2$, and substituting $\left\|\nabla f(x_k)\right\|^2 \ge 2\mu\left(f(x_k)-f_*\right)$ from above, we obtain

$$
\begin{align*}
f(x_{k+1}) \le & f(x_k) - \frac{2\mu}{2L}\left(f(x_k)-f_*\right) \\
\Rightarrow f(x_{k+1})-f_* \le & f(x_k)-f_* - \frac{\mu}{L}\left(f(x_k)-f_*\right) \\
= & \left(1-\frac{\mu}{L}\right)\left(f(x_k)-f_*\right) \\
\Rightarrow f(x_T)-f_* \le & \left(1-\frac{\mu}{L}\right)^T\left(f(x_0)-f_*\right)
\end{align*}
$$

To achieve an $\varepsilon$-optimal point, the number of iterations required must satisfy

$$
\begin{align*}
\left(1-\frac{\mu}{L}\right)^T\left(f(x_0)-f_*\right) \le & \varepsilon \\
\Leftrightarrow \frac{f(x_0)-f_*}{\varepsilon} \le & \left(1-\frac{\mu}{L}\right)^{-T} \\
\Leftrightarrow \log\frac{f(x_0)-f_*}{\varepsilon} \le & T\underbrace{\log\frac{1}{1-\frac{\mu}{L}}}_{\ge\, \frac{\mu}{L}}
\end{align*}
$$

Since $\log\frac{1}{1-\mu/L} \ge \frac{\mu}{L}$, it suffices to take

$$T \ge \frac{L}{\mu}\log\frac{f(x_0)-f_*}{\varepsilon}$$

Therefore, the required number of iterations is $O\left(\log\frac{1}{\varepsilon}\right)$, which corresponds to linear convergence.
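The geometric decay is easy to observe on a strongly convex quadratic; the matrix and constants below are illustrative assumptions:

```python
import numpy as np

# Illustrative strongly convex quadratic: f(x) = 0.5 * x^T A x,
# with mu = lambda_min(A) = 1 and L = lambda_max(A) = 10.
A = np.diag([1.0, 10.0])
mu, L = 1.0, 10.0
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

x = np.array([5.0, 5.0])
f0, f_star, T = f(x), 0.0, 50
for _ in range(T):
    x = x - grad(x) / L  # step size 1/L

bound = (1 - mu / L) ** T * (f0 - f_star)
print(f(x) - f_star, "<=", bound, ":", f(x) - f_star <= bound)  # True
```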

Miscellaneous Notes

Midterm season is incredibly busy. There are many things I want to write about, but I don’t even have time to write code for the project I’m currently working on, let alone these topics. I can only jot down what might be most useful for now and leave the rest for later.

There is much more to convex optimization, including accelerated gradient descent algorithms, stochastic gradient descent algorithms, constrained optimization, and so on. These are all very interesting topics. Of course, conducting convergence analysis for them can be quite challenging. I’ll write about them when I get the chance. Most likely they are too difficult, so I won’t.
