A core, emerging problem in nonconvex optimization involves the escape of saddle points. While recent research has shown that gradient descent (GD) generically escapes saddle points asymptotically (see Rong Ge’s and Ben Recht’s blog posts), the critical open problem is one of efficiency — is GD able to move past saddle points quickly, or can it be slowed down significantly? How does the rate of escape scale with the ambient dimensionality? In this post, we describe our recent work with Rong Ge, Praneeth Netrapalli and Sham Kakade, which provides the first provable positive answer to the efficiency question, showing that, rather surprisingly, GD augmented with suitable perturbations escapes saddle points efficiently; indeed, in terms of rate and dimension dependence it is almost as if the saddle points aren’t there!
Perturbing Gradient Descent
We are in the realm of classical gradient descent (GD) — given a function $f:\mathbb{R}^d \to \mathbb{R}$, we aim to minimize it by moving in the direction of the negative gradient:

$$x_{t+1} = x_t - \eta \nabla f(x_t),$$

where $x_t$ are the iterates and $\eta$ is the step size.
Clearly GD will never move away from a stationary point if started there (even a local maximum); thus, to provide general guarantees, it is necessary to modify GD slightly to incorporate some degree of randomness. Two simple methods have been studied in the literature:
- Intermittent Perturbations: Ge, Huang, Jin and Yuan 2015 considered adding occasional random perturbations to GD, and were able to provide the first polynomial time guarantee for GD to escape saddle points. (See also Rong Ge’s post)
- Random Initialization: Lee et al. 2016 showed that with only random initialization, GD provably avoids saddle points asymptotically (i.e., as the number of steps goes to infinity). (See also Ben Recht’s post)
Asymptotic — and even polynomial time — results are important for the general theory, but they stop short of explaining the success of gradient-based algorithms in practical nonconvex problems. And they fail to provide reassurance that runs of GD can be trusted — that we won’t find ourselves in a situation in which the learning curve flattens out for an indefinite amount of time, with the user having no way of knowing that the asymptotics have not yet kicked in. Lastly, they fail to provide reassurance that GD has the kind of favorable properties in high dimensions that it is known to have for convex problems.
One reasonable approach to this issue is to consider second-order (Hessian-based) algorithms. Although these algorithms are generally (far) more expensive per iteration than GD, and can be more complicated to implement, they do provide the kind of geometric information around saddle points that allows for efficient escape. Accordingly, a reasonable understanding of Hessian-based algorithms has emerged in the literature, and positive efficiency results have been obtained.
Is GD also efficient? Or is the Hessian necessary for fast escape of saddle points?
A negative answer to the first question emerges if one considers the random initialization strategy discussed above. Indeed, this approach is provably inefficient in general, taking exponential time to escape saddle points in the worst case (see the “On the Necessity of Adding Perturbations” section below).
Somewhat surprisingly, it turns out that we obtain a rather different — and positive — result if we consider the perturbation strategy. To be able to state this result, let us be clear on the algorithm that we analyze:
Perturbed gradient descent (PGD)
- for $t = 1, 2, \ldots$ do
  - $x_t \leftarrow x_{t-1} - \eta \nabla f(x_{t-1})$
  - if perturbation condition holds then
    - $x_t \leftarrow x_t + \xi_t$
Here the perturbation $\xi_t$ is sampled uniformly from a ball centered at zero with a suitably small radius, and is added to the iterate when the gradient is suitably small. These particular choices are made for analytic convenience; we do not believe that uniform noise is necessary, nor do we believe it essential that noise be added only when the gradient is small.
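As a concrete illustration, here is a minimal NumPy sketch of PGD. The particular perturbation condition used below (gradient norm below a threshold `g_thresh`, and at least `t_gap` iterations since the last perturbation), the ball radius `r`, and the example function at the bottom are our own illustrative choices, not the exact conditions analyzed in the paper.

```python
import numpy as np

def perturbed_gradient_descent(grad, x0, eta, T, r, g_thresh, t_gap):
    """Minimal sketch of PGD: gradient steps plus occasional perturbations
    sampled uniformly from a ball of radius r."""
    x = np.asarray(x0, dtype=float)
    d = x.size
    last_perturb = -t_gap                        # allow a perturbation right away
    for t in range(T):
        g = grad(x)
        x = x - eta * g                          # standard gradient step
        # Illustrative perturbation condition: the gradient is small and no
        # perturbation was added in the last t_gap steps.
        if np.linalg.norm(g) <= g_thresh and t - last_perturb >= t_gap:
            xi = np.random.randn(d)              # random direction ...
            xi *= r * np.random.rand() ** (1.0 / d) / np.linalg.norm(xi)  # ... uniform in a ball of radius r
            x = x + xi
            last_perturb = t
    return x

# Example: f(x, y) = x^4/4 - x^2/2 + y^2/2 has a strict saddle at the origin
# and minima at (+1, 0) and (-1, 0); started exactly at the saddle, plain GD
# never moves, while PGD escapes to one of the minima.
grad = lambda z: np.array([z[0] ** 3 - z[0], z[1]])
print(perturbed_gradient_descent(grad, np.zeros(2), eta=0.1, T=500,
                                 r=1e-2, g_thresh=1e-3, t_gap=50))
```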
Strict-Saddle and Second-order Stationary Points
We define saddle points in this post to include both classical saddle points as well as local maxima. They are stationary points which are locally maximized along at least one direction. Saddle points and local minima can be categorized according to the minimum eigenvalue of the Hessian:

- $\lambda_{\min}(\nabla^2 f(x)) > 0$: local minimum;
- $\lambda_{\min}(\nabla^2 f(x)) = 0$: local minimum or saddle point;
- $\lambda_{\min}(\nabla^2 f(x)) < 0$: saddle point.

We further call the saddle points in the last category, where $\lambda_{\min}(\nabla^2 f(x)) < 0$, strict saddle points.
While non-strict saddle points can be flat in the valley, strict saddle points require that there is at least one direction along which the curvature is strictly negative. The presence of such a direction gives a gradient-based algorithm the possibility of escaping the saddle point. In general, distinguishing local minima and non-strict saddle points is NP-hard; therefore, we — and previous authors — focus on escaping strict saddle points.
Formally, we make the following two standard assumptions regarding smoothness.
Assumption 1: $f$ is $\ell$-gradient-Lipschitz, i.e.

$$\forall x_1, x_2, \quad \|\nabla f(x_1) - \nabla f(x_2)\| \le \ell \|x_1 - x_2\|.$$
Assumption 2: $f$ is $\rho$-Hessian-Lipschitz, i.e.

$$\forall x_1, x_2, \quad \|\nabla^2 f(x_1) - \nabla^2 f(x_2)\| \le \rho \|x_1 - x_2\|.$$
Similarly to classical theory, which studies convergence to a first-order stationary point ($\nabla f(x) = 0$) by bounding the number of iterations needed to find an $\epsilon$-first-order stationary point ($\|\nabla f(x)\| \le \epsilon$), we formulate the speed of escape from strict saddle points in terms of convergence to the following $\epsilon$-relaxation of a second-order stationary point:
Definition: A point $x$ is an $\epsilon$-second-order stationary point if:

$$\|\nabla f(x)\| \le \epsilon, \quad \text{and} \quad \lambda_{\min}(\nabla^2 f(x)) \ge -\sqrt{\rho \epsilon}.$$
In this definition, $\rho$ is the Hessian-Lipschitz constant from Assumption 2. An $\epsilon$-second-order stationary point thus has a small gradient and is not a strict saddle point, up to the tolerance $\sqrt{\rho\epsilon}$ on the negative curvature.
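To make the definition concrete, the following sketch checks both conditions numerically, given callables for the gradient and the Hessian; the test function and the value of $\rho$ in the example are our own illustrative choices.

```python
import numpy as np

def is_eps_second_order_stationary(grad, hess, x, eps, rho):
    """Check the two conditions in the definition above:
    ||grad f(x)|| <= eps  and  lambda_min(hess f(x)) >= -sqrt(rho * eps)."""
    g = grad(x)
    lam_min = np.linalg.eigvalsh(hess(x)).min()   # minimum eigenvalue of the Hessian
    return np.linalg.norm(g) <= eps and lam_min >= -np.sqrt(rho * eps)

# Illustrative example (our choice): f(x, y) = x^2 - y^2 has a strict saddle
# at the origin, so the origin fails the second condition; rho is a nominal value.
grad = lambda z: np.array([2.0 * z[0], -2.0 * z[1]])
hess = lambda z: np.diag([2.0, -2.0])
print(is_eps_second_order_stationary(grad, hess, np.zeros(2), eps=0.1, rho=1.0))  # False
```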
Applications
In a wide range of practical nonconvex problems it has been proved that all saddle points are strict — such problems include, but are not limited to, principal components analysis, canonical correlation analysis, orthogonal tensor decomposition, phase retrieval, dictionary learning, matrix sensing, matrix completion, and other nonconvex low-rank problems.
Furthermore, in all of these nonconvex problems, it also turns out that all local minima are global minima. Thus, in these cases, any general efficient algorithm for finding $\epsilon$-second-order stationary points immediately yields an efficient algorithm for solving these nonconvex problems with global guarantees.
Escaping Saddle Points with Negligible Overhead
In the classical case of first-order stationary points, GD is known to have very favorable theoretical properties:
Theorem (Nesterov 1998): If Assumption 1 holds, then GD, with $\eta = 1/\ell$, finds an $\epsilon$-first-order stationary point in $2\ell(f(x_0) - f^\star)/\epsilon^2$ iterations.
In this theorem, $x_0$ is the initial point and $f^\star$ is the optimal value of $f$. The theorem says that for any gradient-Lipschitz function, a first-order stationary point can be found by GD in a number of iterations that scales as $O(1/\epsilon^2)$ and has no explicit dependence on the dimension $d$; each gradient evaluation costs $O(d)$, so the overall runtime scales only linearly in $d$.
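As an informal sanity check of this statement (not a proof), one can run GD with $\eta = 1/\ell$ on a simple gradient-Lipschitz function and compare the number of iterations needed to reach $\|\nabla f(x)\| \le \epsilon$ with the bound $2\ell(f(x_0) - f^\star)/\epsilon^2$; the quadratic test function below is our own illustrative choice.

```python
import numpy as np

# Illustrative test function (our choice): f(x) = 0.5 * x^T A x with A
# positive definite, so the gradient-Lipschitz constant is ell = lambda_max(A)
# and f* = 0.
d = 50
A = np.diag(np.linspace(0.1, 4.0, d))
ell = 4.0
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

eps = 1e-3
x = np.ones(d)
bound = 2 * ell * (f(x) - 0.0) / eps ** 2   # Nesterov's iteration bound
t = 0
while np.linalg.norm(grad(x)) > eps:
    x = x - (1.0 / ell) * grad(x)           # GD with step size eta = 1/ell
    t += 1
print(f"reached an eps-first-order stationary point after {t} iterations "
      f"(the bound allows up to {bound:.0f})")
```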
We now wish to address the corresponding problem for second-order stationary points. What is the best we can hope for? Can we also achieve
- A dimension-free number of iterations;
- An $O(1/\epsilon^2)$ convergence rate;
- The same dependence on $\ell$ and $(f(x_0) - f^\star)$ as in (Nesterov 1998)?
Rather surprisingly, the answer is Yes to all three questions (up to small log factors).
Main Theorem: If Assumptions 1 and 2 hold, then PGD, with $\eta = O(1/\ell)$, finds an $\epsilon$-second-order stationary point in $\tilde{O}(\ell(f(x_0) - f^\star)/\epsilon^2)$ iterations with high probability.
Here $\tilde{O}(\cdot)$ hides only absolute constants and factors that are polylogarithmic in $d$ and the other problem parameters. Thus, under the additional Hessian-Lipschitz condition, a perturbed form of GD converges to a second-order stationary point in almost the same time that GD requires to converge to a first-order stationary point; in this sense, PGD escapes strict saddle points almost for free.
We turn to a discussion of some of the intuitions underlying these results.
Why do polylog(d) iterations suffice?
Our strict-saddle assumption means that there is only, in the worst case, one direction in $d$ dimensions along which the iterates can escape. A naive search for this descent direction should intuitively take at least $\text{poly}(d)$ iterations, so why do only $\text{polylog}(d)$ additional iterations suffice?
Consider a simple case in which we assume that the function is quadratic in the neighborhood of the saddle point. That is, let the objective function be $f(x) = x^\top H x$, with a saddle point at zero and constant Hessian $H = \text{diag}(-1, 1, \ldots, 1)$, so that only the first coordinate direction $e_1$ is an escape direction (with negative eigenvalue $-1$).
It is straightforward to work out the general form of the iterates in this case:

$$x_t = x_{t-1} - \eta \nabla f(x_{t-1}) = (I - 2\eta H)\, x_{t-1} = (I - 2\eta H)^t x_0.$$
Assume that we start at the saddle point at zero, then add a perturbation so that $x_0$ is sampled uniformly from a ball $\mathcal{B}_0(1)$ of radius one centered at zero.

Set the step size to be $\eta = 1/4$. Then each component of $x_t$ along a direction of positive curvature shrinks by a factor of $1/2$ per iteration, while the component along the escape direction $e_1$ grows by a factor of $3/2$ per iteration.

A simple probability argument shows that, sampling uniformly in $\mathcal{B}_0(1)$, with at least constant probability the component of $x_0$ along $e_1$ has magnitude $\Omega(1/\sqrt{d})$. Since this component grows geometrically while all others shrink, after only $t = O(\log d)$ iterations the escape component dominates and the function value has decreased by a constant amount below its value at the saddle point. Thus $O(\log d)$ iterations suffice to escape in this simple case.
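A short simulation of this example makes the scaling visible: the average escape time grows roughly logarithmically with $d$. The escape criterion used below (the function value dropping below $-1$) is an arbitrary threshold we chose for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def escape_time(d, eta=0.25, threshold=-1.0):
    """Iterations of GD needed to escape the saddle of f(x) = x^T H x,
    H = diag(-1, 1, ..., 1), starting from a uniform perturbation in the
    unit ball around the saddle point at zero."""
    h = np.ones(d)
    h[0] = -1.0                                  # eigenvalues of H
    x = rng.standard_normal(d)                   # uniform sample in the unit ball:
    x *= rng.random() ** (1.0 / d) / np.linalg.norm(x)
    t = 0
    while x @ (h * x) > threshold:               # f(x) = sum_i h_i * x_i^2
        x = x - eta * 2.0 * h * x                # GD step, since grad f(x) = 2 H x
        t += 1
    return t

for d in [10, 100, 1000, 10000]:
    avg = np.mean([escape_time(d) for _ in range(20)])
    print(f"d = {d:6d}: average escape time ~ {avg:.1f} iterations")
```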
Pancake-shaped stuck region for general Hessian
We can conclude that, in the case of a constant Hessian, the iterates take a long time to escape only when the perturbation $x_0$ lands in a thin set of small measure: the points of $\mathcal{B}_0(1)$ whose component along the escape direction $e_1$ is $O(1/\sqrt{d})$ in magnitude. We call this set the stuck region; in this case it is a flat disk. In general, when the Hessian is no longer constant, the stuck region becomes a warped, non-flat "pancake", and it no longer admits an analytic expression.
Earlier attempts to analyze the dynamics around saddle points tried to approximate the stuck region by a flat set. This results in a requirement of an extremely small step size and a correspondingly very large runtime complexity. Our sharp rate depends on a key observation — although we don’t know the shape of the stuck region, we know it is very thin.
In order to characterize the “thinness” of this pancake, we studied pairs of hypothetical perturbation points $w$ and $u$ that differ only by a small displacement along the escape direction. We show that at least one of the two GD sequences started from $w$ and $u$ must escape the saddle point quickly; it follows that the stuck region can only be thin along the escape direction, and hence that a random perturbation has only a small probability of landing inside it.
On the Necessity of Adding Perturbations
We have discussed two possible ways to modify the standard gradient descent algorithm, the first by adding intermittent perturbations, and the second by relying on random initialization. Although the latter exhibits asymptotic convergence, it does not yield efficient convergence in general; in recent joint work with Simon Du, Jason Lee, Barnabas Poczos, and Aarti Singh, we have shown that even with fairly natural random initialization schemes and non-pathological functions, GD with only random initialization can be significantly slowed by saddle points, taking exponential time to escape. The behavior of PGD is strikingly different — it can generically escape saddle points in polynomial time.
To establish this result, we considered random initializations from a
very general class including Gaussians and uniform distributions over
the hypercube, and we constructed a smooth objective function that
satisfies both Assumptions 1 and 2. This function is constructed such
that, even with random initialization, with high probability both GD and
PGD have to travel sequentially in the vicinity of $d$ strict saddle points before reaching a local minimum.
When GD travels in the vicinity of a sequence of saddle points, it
can get closer and closer to the later saddle points, and thereby take
longer and longer to escape. Indeed, the time needed to escape the $i$-th saddle point grows exponentially in $i$, which leads to an overall running time that is exponential in the dimension $d$.
Conclusion
In this post, we have shown that a perturbed form of gradient descent can converge to a second-order stationary point at almost the same rate as standard gradient descent converges to a first-order stationary point. This implies that Hessian information is not necessary to escape saddle points efficiently, and it helps to explain why basic gradient-based algorithms such as GD (and SGD) work surprisingly well in the nonconvex setting. This new line of sharp convergence results can be applied directly to nonconvex problems such as matrix sensing/completion to establish efficient global convergence rates.
There are of course still many open problems in general nonconvex optimization. To name a few: will adding momentum improve the convergence rate to a second-order stationary point? What types of local minima are tractable, and are there useful structural assumptions that we can impose so as to avoid bad local minima efficiently? We are making slow but steady progress on nonconvex optimization, and there is hope that at some point we will transition from “black art” to “science”.