this post was submitted on 19 Nov 2023

Machine Learning

I'm trying to teach a lesson on gradient descent from a more statistical and theoretical perspective, and need a good example to show its usefulness.

What is the simplest possible algebraic function that would be impossible, or at least rather difficult, to optimize by setting its first derivative to 0, but easy to handle with gradient descent? I would preferably like to demonstrate this in the context of linear regression or some extremely simple machine learning model.

top 25 comments
[–] ObjectiveMountain900@alien.top 1 points 10 months ago (1 children)

Linear regression with an L1 regularization term is a great example. While a closed form exists in some cases (a single predictor, or an orthonormal X), gradient descent is generally used.

[–] charlesGodman@alien.top 1 points 10 months ago

To be precise: linear regression with an L2 loss and L1 regularization. Regression with an L1 loss and an L1 penalty can instead be solved as a linear program.
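
For concreteness, here is a minimal numpy sketch of that setup (synthetic data, made-up step size and penalty, not anyone's actual code). Plain gradient descent stumbles on the non-differentiable L1 term, so this uses proximal gradient descent (ISTA): a gradient step on the smooth L2 loss followed by soft-thresholding.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
true_w = np.array([2.0, 0.0, -3.0, 0.0, 1.0])   # two irrelevant features
y = X @ true_w + 0.1 * rng.normal(size=100)

lam, lr = 0.1, 0.01
w = np.zeros(5)
for _ in range(2000):
    grad = X.T @ (X @ w - y) / len(y)                        # gradient of the smooth L2 loss
    w = w - lr * grad                                        # ordinary gradient step
    w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)   # soft-threshold handles the L1 term

print(w)   # coefficients on the irrelevant features are driven to (or very near) zero
```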

[–] muggafugga@alien.top 1 points 10 months ago

Here's an example of fitting a 3rd-order polynomial that I worked with recently and that might be useful for you. The code is largely taken from the PyTorch documentation, with an animation added to illustrate the process.

https://gist.github.com/jlines/00eae9590d26b31cb98b333e537fcb31
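
For readers who don't want to click through, here is a rough numpy version of that kind of fit (not the linked gist itself, and without the animation): fit y = sin(x) with a cubic by gradient descent on the squared error.

```python
import numpy as np

x = np.linspace(-np.pi, np.pi, 2000)
y = np.sin(x)

a, b, c, d = np.random.randn(4)   # coefficients of a + b*x + c*x^2 + d*x^3
lr = 1e-6
for _ in range(2000):
    y_pred = a + b * x + c * x**2 + d * x**3
    err = y_pred - y
    # gradients of the summed squared error with respect to each coefficient
    a -= lr * (2 * err).sum()
    b -= lr * (2 * err * x).sum()
    c -= lr * (2 * err * x**2).sum()
    d -= lr * (2 * err * x**3).sum()

print(a, b, c, d)   # a reasonable cubic approximation of sin on [-pi, pi]
```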

[–] TheGuywithTehHat@alien.top 1 points 10 months ago (1 children)
[–] Inside_Tangerine_784@alien.top 1 points 10 months ago

Abs is an example of a function for which gradient descent does not work. No matter how small the step, it oscillates near zero.
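
A tiny demo of that oscillation (start point and step size are arbitrary): fixed-step (sub)gradient descent on f(x) = |x|.

```python
x, lr = 0.3, 0.25
for i in range(8):
    grad = 1.0 if x > 0 else -1.0   # subgradient of |x| away from 0
    x -= lr * grad
    print(i, round(x, 4))           # once |x| < lr, the iterates just bounce around 0
```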

[–] lacker@alien.top 1 points 10 months ago

A simple example is when you have a set of n points, and you want to find a local maximum of “distance away from the closest point in the set”. The gradient is easy to approximate, and you can show visually what’s happening to students. There are algorithms to solve this exactly but they are much more complicated than gradient descent.
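
A rough numpy sketch of this idea (the points, step schedule, and finite-difference gradient are all just illustration choices): gradient ascent on the distance to the nearest point of a random set in the unit square.

```python
import numpy as np

rng = np.random.default_rng(1)
pts = rng.uniform(0, 1, size=(30, 2))

def f(p):
    return np.min(np.linalg.norm(pts - p, axis=1))   # distance to the closest point

p, eps = np.array([0.5, 0.5]), 1e-4
for t in range(1, 301):
    # central finite-difference approximation of the gradient
    grad = np.array([(f(p + eps * e) - f(p - eps * e)) / (2 * eps) for e in np.eye(2)])
    p = np.clip(p + (0.05 / np.sqrt(t)) * grad, 0, 1)   # ascent step, kept inside the unit square

print(p, f(p))   # a local maximum of "distance to the nearest point"
```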

For one step more complexity you could do an SVM. I.e., divide points into red and blue, and find the line that separates them into two groups, maximizing the margin to the closest point. This is a real problem for which gradient descent (well, some variant of it) is the best solution. It's also nicely visualizable.
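
A hedged sketch of that linear SVM (synthetic blobs, arbitrary hyperparameters): minimize the averaged hinge loss plus an L2 penalty by (sub)gradient descent.

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.vstack([rng.normal([-2, -2], 1.0, (50, 2)),    # "red" class
               rng.normal([ 2,  2], 1.0, (50, 2))])   # "blue" class
y = np.array([-1.0] * 50 + [1.0] * 50)

w, b, lr, lam = np.zeros(2), 0.0, 0.01, 0.01
for _ in range(2000):
    viol = y * (X @ w + b) < 1                         # points violating the margin
    grad_w = lam * w - (y[viol][:, None] * X[viol]).sum(axis=0) / len(y)
    grad_b = -y[viol].sum() / len(y)
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)   # an approximately max-margin separating line w·x + b = 0
```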

[–] idkname999@alien.top 1 points 10 months ago (2 children)

Logistic regression is a simple machine learning model with no closed-form solution.

[–] ravshanbeksk@alien.top 1 points 10 months ago (1 children)

You can write the log-likelihood function, take its derivative, and set it equal to zero.

[–] DoctorFuu@alien.top 1 points 10 months ago

No, you can't solve it that way; to convince yourself, just try to do it.
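
A minimal sketch of what they mean (synthetic data, arbitrary learning rate): the gradient of the log-likelihood, X.T @ (y - sigmoid(X @ w)), is easy to write down, but setting it to zero has no closed-form solution, so one iterates.

```python
import numpy as np

rng = np.random.default_rng(3)
X = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 2))])   # intercept + 2 features
true_w = np.array([0.5, 2.0, -1.0])
y = (rng.uniform(size=200) < 1 / (1 + np.exp(-X @ true_w))).astype(float)

w, lr = np.zeros(3), 0.1
for _ in range(5000):
    p = 1 / (1 + np.exp(-X @ w))
    w += lr * X.T @ (y - p) / len(y)   # gradient *ascent* on the average log-likelihood

print(w)   # roughly recovers true_w
```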

[–] neyman-pearson@alien.top 1 points 10 months ago

Why is this not the most upvoted answer?

[–] Ok_Reality2341@alien.top 1 points 10 months ago

6th degree polynomial?

[–] lithiumdeuteride@alien.top 1 points 10 months ago (3 children)

How about this function:

f(x) = abs(x) - cos(x)

Setting the first derivative equal to zero leads to trouble: the derivative doesn't exist at the minimum, and there are infinitely many other points with zero slope. Nevertheless, the function has a clear minimum, which gradient descent with any finite step size should find.

[–] neyman-pearson@alien.top 1 points 10 months ago (1 children)

Except you could get stuck at an arbitrary saddle point.

[–] DoctorFuu@alien.top 1 points 10 months ago (1 children)

Isn't the probability of landing exactly on one of them zero, though?

[–] jamochasnake@alien.top 1 points 10 months ago

Of course, every locally Lipschitz function is differentiable almost everywhere, i.e., the set of points where it is not differentiable has measure zero. So if you drew a point randomly (with a distribution absolutely continuous w.r.t. the Lebesgue measure), then you would avoid such points with probability one. However, this has NOTHING to do with actually running an algorithm like gradient descent: in these algorithms you are NOT simply drawing points randomly, and you cannot guarantee a priori that you avoid these points.

[–] yannbouteiller@alien.top 1 points 10 months ago

It does answer OP's question, but is of limited practical relevance for an ML course IMHO.

We typically use GD in approximately pseudoconvex optimization landscapes, not because there are infinitely many (or even any) saddle points. To escape local optima and saddle points, we rely on other tricks like SGD.

[–] jamochasnake@alien.top 1 points 10 months ago

Ignoring the fact that you can't prove anything about gradient descent for nonsmooth functions (you need subgradients), a finite step size will NOT work. You can end up at the minimum, compute a subgradient that is different from zero there, and end up moving AWAY from the minimizer. You need to decay your step size to ensure that you end up at the minimizer (asymptotically).
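
A small numeric check of this subthread (start point and step sizes are arbitrary): on f(x) = |x| - cos(x), a fixed subgradient step keeps bouncing around the minimum at 0, while a decaying step settles down.

```python
import numpy as np

def subgrad(x):
    return np.sign(x) + np.sin(x)   # a subgradient of |x| - cos(x); 0 is valid at x = 0

x_fixed = x_decay = 0.5
for t in range(1, 201):
    x_fixed -= 0.2 * subgrad(x_fixed)         # fixed step: oscillates with amplitude ~0.1
    x_decay -= (0.2 / t) * subgrad(x_decay)   # decaying step: amplitude shrinks toward 0

print(x_fixed, x_decay)
```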

[–] ravshanbeksk@alien.top 1 points 10 months ago

Give an example of linear regression where the zero conditional mean assumption does not hold, like an autoregressive process. You cannot just take the derivative then, because it is not differentiable. Gradient descent, however, doesn't care about any of that.

[–] un_om_de_cal@alien.top 1 points 10 months ago

Functions that are an exponential plus a polynomial, or a trigonometric function plus a polynomial, in general have no analytical solution for their roots.

So you could use "minimize f(x) = e^x + x^2" as an example.
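
A quick sketch of that example (starting point and step size are arbitrary): f'(x) = e^x + 2x = 0 has no elementary closed-form solution (it involves the Lambert W function), but gradient descent finds it in a few lines.

```python
import math

x, lr = 0.0, 0.1
for _ in range(200):
    x -= lr * (math.exp(x) + 2 * x)   # step against the derivative of e^x + x^2

print(x, math.exp(x) + 2 * x)   # x ≈ -0.3517, and the derivative there is ≈ 0
```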

[–] wojcech@alien.top 1 points 10 months ago

Go back to your history: Cauchy is the earliest person I'm aware of to have used gradient descent, and he motivated it as

one ordinarily starts by reducing them to a single one by successive eliminations, to eventually solve for good the resulting equation, if possible. But it is important to observe that 1◦ in many cases, the elimination cannot be performed in any way; 2◦ the resulting equation is usually very complicated, even though the given equations are rather simple

That is, the usefulness of gradient descent shows up when you have a rough idea of when you are close to the minimum, but you don't want to go through the hassle of the algebra. (Realistically, if you can solve it with gradient descent, you could probably solve it algebraically; we just don't have the same stupidly-easy-to-implement computational routines for that.)

https://www.math.uni-bielefeld.de/documenta/vol-ismp/40_lemarechal-claude.pdf

[–] tarsiospettro@alien.top 1 points 10 months ago (1 children)

I do not understand most comments here.

Gradient descent is just the same as the tangent method. It is used ubiquitously, e.g. to find the minimum of any polynomial of degree >= 4.

Calculating the derivative and then finding a zero of the derivative is still the same problem: you end up looking for a numerical solution, using gradient descent. Bisection is slower and not very effective for multivariate functions.

I would say the opposite: there are more optimisation problems where gradient descent is used than not (excluding everything that can be solved by linear systems).
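
To make the degree >= 4 point concrete (polynomial, starting point, and step size are all chosen just for illustration): minimizing p(x) = x^4 - 3x^3 + 2x analytically means solving the cubic p'(x) = 4x^3 - 9x^2 + 2 = 0, which is messy, whereas gradient descent is a one-liner.

```python
x, lr = 3.0, 0.01
for _ in range(2000):
    x -= lr * (4 * x**3 - 9 * x**2 + 2)   # step against p'(x)

print(x)   # a local minimum of p, near x ≈ 2.14
```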

[–] DoctorFuu@alien.top 1 points 10 months ago

This reminds me that I still haven't read the paper on the forward-forward algorithm, so I'm not even sure whether it still counts as gradient descent.

[–] Hothapeleno@alien.top 1 points 10 months ago

Find the minimum or maximum of an n-dimensional matrix of data instead of a function. Generate the data with any nice-to-visualise function. Then move on from gradient descent to simulated annealing.
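
A small sketch of the grid warm-up (the function, grid size, and starting cell are arbitrary, and the annealing part is left out): build a 2D array from an easy-to-visualise function, then do a discrete "descent" by repeatedly stepping to the lowest neighbouring cell.

```python
import numpy as np

xs = np.linspace(-3, 3, 200)
Z = np.add.outer(xs**2, xs**2) + 2 * np.sin(3 * np.add.outer(xs, xs))   # a bumpy bowl

i, j = 10, 180   # arbitrary starting cell
while True:
    # current cell plus its 4-neighbourhood, clipped to the grid
    cells = [(i, j)] + [(min(max(i + di, 0), 199), min(max(j + dj, 0), 199))
                        for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1))]
    best = min(cells, key=lambda c: Z[c])
    if best == (i, j):
        break        # no lower neighbour: a (possibly local) minimum of the grid
    i, j = best

print((i, j), Z[i, j])
```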

[–] howtorewriteaname@alien.top 1 points 10 months ago

the error function

[–] hank-particles-pym@alien.top 1 points 10 months ago

Curious about this answer, I use AI to help me understand a lot of what happens with questions like these, but as it's not my area of expertise, I am always suspicious of the answers I get. So I sent your question to Phind:

A good example of a function that is difficult to optimize by setting its first derivative to zero, but that can be easily optimized using gradient descent, is f(x) = x*ln(x) - sqrt(x).

The derivative of this function is f'(x) = ln(x) + 1 - 1/(2*sqrt(x)). Setting this equal to zero does not yield a solution that can be expressed in terms of elementary functions. However, we can easily apply the gradient descent method to find a numerical solution.

*********************************

It also generated a bunch of example code in Python. Curious as to the validity of the response.
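
For what it's worth, the claim seems to hold up. A hedged sanity check (not the generated code, and with an arbitrary starting point and step size): gradient descent on f(x) = x*ln(x) - sqrt(x), whose stationarity condition ln(x) + 1 - 1/(2*sqrt(x)) = 0 does not appear to have an elementary closed-form solution.

```python
import math

x, lr = 1.0, 0.1
for _ in range(500):
    x -= lr * (math.log(x) + 1 - 1 / (2 * math.sqrt(x)))   # step against f'(x)

print(x, math.log(x) + 1 - 1 / (2 * math.sqrt(x)))   # x ≈ 0.676, and the derivative there is ≈ 0
```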