this post was submitted on 19 Nov 2023

Machine Learning

I'm trying to teach a lesson on gradient descent from a more statistical and theoretical perspective, and need a good example to show its usefulness.

What is the simplest possible algebraic function that would be impossible, or at least rather difficult, to optimize by setting its first derivative to 0, but easy to handle with gradient descent? I would preferably like to demonstrate this in the context of linear regression or some extremely simple machine learning model.

top 25 comments
[–] ObjectiveMountain900@alien.top 1 points 10 months ago (1 children)

Linear regression with an L1 regularization term is a great example. While a closed form exists in some cases (a single predictor, or an orthonormal X), gradient descent is generally used.

[–] charlesGodman@alien.top 1 points 10 months ago

To be precise: linear regression with an L2 loss and L1 regularization. Regression with an L1 loss and an L1 penalty can instead be solved as a linear program.
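
For concreteness, here is a minimal numpy sketch of that setup (synthetic data, made-up step size and penalty, not anyone's actual code). Plain gradient descent stumbles on the non-differentiable L1 term, so this uses proximal gradient descent (ISTA): a gradient step on the smooth L2 loss followed by soft-thresholding.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
true_w = np.array([2.0, 0.0, -3.0, 0.0, 1.0])   # two irrelevant features
y = X @ true_w + 0.1 * rng.normal(size=100)

lam, lr = 0.1, 0.01
w = np.zeros(5)
for _ in range(2000):
    grad = X.T @ (X @ w - y) / len(y)                        # gradient of the smooth L2 loss
    w = w - lr * grad                                        # ordinary gradient step
    w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)   # soft-threshold handles the L1 term

print(w)   # coefficients on the irrelevant features are driven to (or very near) zero
```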

[–] muggafugga@alien.top 1 points 10 months ago

Here's an example of fitting a 3rd-order polynomial that I worked with recently and that might be useful for you. The code is largely taken from the PyTorch documentation, with an animation added to illustrate the process.

https://gist.github.com/jlines/00eae9590d26b31cb98b333e537fcb31
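
For readers who don't want to click through, here is a rough numpy version of that kind of fit (not the linked gist itself, and without the animation): fit y = sin(x) with a cubic by gradient descent on the squared error.

```python
import numpy as np

x = np.linspace(-np.pi, np.pi, 2000)
y = np.sin(x)

a, b, c, d = np.random.randn(4)   # coefficients of a + b*x + c*x^2 + d*x^3
lr = 1e-6
for _ in range(2000):
    y_pred = a + b * x + c * x**2 + d * x**3
    err = y_pred - y
    # gradients of the summed squared error with respect to each coefficient
    a -= lr * (2 * err).sum()
    b -= lr * (2 * err * x).sum()
    c -= lr * (2 * err * x**2).sum()
    d -= lr * (2 * err * x**3).sum()

print(a, b, c, d)   # a reasonable cubic approximation of sin on [-pi, pi]
```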

[–] TheGuywithTehHat@alien.top 1 points 10 months ago (1 children)
[–] Inside_Tangerine_784@alien.top 1 points 10 months ago

Abs is an example of a function for which gradient descent does not work. No matter how small the step, it oscillates near zero.
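
A tiny demo of that oscillation (start point and step size are arbitrary): fixed-step (sub)gradient descent on f(x) = |x|.

```python
x, lr = 0.3, 0.25
for i in range(8):
    grad = 1.0 if x > 0 else -1.0   # subgradient of |x| away from 0
    x -= lr * grad
    print(i, round(x, 4))           # once |x| < lr, the iterates just bounce around 0
```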

[–] lacker@alien.top 1 points 10 months ago

A simple example is when you have a set of n points, and you want to find a local maximum of “distance away from the closest point in the set”. The gradient is easy to approximate, and you can show visually what’s happening to students. There are algorithms to solve this exactly but they are much more complicated than gradient descent.
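
A rough numpy sketch of this idea (the points, step schedule, and finite-difference gradient are all just illustration choices): gradient ascent on the distance to the nearest point of a random set in the unit square.

```python
import numpy as np

rng = np.random.default_rng(1)
pts = rng.uniform(0, 1, size=(30, 2))

def f(p):
    return np.min(np.linalg.norm(pts - p, axis=1))   # distance to the closest point

p, eps = np.array([0.5, 0.5]), 1e-4
for t in range(1, 301):
    # central finite-difference approximation of the gradient
    grad = np.array([(f(p + eps * e) - f(p - eps * e)) / (2 * eps) for e in np.eye(2)])
    p = np.clip(p + (0.05 / np.sqrt(t)) * grad, 0, 1)   # ascent step, kept inside the unit square

print(p, f(p))   # a local maximum of "distance to the nearest point"
```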

For one step more complexity you could do an SVM. I.e., divide points into red and blue, and find the line that separates them into two groups, maximizing the margin to the closest point. This is a real problem for which gradient descent (well, some variant of it) is the best solution. It's also nicely visualizable.
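
A hedged sketch of that linear SVM (synthetic blobs, arbitrary hyperparameters): minimize the averaged hinge loss plus an L2 penalty by (sub)gradient descent.

```python
import numpy as np

rng = np.random.default_rng(2)
X = np.vstack([rng.normal([-2, -2], 1.0, (50, 2)),    # "red" class
               rng.normal([ 2,  2], 1.0, (50, 2))])   # "blue" class
y = np.array([-1.0] * 50 + [1.0] * 50)

w, b, lr, lam = np.zeros(2), 0.0, 0.01, 0.01
for _ in range(2000):
    viol = y * (X @ w + b) < 1                         # points violating the margin
    grad_w = lam * w - (y[viol][:, None] * X[viol]).sum(axis=0) / len(y)
    grad_b = -y[viol].sum() / len(y)
    w -= lr * grad_w
    b -= lr * grad_b

print(w, b)   # an approximately max-margin separating line w·x + b = 0
```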

[–] idkname999@alien.top 1 points 10 months ago (2 children)

Logistic regression is a simple machine learning model with no closed-form solution.

[–] ravshanbeksk@alien.top 1 points 10 months ago (1 children)

You can write the log-likelihood function, take its derivative, and set it equal to zero.

[–] DoctorFuu@alien.top 1 points 10 months ago

No, you can't solve it that way; to convince yourself, just try to do it.
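
A minimal sketch of what they mean (synthetic data, arbitrary learning rate): the gradient of the log-likelihood, X.T @ (y - sigmoid(X @ w)), is easy to write down, but setting it to zero has no closed-form solution, so one iterates.

```python
import numpy as np

rng = np.random.default_rng(3)
X = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 2))])   # intercept + 2 features
true_w = np.array([0.5, 2.0, -1.0])
y = (rng.uniform(size=200) < 1 / (1 + np.exp(-X @ true_w))).astype(float)

w, lr = np.zeros(3), 0.1
for _ in range(5000):
    p = 1 / (1 + np.exp(-X @ w))
    w += lr * X.T @ (y - p) / len(y)   # gradient *ascent* on the average log-likelihood

print(w)   # roughly recovers true_w
```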

[–] neyman-pearson@alien.top 1 points 10 months ago

Why is this not the most upvoted answer?

[–] Ok_Reality2341@alien.top 1 points 10 months ago

6th degree polynomial?

[–] lithiumdeuteride@alien.top 1 points 10 months ago (3 children)

How about this function:

f(x) = abs(x) - cos(x)

Setting the first derivative equal to zero leads to trouble: the derivative doesn't exist at the minimum, and there are infinitely many other points with zero slope. Nevertheless, the function has a clear minimum, which gradient descent with any finite step size should find.

[–] neyman-pearson@alien.top 1 points 10 months ago (1 children)

Except you could get stuck at an arbitrary saddle point.

[–] DoctorFuu@alien.top 1 points 10 months ago (1 children)

Isn't the probability of landing exactly on one of them zero, though?

[–] jamochasnake@alien.top 1 points 10 months ago

Of course, every locally Lipschitz function is differentiable almost everywhere, i.e., the set of points where it is not differentiable has measure zero. So if you drew a point randomly (with a distribution absolutely continuous w.r.t. the Lebesgue measure), then you would avoid such points with probability one. However, this has NOTHING to do with actually running an algorithm like gradient descent: in these algorithms you are NOT simply drawing points randomly, and you cannot guarantee a priori that you avoid these points.

[–] yannbouteiller@alien.top 1 points 10 months ago

It does answer OP's question, but is of limited practical relevance for an ML course IMHO.

We typically use GD in approximately pseudoconvex optimization landscapes, not because there are infinitely many (or even any) saddle points. To escape local optima and saddle points, we rely on other tricks like SGD.

[–] jamochasnake@alien.top 1 points 10 months ago

Ignoring the fact that you can't prove anything about gradient descent for nonsmooth functions (you need subgradients), a finite step size will NOT work. You can end up at the minimum, compute a subgradient that is different from zero there, and end up moving AWAY from the minimizer. You need to decay your step size to ensure that you end up at the minimizer (asymptotically).
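
A small numeric check of this subthread (start point and step sizes are arbitrary): on f(x) = |x| - cos(x), a fixed subgradient step keeps bouncing around the minimum at 0, while a decaying step settles down.

```python
import numpy as np

def subgrad(x):
    return np.sign(x) + np.sin(x)   # a subgradient of |x| - cos(x); 0 is valid at x = 0

x_fixed = x_decay = 0.5
for t in range(1, 201):
    x_fixed -= 0.2 * subgrad(x_fixed)         # fixed step: oscillates with amplitude ~0.1
    x_decay -= (0.2 / t) * subgrad(x_decay)   # decaying step: amplitude shrinks toward 0

print(x_fixed, x_decay)
```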

[–] ravshanbeksk@alien.top 1 points 10 months ago

Give an example of linear regression where the zero conditional mean assumption does not hold, like an autoregressive process. You cannot just take the derivative then, because it is not differentiable. Gradient descent, however, doesn't care about any of that.

[–] un_om_de_cal@alien.top 1 points 10 months ago

Functions that are an exponential plus a polynomial, or a trigonometric function plus a polynomial, in general have no analytical solution for their roots.

So you could use "minimize f(x) = e^x + x^2" as an example.
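
A quick sketch of that example (starting point and step size are arbitrary): f'(x) = e^x + 2x = 0 has no elementary closed-form solution (it involves the Lambert W function), but gradient descent finds it in a few lines.

```python
import math

x, lr = 0.0, 0.1
for _ in range(200):
    x -= lr * (math.exp(x) + 2 * x)   # step against the derivative of e^x + x^2

print(x, math.exp(x) + 2 * x)   # x ≈ -0.3517, and the derivative there is ≈ 0
```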

[–] wojcech@alien.top 1 points 10 months ago

Go back to your history: Cauchy is the earliest person I'm aware of to have used gradient descent, and he motivated it as

one ordinarily starts by reducing them to a single one by successive eliminations, to eventually solve for good the resulting equation, if possible. But it is important to observe that 1◦ in many cases, the elimination cannot be performed in any way; 2◦ the resulting equation is usually very complicated, even though the given equations are rather simple

That is, the usefulness of gradient descent shows up when you have a rough idea of when you are close to the minimum, but you don't want to go through the hassle of the algebra. (Realistically, if you can solve it with gradient descent, you could probably solve it algebraically; we just don't have the same stupidly-easy-to-implement computational routines for that.)

https://www.math.uni-bielefeld.de/documenta/vol-ismp/40_lemarechal-claude.pdf

[–] tarsiospettro@alien.top 1 points 10 months ago (1 children)

I do not understand most comments here.

Gradient descent is just the same as the tangent method. It is used ubiquitously, e.g. to find the minimum of any polynomial of degree >= 4.

Calculating the derivative and then finding a zero of the derivative is still the same problem: you end up looking for a numerical solution, using gradient descent. Bisection is slower and not very effective for multivariate functions.

I would say the opposite: there are more optimisation problems where gradient descent is used than not (excluding everything that can be solved by linear systems).
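
To make the degree >= 4 point concrete (polynomial, starting point, and step size are all chosen just for illustration): minimizing p(x) = x^4 - 3x^3 + 2x analytically means solving the cubic p'(x) = 4x^3 - 9x^2 + 2 = 0, which is messy, whereas gradient descent is a one-liner.

```python
x, lr = 3.0, 0.01
for _ in range(2000):
    x -= lr * (4 * x**3 - 9 * x**2 + 2)   # step against p'(x)

print(x)   # a local minimum of p, near x ≈ 2.14
```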

[–] DoctorFuu@alien.top 1 points 10 months ago

This reminds me that I still haven't read the paper on the forward-forward algorithm, so I'm not even sure whether it still counts as gradient descent.

[–] Hothapeleno@alien.top 1 points 10 months ago

Find the minimum or maximum of an n-dimensional matrix of data instead of a function. Generate the data with any nice-to-visualise function. Then move on from gradient descent to simulated annealing.
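
A small sketch of the grid warm-up (the function, grid size, and starting cell are arbitrary, and the annealing part is left out): build a 2D array from an easy-to-visualise function, then do a discrete "descent" by repeatedly stepping to the lowest neighbouring cell.

```python
import numpy as np

xs = np.linspace(-3, 3, 200)
Z = np.add.outer(xs**2, xs**2) + 2 * np.sin(3 * np.add.outer(xs, xs))   # a bumpy bowl

i, j = 10, 180   # arbitrary starting cell
while True:
    # current cell plus its 4-neighbourhood, clipped to the grid
    cells = [(i, j)] + [(min(max(i + di, 0), 199), min(max(j + dj, 0), 199))
                        for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1))]
    best = min(cells, key=lambda c: Z[c])
    if best == (i, j):
        break        # no lower neighbour: a (possibly local) minimum of the grid
    i, j = best

print((i, j), Z[i, j])
```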

[–] howtorewriteaname@alien.top 1 points 10 months ago

the error function

[–] hank-particles-pym@alien.top 1 points 10 months ago

Curious about this answer, I use AI to help me understand a lot of what happens with questions like these, but as it's not my area of expertise, I am always suspicious of the answers I get. So I sent your question to Phind:

A good example of a function that is difficult to optimize by setting its first derivative to zero, but that can be easily optimized using gradient descent, is f(x) = x*ln(x) - sqrt(x).

The derivative of this function is f'(x) = ln(x) + 1 - 1/(2*sqrt(x)). Setting this equal to zero does not yield a solution that can be expressed in terms of elementary functions. However, we can easily apply the gradient descent method to find a numerical solution.

*********************************

It also generated a bunch of example code in Python. Curious as to the validity of the response.
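
For what it's worth, the claim seems to hold up. A hedged sanity check (not the generated code, and with an arbitrary starting point and step size): gradient descent on f(x) = x*ln(x) - sqrt(x), whose stationarity condition ln(x) + 1 - 1/(2*sqrt(x)) = 0 does not appear to have an elementary closed-form solution.

```python
import math

x, lr = 1.0, 0.1
for _ in range(500):
    x -= lr * (math.log(x) + 1 - 1 / (2 * math.sqrt(x)))   # step against f'(x)

print(x, math.log(x) + 1 - 1 / (2 * math.sqrt(x)))   # x ≈ 0.676, and the derivative there is ≈ 0
```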