A plot with the fitted ln curve overlaid would be helpful for checking goodness of fit. It looks like 30-45 is actually somewhat linear rather than logarithmic, and without domain knowledge about what x and y are, I find it difficult to propose a fit for what looks to be a piecewise function.
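Something along these lines would do it (a minimal sketch; x, y and the a*ln(x)+b form are my assumptions, since we haven't seen the actual data or model):

```python
# Hypothetical overlay sketch: assumes x and y are numpy arrays with x > 0,
# and that the fitted curve is the two-parameter a*ln(x)+b from the thread.
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import curve_fit

def ln_model(x, a, b):
    return a * np.log(x) + b

(a, b), _ = curve_fit(ln_model, x, y)            # fit the log curve

xs = np.linspace(x.min(), x.max(), 200)
plt.scatter(x, y, s=10, label="data")
plt.plot(xs, ln_model(xs, a, b), color="red", label="fitted a*ln(x)+b")
plt.legend()
plt.show()
```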
Not a direct answer, but be aware that overfitting will be a thing here too. You might get an R2 of 0.99+ while the extrapolation is horrendous (you already saw that with the high-degree polynomial). 0.94 with only two parameters does not sound too bad to me.
Maximizing R2 and eyeballing the extrapolation is not really a valid approach. You should use a goodness-of-fit measure that accounts for model complexity, e.g. AIC or BIC. You could also implement a simple validation by leaving out the last x% of your data when fitting and then looking at the test error.
I also have to agree that it looks somewhat piecewise. Without knowing the generating process, the correct continuation could be anything.
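To make the complexity-aware comparison concrete, here is a minimal AIC sketch (x, y assumed as above; the log model and the degree-10 polynomial are stand-ins, and the AIC below is the standard Gaussian-error form with constants dropped):

```python
import numpy as np
from scipy.optimize import curve_fit

def ln_model(x, a, b):
    return a * np.log(x) + b

def aic(y_true, y_pred, n_params):
    # AIC = n*ln(RSS/n) + 2k for Gaussian errors; lower is better
    n = len(y_true)
    rss = np.sum((y_true - y_pred) ** 2)
    return n * np.log(rss / n) + 2 * n_params

(a, b), _ = curve_fit(ln_model, x, y)
poly = np.polynomial.Polynomial.fit(x, y, deg=10)   # the overfit-prone contender

print("AIC log  (2 params): ", aic(y, ln_model(x, a, b), 2))
print("AIC poly (11 params):", aic(y, poly(x), 11))
```

The penalty on parameter count means the polynomial can't win just by memorizing the training points.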
Your problem begs for smoothing splines.
For a continuous univariate relationship, you literally cannot do better than smoothing splines.
No function will ever, EVER provide a better fit to your data than a smoothing spline. There is some cool theory behind this.
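If you want to try one, a minimal scipy sketch (x, y assumed, with x strictly increasing; the smoothing factor s is a knob to tune, and s=len(x) here is just a starting guess):

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

# s=0 interpolates every point; larger s trades fidelity for smoothness
spl = UnivariateSpline(x, y, k=3, s=len(x))

xs = np.linspace(x.min(), x.max(), 200)
smooth = spl(xs)   # the fitted smooth curve, ready to plot against the data
```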
The issue here is that you want to extrapolate values outside of the training set (for x>60). You could even get to 0 error, R2=1, on the training data, but it would be meaningless, because you are going to predict outside of this range. If you don't have data for the range that interests you, the best thing you can do is rely on domain knowledge.
For example, if you have reason to believe that the function is going to approach an asymptote, you can exploit this knowledge by limiting the class of fitting functions to e.g. parametric sigmoids.
Or if you know that the process you are modeling has a specific functional type, like logarithmic or square root, then limit the function space accordingly.
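A quick sketch of both ideas with scipy's curve_fit (x, y and the starting guesses p0 are assumptions; sigmoid fits are sensitive to the initial guess):

```python
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(x, L, k, x0):            # asymptote hypothesis
    return L / (1 + np.exp(-k * (x - x0)))

def log_model(x, a, b):              # logarithmic-growth hypothesis
    return a * np.log(x) + b

p_sig, _ = curve_fit(sigmoid, x, y, p0=(y.max(), 0.1, np.median(x)), maxfev=10000)
p_log, _ = curve_fit(log_model, x, y)
```

Whichever family you pick, the fit can only ever say what your hypothesis allows it to say, which is exactly the point.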
If you have any other kind of knowledge about your function, it could be used as a prior distribution in a Bayesian approach, like Bayesian regression or a Gaussian process.
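For instance, a Gaussian-process sketch with scikit-learn; the kernel is exactly where the prior goes (the RBF-plus-noise choice and its hyperparameters here are only default assumptions):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# the kernel encodes your beliefs about smoothness and noise
kernel = 1.0 * RBF(length_scale=10.0) + WhiteKernel(noise_level=1.0)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
gp.fit(x.reshape(-1, 1), y)

x_new = np.linspace(0, 80, 200).reshape(-1, 1)
mean, std = gp.predict(x_new, return_std=True)  # std widens past x=60: honest uncertainty
```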
Bottom line is, there is no magic "make it work" button in ML/statistical modeling; you have to embed your domain knowledge in it. The modeling process is not a blind one.
Can't the training set tell you how the curve is rising (the change in y for every corresponding change in x)? And couldn't this change be carried forward to all future values of x to follow the trend and obtain the predictions?
Thinking out loud with my intuition here. Is there any model that resembles the above logic?
That can be informative, but as I was saying, you have to limit the function space to functions compatible with your hypothesis.
Let me repeat my question more clearly: do you know (or can you guess) what the function should look like after x=60?
Since you mentioned the rate of change, have you ever plotted the numerical derivative of this function? Maybe it has a recognizable shape that could help you identify the right class.
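It's a two-line check (x, y assumed sorted by x; np.gradient handles uneven spacing):

```python
import numpy as np
import matplotlib.pyplot as plt

plt.plot(x, np.gradient(y, x))   # numerical dy/dx
plt.xlabel("x")
plt.ylabel("dy/dx")
plt.show()
# e.g. a*ln(x) would give dy/dx ~ a/x, while a linear stretch gives a flat plateau
```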
Try grokking
What I would do (and I don't really have a clue):
Plot points, not a line; you don't know what's between the points, so your line is just an assumption.
Cut away the first few points to get rid of this strange jump.
If there is a physical theory behind it, use it.
If you want to do ML: fit your basic function on 10-20% of your points and compare against the rest of the data points (rough sketch below).
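A rough sketch of that split, assuming x, y sorted by x and a log model as a stand-in for whatever basic function you pick:

```python
import numpy as np
from scipy.optimize import curve_fit

def ln_model(x, a, b):
    return a * np.log(x) + b

n_fit = int(0.2 * len(x))                        # first ~20% of the points
params, _ = curve_fit(ln_model, x[:n_fit], y[:n_fit])

# score the fit on the points it never saw
resid = y[n_fit:] - ln_model(x[n_fit:], *params)
print("held-out RMSE:", np.sqrt(np.mean(resid ** 2)))
```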
It looks like you have some bad data at the beginning of the curve.
If you need extrapolation: maybe try a symbolic regressor like gplearn. It tries out different combinations of functions, from simple to complex, based on a genetic algorithm. You can also restrict the set of allowed functions. I have never tried it, though.
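An untested sketch (I haven't run gplearn either); function_set restricts the building blocks the genetic search may combine, and all the values below are placeholders:

```python
from gplearn.genetic import SymbolicRegressor

est = SymbolicRegressor(
    population_size=2000,
    generations=20,
    function_set=('add', 'sub', 'mul', 'div', 'log', 'sqrt'),
    parsimony_coefficient=0.01,   # penalizes needlessly complex expressions
    random_state=0,
)
est.fit(x.reshape(-1, 1), y)      # x, y assumed as before
print(est._program)               # the evolved symbolic expression
```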
Or maybe a smoothing spline; those can also extrapolate. Maybe LSQUnivariateSpline from scipy, where you can set the knot positions yourself, which would probably let you get a decent fit with fewer parameters (the fewer, the better it extrapolates).
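A sketch of that one too (x, y assumed with x increasing; the interior knot positions in t are hypothetical and entirely your call):

```python
import numpy as np
from scipy.interpolate import LSQUnivariateSpline

t = [15, 30, 45]                                 # hand-picked interior knots
spl = LSQUnivariateSpline(x, y, t, k=3, ext=0)   # ext=0: extrapolate beyond the data

print(spl(np.array([65.0, 70.0])))               # predictions past x=60
```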