this post was submitted on 12 Nov 2023

Machine Learning

I have quite a large dataset of historical games for a sport. Generally speaking, what is the best way to predict the winners of these games?

Currently I have a program that transforms every game into a bunch of features (participant ages on game day, their wins, stats at the time, etc.), with a binary label indicating whether team 0 or team 1 won. I guess my questions are:

  1. Generally speaking, when training a complex model for something like game prediction, where it's hard to determine whether a parameter is particularly useful or not, is it better to just have as many parameters as possible? Or can too many be detrimental? For example, I could have a single parameter for "career minutes played", or I could have career minutes played overall and also per quarter, since players could have varying experience in certain parts of the game.

  2. What kind of model architecture is generally considered best for something like this, where we have 100s of input parameters all boiling down to a probability of the outcome being 0 or 1? Currently I am trying both random forest classification and feed-forward neural nets. If neural networks are the avenue I should pursue, is it generally agreed that bigger is better for FNNs? More hidden layers? Larger hidden layers?
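For what it's worth, trying both model families side by side is cheap with scikit-learn. A minimal sketch, using synthetic data as a stand-in for the real games dataset (all names and sizes here are illustrative, not the original program):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the games dataset: hundreds of features,
# binary label (1 if team 1 won, 0 if team 0 won).
X, y = make_classification(n_samples=2000, n_features=100, n_informative=20,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Random forest classifier.
rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_tr, y_tr)
# Small feed-forward neural net (two hidden layers).
nn = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=1000,
                   random_state=0).fit(X_tr, y_tr)

rf_acc = accuracy_score(y_te, rf.predict(X_te))
nn_acc = accuracy_score(y_te, nn.predict(X_te))
# Win probabilities for team 1, rather than hard 0/1 predictions:
p_win = rf.predict_proba(X_te)[:, 1]
```

Comparing held-out accuracy like this answers the "which architecture" question empirically for your specific data, which tends to be more reliable than general rules of thumb.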

top 2 comments
[–] Ty4Readin@alien.top 1 points 1 year ago

Couple of things to break down here.

You call them "parameters", but we would normally call those "features" — just a small note.

Your two questions are pretty similar:

Q1. Is it better to have more features or fewer features?

Q2. Is it better to have a more complex/larger model or simpler/smaller model (like a neural network)?

The answer to both is: it depends!

When you add more features and make your model larger/more complex, your model is able to capture more complex patterns, which could be beneficial or could be harmful!

You should read up on overfitting vs underfitting error. Generally speaking, you can reduce underfitting error by adding features and increasing model complexity, but that usually comes at the cost of increased overfitting error.

The question then becomes: does the reduction in underfitting error outweigh the increase in overfitting error?

The only way to know for sure is usually to test both approaches on a validation set and choose the model and feature set that performs best.
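In practice that comparison can be as simple as cross-validating the same model on two candidate feature sets. A sketch with scikit-learn, on synthetic data (the column split is purely illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# shuffle=False keeps the informative columns first, so the slice below
# really does select a smaller but still meaningful feature set.
X, y = make_classification(n_samples=1000, n_features=50, n_informative=10,
                           shuffle=False, random_state=0)

X_small = X[:, :10]  # e.g. coarse aggregates like career minutes played
X_large = X          # aggregates plus per-quarter breakdowns, etc.

model = RandomForestClassifier(n_estimators=200, random_state=0)
score_small = cross_val_score(model, X_small, y, cv=5).mean()
score_large = cross_val_score(model, X_large, y, cv=5).mean()
# Keep whichever feature set scores better on the held-out folds.
```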

[–] DatYungChebyshev420@alien.top 1 points 1 year ago

When I do sports analysis, xgboost, elastic nets, and MARS models are my friends. Stack a few together. Tune them well.

Sports data is usually as structured and clean as anything in the world, so I don’t think a big neural network will be necessary or helpful.

Lastly, I recommend modeling the proportion of points scored by the home team rather than winner/loser as a binary outcome, as this is more informative.

I recommend starting with as many variables as you can, fitting your model, and seeing how many variables you can cut out before your cross-validated performance starts dropping substantially.
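That prune-until-performance-drops loop is roughly what scikit-learn's `RFECV` automates: it recursively drops the weakest features and keeps the subset with the best cross-validated score. A sketch on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

X, y = make_classification(n_samples=600, n_features=30, n_informative=8,
                           random_state=0)

selector = RFECV(
    RandomForestClassifier(n_estimators=100, random_state=0),
    step=2,              # drop 2 features per elimination round
    cv=3,
    scoring="accuracy",
).fit(X, y)

n_kept = selector.n_features_      # size of the best-scoring feature subset
X_reduced = selector.transform(X)  # data restricted to the kept features
```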