Machine Learning

[D] What ML model to use for this mobility problem? (alien.top)

submitted 2 years ago by Zijdehoen@alien.top to c/machinelearning@academy.garden


Hi guys, I've been asked to fit an ML model to a huge mobility dataset that I have. I tried some models but failed to get a decent [performance metric], so I'd love a fresh set of eyes on this!

Features of the dataset

Each row represents the data for a certain "Origin-Destination" pair, for example pair "31 to 493" meaning this is from place 31 to place 493. The first feature is thus called pair.
For each mode of transport (drive, bike, walk, transit) there are 3 "cost"-features namely: [mode]_time, [mode]_cost, [mode]_convenience. So there are 12 features in total (4 modes x 3 costs)
Some extra features: average_income, cars_per_household, jobs_at_destination (representing the people who travel in this pair).
4 observed features, one for each mode. These are the targets to predict. Each is a value between 0 and 1, representing the fraction of people in this pair who use this mode of transport.

Additional information

Sometimes the 3 costs for "transit" are 999, meaning that there is no transit option (train, tram, ...) available for this pair. The usual costs lie between 0 and 100.
I deleted the walk_cost feature because every entry was 0.
Here are the distributions of all the features:

https://preview.redd.it/8lt001so6k0c1.png?width=2002&format=png&auto=webp&s=adfe645606746c008941e36fbb35261d0600a8bb

https://preview.redd.it/yeqx4yro6k0c1.png?width=2046&format=png&auto=webp&s=011d05795c203d57d9402805377b0e8741c2378c

https://preview.redd.it/7dtk0gso6k0c1.png?width=2042&format=png&auto=webp&s=6ef5ed2d5f9044dfede43a765ae19afb5c487486

And the correlation matrix:

https://preview.redd.it/61uwu9fx6k0c1.png?width=1718&format=png&auto=webp&s=d299ca5ffa97dadd45804a62ed2b17628f51d0f9
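The 999 sentinel and the all-zero walk_cost column described above could be handled in a small cleaning pass. This is only a sketch: the column names follow the [mode]_time / [mode]_cost / [mode]_convenience pattern from the post, while `has_transit`, `preprocess`, and the list-of-dicts row format are assumptions made for illustration.

```python
SENTINEL = 999  # value the post says marks "no transit option"
TRANSIT_COLS = ["transit_time", "transit_cost", "transit_convenience"]

def preprocess(rows):
    """Return cleaned copies of the rows (a non-empty list of dicts).

    - adds a hypothetical 0/1 flag `has_transit`
    - replaces the 999 sentinel with None (i.e. "missing")
    - drops every column that has the same value in all rows
    """
    cleaned = []
    for row in rows:
        new_row = dict(row)  # copy so the input is not mutated
        no_transit = all(row[c] == SENTINEL for c in TRANSIT_COLS)
        new_row["has_transit"] = 0 if no_transit else 1
        if no_transit:
            for c in TRANSIT_COLS:
                new_row[c] = None  # convert to NaN when building arrays
        cleaned.append(new_row)

    # Columns like walk_cost (always 0) carry no information.
    constant = [c for c in cleaned[0] if len({r[c] for r in cleaned}) == 1]
    for r in cleaned:
        for c in constant:
            del r[c]
    return cleaned

# Made-up toy rows, only to show the shape of the data.
rows = [
    {"pair": "31 to 493", "drive_time": 12.0, "walk_cost": 0,
     "transit_time": 999, "transit_cost": 999, "transit_convenience": 999},
    {"pair": "31 to 77", "drive_time": 30.0, "walk_cost": 0,
     "transit_time": 25.0, "transit_cost": 3.0, "transit_convenience": 40.0},
]
cleaned = preprocess(rows)
```

Leaving 999 in place would put "no transit" far outside the usual 0 to 100 range, so an explicit flag plus a missing value is one common alternative; whether it helps here would have to be checked on the real data.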

So the goal is to predict those 4 observed features. I am very curious which ML model you would use for this, and why.

If you have any other suggestions, e.g. pre-processing techniques on the data/features, do share!

Thank you guys!


[–] seanv507@alien.top 1 points 2 years ago (1 children)

Can you provide more detail on the 'place' IDs?

Are they related in any way?

Is it just zip codes? Or types of location (home/work/gym/restaurant)?

Basically you need to make the similarities explicit.

How you can make them explicit depends on what exactly these are.

Are you trying to predict for a pair of unknown place IDs, or is each place ID known, just not that particular combination?
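The last question can be checked directly on the data. Below is a rough sketch, assuming the pair labels are stored exactly like the "31 to 493" example in the post; `split_pair` and `unseen_places` are hypothetical helper names.

```python
def split_pair(pair):
    """Split a label like '31 to 493' into (origin, destination) IDs."""
    origin, destination = pair.split(" to ")
    return int(origin), int(destination)

def unseen_places(train_pairs, test_pairs):
    """Return the test pairs that contain a place ID never seen in training."""
    known = set()
    for p in train_pairs:
        known.update(split_pair(p))
    return [p for p in test_pairs if not set(split_pair(p)) <= known]

# Made-up labels for illustration.
train_pairs = ["31 to 493", "31 to 77", "77 to 493"]
test_pairs = ["493 to 31", "31 to 500"]
new_place_pairs = unseen_places(train_pairs, test_pairs)
```

If that list comes back empty, every test place is already known and only the combinations are new; otherwise the model has to generalise to places it has never seen.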

[–] Zijdehoen@alien.top 1 points 2 years ago

The pairs are literally just codes. They don't have any numeric meaning or anything. Each one just represents a certain trajectory (2 locations).

[–] No-Painting-3970@alien.top 1 points 2 years ago

Boost boost boost. It's tabular, so I wouldn't look any further than XGBoost, LightGBM, etc.
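For anyone unfamiliar with what "boosting" does, here is a toy pure-Python sketch of gradient boosting with decision stumps under squared error. It is not XGBoost or LightGBM and is no substitute for them; the function names, the hyperparameters, and the tiny dataset are all made up for illustration.

```python
def fit_stump(X, residuals):
    """Best single-feature threshold split (least squares) on the residuals.

    Assumes at least one feature takes more than one distinct value.
    """
    best = None
    for j in range(len(X[0])):
        values = sorted({row[j] for row in X})
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [r for row, r in zip(X, residuals) if row[j] <= thr]
            right = [r for row, r in zip(X, residuals) if row[j] > thr]
            lmean = sum(left) / len(left)
            rmean = sum(right) / len(right)
            sse = (sum((r - lmean) ** 2 for r in left)
                   + sum((r - rmean) ** 2 for r in right))
            if best is None or sse < best[0]:
                best = (sse, j, thr, lmean, rmean)
    return best[1:]  # (feature index, threshold, left value, right value)

def fit_boosting(X, y, n_rounds=50, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the residuals."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [t - p for t, p in zip(y, preds)]
        j, thr, lval, rval = fit_stump(X, residuals)
        stumps.append((j, thr, lval, rval))
        preds = [p + learning_rate * (lval if row[j] <= thr else rval)
                 for row, p in zip(X, preds)]
    return base, stumps, learning_rate

def predict(model, row):
    base, stumps, lr = model
    return base + sum(lr * (lval if row[j] <= thr else rval)
                      for j, thr, lval, rval in stumps)

def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up toy data: [transit_time, drive_time] -> share of people driving.
X = [[5.0, 30.0], [10.0, 20.0], [40.0, 15.0], [60.0, 12.0]]
y = [0.2, 0.3, 0.8, 0.9]

model = fit_boosting(X, y)
baseline_mse = mse(y, [sum(y) / len(y)] * len(y))
model_mse = mse(y, [predict(model, row) for row in X])
```

With four targets, one option is to fit one such model per mode and rescale the four predictions afterwards, assuming the shares are meant to sum to 1 for each pair.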