this post was submitted on 15 Nov 2023

Machine Learning


Hi guys, I've been asked to fit an ML model to a huge mobility dataset that I have. I tried some models but failed to get a decent [performance metric], so I'd love a fresh set of eyes on this!

Features of the dataset

  • Each row represents the data for a certain "Origin-Destination" pair. For example, pair "31 to 493" means the trip goes from place 31 to place 493. The first feature is thus called pair.
  • For each mode of transport (drive, bike, walk, transit) there are 3 "cost" features, namely [mode]_time, [mode]_cost, and [mode]_convenience. So there are 12 cost features in total (4 modes x 3 costs).
  • Some extra features: average_income, cars_per_household, jobs_at_destination (representing the people who travel in this pair).
  • 4 observed features, one for each mode. These are the features to predict. Each is a value between 0 and 1, representing the share of people in this pair who use this mode of transport.

Additional information

  • Sometimes the 3 costs for "transit" are 999, meaning that there is no transit option (train, tram, ...) available for this pair. The usual costs lie between 0 and 100.
  • I deleted the walk_cost feature because every entry was 0.
  • Here are the distributions of all the features:

https://preview.redd.it/8lt001so6k0c1.png?width=2002&format=png&auto=webp&s=adfe645606746c008941e36fbb35261d0600a8bb

https://preview.redd.it/yeqx4yro6k0c1.png?width=2046&format=png&auto=webp&s=011d05795c203d57d9402805377b0e8741c2378c

https://preview.redd.it/7dtk0gso6k0c1.png?width=2042&format=png&auto=webp&s=6ef5ed2d5f9044dfede43a765ae19afb5c487486

  • And the correlation matrix:

https://preview.redd.it/61uwu9fx6k0c1.png?width=1718&format=png&auto=webp&s=d299ca5ffa97dadd45804a62ed2b17628f51d0f9
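One way the 999 placeholder described above could be handled is sketched below. This is only an illustration, assuming pandas and column names that follow the [mode]_time pattern (transit_time, transit_cost, transit_convenience); the toy values and the transit_available flag are made up for the example and are not part of the dataset.

```python
import numpy as np
import pandas as pd

# Toy rows with made-up values; the column names are assumptions that
# follow the [mode]_time / [mode]_cost / [mode]_convenience pattern.
df = pd.DataFrame({
    "pair": ["31 to 493", "12 to 88"],
    "transit_time": [35.0, 999.0],
    "transit_cost": [4.0, 999.0],
    "transit_convenience": [60.0, 999.0],
})

transit_cols = ["transit_time", "transit_cost", "transit_convenience"]

# Hypothetical helper column: 1 if a transit option exists, 0 otherwise.
df["transit_available"] = (df[transit_cols] != 999).all(axis=1).astype(int)

# Replace the 999 placeholder with NaN so it is not read as a huge cost.
df[transit_cols] = df[transit_cols].replace(999, np.nan)

print(df)
```

Keeping an explicit flag means the "no transit" information survives even if the NaN values are later imputed.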

So the goal is to predict those 4 observed features. I am very curious which ML model you would use for this, and why.

If you have any other suggestions, e.g. pre-processing techniques on the data/features, do share!

Thank you guys!

top 3 comments
[–] seanv507@alien.top 1 points 11 months ago (1 children)

Can you provide more detail on the 'place' ids?

Are they related in any way?

Are they just zip codes? Or types of location (home/work/gym/restaurant)?

Basically, you need to make the similarities explicit.

How you can make them explicit depends on what exactly these are.

Are you trying to predict a pair of unknown place ids, or is each place id known but not that particular combination?

[–] Zijdehoen@alien.top 1 points 11 months ago

The pairs are literally just codes. They don’t have any numeric meaning or anything. Each one just represents a certain trajectory (2 locations).
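If the pair codes really look like the "31 to 493" example from the post, one thing that could be tried is splitting them into separate origin and destination columns, so that rows sharing a place are at least linked to each other. A rough sketch, assuming pandas and that exact "X to Y" string format (the origin and destination column names are made up here):

```python
import pandas as pd

# Toy pair codes in the "X to Y" format from the post's example.
df = pd.DataFrame({"pair": ["31 to 493", "31 to 12", "12 to 493"]})

# Split each code into its two place ids.
parts = df["pair"].str.split(" to ", expand=True)

# Store the ids as categories, since they carry no numeric meaning.
df["origin"] = parts[0].astype("category")
df["destination"] = parts[1].astype("category")

print(df)
```

This only helps when the same place id shows up in several pairs; it does not by itself say anything about how places relate to each other.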

[–] No-Painting-3970@alien.top 1 points 11 months ago

Boost, boost, boost. It is tabular, so I wouldn't look further than xgboost, lightgbm, etc.