Hi guys, I've been asked to fit an ML model to a huge mobility dataset, and I've tried some models but failed to get a decent [performance metric], so I'd love a fresh set of eyes on this!
Features of the dataset
- Each row represents the data for a certain "Origin-Destination" pair. For example, pair "31 to 493" means the trip goes from place 31 to place 493. The first feature is thus called pair.
- For each mode of transport (drive, bike, walk, transit) there are 3 "cost" features, namely: [mode]_time, [mode]_cost, [mode]_convenience. So there are 12 features in total (4 modes x 3 costs).
- Some extra features: average_income, cars_per_household, jobs_at_destination (representing the people who travel in this pair).
- 4 observed features, one for each mode. These are the features to predict. Each is a value between 0 and 1, representing the share of people in this pair who use this mode of transport.
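Since the four targets are shares, here is a minimal sketch of a sanity check on them. It assumes the four shares of a pair should add up to 1, which I have not verified, and the column names `obsrv_drive`, `obsrv_bike`, `obsrv_walk`, `obsrv_transit` are hypothetical stand-ins for the real ones. The `normalise` helper is just one possible way to turn raw model outputs back into valid shares.

```python
MODES = ("drive", "bike", "walk", "transit")

def check_shares(row, tol=1e-6):
    """Return a list of problems found in one row's observed mode shares.

    Assumption: target columns are named obsrv_<mode> (hypothetical names)
    and the four shares are expected to sum to 1.
    """
    problems = []
    shares = [row[f"obsrv_{m}"] for m in MODES]
    for m, s in zip(MODES, shares):
        if not 0.0 <= s <= 1.0:
            problems.append(f"obsrv_{m}={s} is outside [0, 1]")
    total = sum(shares)
    if abs(total - 1.0) > tol:
        problems.append(f"shares sum to {total:.4f}, not 1")
    return problems

def normalise(pred):
    """Clip negative predictions to 0 and rescale so the shares sum to 1."""
    clipped = [max(p, 0.0) for p in pred]
    total = sum(clipped)
    if total == 0:
        # Nothing positive to rescale: fall back to equal shares.
        return [1.0 / len(pred)] * len(pred)
    return [p / total for p in clipped]
```

If the check fails for many rows, the "sum to 1" assumption is wrong for this dataset and the rescaling step should be dropped.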
Additional information
- Sometimes the 3 costs for "transit" are 999, meaning that there is no transit option (train, tram, ...) available for this pair. The usual costs lie between 0 and 100.
- I deleted the walk_cost feature because every entry was 0.
- Here are the distributions of all the features:
- And the correlation matrix:
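On the 999 placeholder for missing transit: one pre-processing option I am considering is to stop treating 999 as a real cost. The sketch below adds an explicit availability flag and blanks out the placeholder values. The flag name `transit_available` is made up for illustration, and whether to leave the blanks as missing or impute them depends on the model, so take this as a rough idea only.

```python
SENTINEL = 999
TRANSIT_COLS = ("transit_time", "transit_cost", "transit_convenience")

def flag_missing_transit(row):
    """Return a copy of `row` with a transit_available flag (hypothetical name).

    When all three transit costs equal the 999 placeholder, the flag is 0
    and the three costs are replaced by None so they are not read as real values.
    """
    out = dict(row)  # copy, so the input row is left untouched
    available = not all(row[c] == SENTINEL for c in TRANSIT_COLS)
    out["transit_available"] = int(available)
    if not available:
        for c in TRANSIT_COLS:
            out[c] = None
    return out
```

Without something like this, a scale-sensitive model sees 999 as a cost roughly ten times larger than the usual maximum of 100, which also distorts the distributions and correlations.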
So the goal is to predict those 4 observed features. I am very curious which ML model you would use for this and why?
If you have any other suggestions, e.g. pre-processing techniques on the data/features, do share!
Thank you guys!
Can you provide more detail on the 'place' IDs?
Are they related in any way?
Are they just zip codes? Or types of location (home/work/gym/restaurant)?
Basically, you need to make the similarities explicit.
How you can make them explicit depends on what exactly these are.
Are you trying to predict for a pair of unknown place IDs, or is each place ID known, but not that particular combination?
The pairs are literally just codes. They don’t have any numeric meaning or anything. Each pair just represents a certain trajectory (2 locations).
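On the known-versus-unknown question, a quick way to check would be to split each pair into its two place codes and see which places in the test pairs never show up in the training pairs. This is a minimal sketch that assumes every pair is stored as a string in the "31 to 493" form from my example. The function names are made up, and the codes are kept as strings because they have no numeric meaning.

```python
def split_pair(pair):
    """Split a pair such as "31 to 493" into (origin, destination) codes.

    Assumption: every pair uses the "<origin> to <destination>" string format.
    """
    origin, destination = pair.split(" to ")
    return origin.strip(), destination.strip()

def unseen_places(train_pairs, test_pairs):
    """Return the place codes that occur in test_pairs but in no training pair."""
    known = set()
    for p in train_pairs:
        known.update(split_pair(p))
    unseen = set()
    for p in test_pairs:
        unseen.update(place for place in split_pair(p) if place not in known)
    return unseen
```

An empty result would mean every place is known and only the combinations are new.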