this post was submitted on 16 Nov 2023
Machine Learning
hi, data monger here :D. I'm assuming you need to procure data rather than use synthetic data or data that already exists (internally or online).
first step is having good judgement about what kind of ML model you'd use (or whether ML is the right approach to begin with). for instance, simple tasks can be done with decision trees or nearest-neighbor, while more complex tasks might require fine-tuning an existing model you pull from huggingface. what's the simplest one for the job that you can get away with?
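that "start simple" step can be sketched like this — compare cheap baselines before reaching for anything heavier. the synthetic dataset and model choices below are placeholders for illustration, not from the original comment:

```python
# "Simplest model you can get away with": score cheap baselines first.
# Uses scikit-learn; make_classification stands in for your real task.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

candidates = {
    "decision tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "nearest neighbor": KNeighborsClassifier(n_neighbors=5),
}
# mean 5-fold cross-validated accuracy per baseline
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in candidates.items()}
# only escalate to a fine-tuned huggingface model if these fall short
```

if a baseline already clears your quality bar, you've also just lowered your data bill, since simpler models need fewer labeled points.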
once the choice of model is settled, you need to estimate the quantity of data. a simple model can be fit with ~200 data points; an expressive model might need 100k-ish tokens. so how much does it cost to get one data point? that question is the crux of your work as a data curator.
to answer it, you need to build an annotation interface. the annotators could be in-house, contractors, or crowd-sourced via a website. either way, you'll wind up spending a good chunk of time getting the UX smooth.
it does NOT need to look pretty (you're paying people to do work with this interface), but it NEEDS to be clear, and at no point should an annotator get "lost" about what to do. every design decision in your annotation UX translates multiplicatively into your overall labeling cost. so iterate relentlessly: run a pilot with 2 participants and watch over them, then with 4, 8, etc.
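here's a minimal sketch of such an annotation loop, as a CLI rather than a web UI; the examples, label set, and prompts are all made up for illustration. note how it times each label (for costing later) and re-prompts on unclear input so the annotator never gets stuck:

```python
# Minimal command-line annotation loop -- a stand-in for a real annotation UI.
# The example texts and label set below are hypothetical.
import time

EXAMPLES = ["the food was great", "terrible service", "it was fine"]
LABELS = {"p": "positive", "n": "negative", "x": "neutral"}

def annotate(examples, get_input=input):
    """Collect one label per example, timing each to estimate cost later."""
    rows = []
    for text in examples:
        start = time.monotonic()
        while True:
            choice = get_input(f"{text!r} -> [p]os / [n]eg / [x] neutral: ").strip().lower()
            if choice in LABELS:
                break
            # never let the annotator get "lost": explain and re-prompt
            print("didn't catch that, please answer p, n, or x")
        rows.append({"text": text,
                     "label": LABELS[choice],
                     "seconds": round(time.monotonic() - start, 2)})
    return rows
```

the `get_input` parameter also makes the loop testable without a human, which is handy when you iterate on the flow between pilot rounds.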
once your UX is ironed out and you've pilot tested it on a few participants, you will have an _accurate cost guesstimate_: take how many data points you got in your pilot study, multiply by the average time it took to procure each one, and multiply by what you pay annotators per hour. with this estimate, you will (as the competent ML person you are) have a sense of how well your model will perform once trained on a comparatively larger set of data of this quality. you'll get a number; it could be $1,000 or $100k. then you need to figure out the finances to finally get the dataset out of the way.
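the back-of-envelope math looks like this (the pilot numbers and hourly rate below are hypothetical, just to show the scaling):

```python
# Back-of-envelope labeling cost, scaled up from a pilot study.
# All concrete numbers here are hypothetical examples.

def labeling_cost(target_points, pilot_points, pilot_hours, hourly_rate):
    """Scale pilot throughput up to the full dataset size."""
    hours_per_point = pilot_hours / pilot_points  # avg time per data point
    return target_points * hours_per_point * hourly_rate

# pilot: 100 points labeled in 5 hours; annotators at $20/hr; need 10k points
estimate = labeling_cost(target_points=10_000, pilot_points=100,
                         pilot_hours=5, hourly_rate=20)
print(f"${estimate:,.0f}")  # -> $10,000
```

swap in your own pilot throughput and rates; the point is that per-point time measured in the pilot is the number everything else hangs off.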
hope this helped! it's very "dirty" work but extremely powerful if done right.