It's quite simple: garbage in, garbage out. The data they use for training needs to be curated. How you'd curate the entire internet, I have no clue.
The real answer would be "don't". Have a decent whitelist of reliable sources for the training data. Don't just add every orifice of the internet (like Reddit) to the training data. Limitations would be good in this case.
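The basic idea is just a domain allow-list applied before anything enters the corpus. Here's a minimal sketch of what I mean, with a made-up `ALLOWED_DOMAINS` set and record format, not anyone's actual pipeline:

```python
from urllib.parse import urlparse

# Hypothetical allow-list; a real one would need far more domains and human review.
ALLOWED_DOMAINS = {"wikipedia.org", "arxiv.org", "docs.python.org"}

def is_whitelisted(url: str) -> bool:
    """True if the URL's host is an allowed domain or a subdomain of one."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in ALLOWED_DOMAINS)

def filter_corpus(records):
    """Keep only training records whose source URL passes the whitelist."""
    return [r for r in records if is_whitelisted(r["url"])]

corpus = [
    {"url": "https://en.wikipedia.org/wiki/Machine_learning", "text": "..."},
    {"url": "https://www.theonion.com/some-satire-headline", "text": "..."},
]
print(filter_corpus(corpus))  # only the Wikipedia record survives
```

Crude, but the point is it's an opt-in list: satire sites never get a chance to end up in the training set in the first place.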
It's worse than Reddit; they've been pulling data from The Onion.
Is that for real?
It's been quoting some Onion articles verbatim, so either they pulled from The Onion directly or from somewhere that reposts Onion articles.
Just train it on Linux help forum replies, because everyone there is always 100% right.
Having a curated whitelist would definitely be a good idea, but if it only shows information from a limited list of websites, that would make it a terrible search engine incapable of searching most of the web.
They already have a curated data set. It's called Google Scholar.