LocalLLaMA

which is the best model (finetuned or base) to extract structured data from a bunch of text? (alien.top)

submitted 11 months ago by sandys1@alien.top to c/localllama@poweruser.forum

hi folks,

Simple question really: what model (finetuned or otherwise) have you found that can extract structured data from a bunch of text?

I'm happy to finetune, so if there are any successes there, would really appreciate some pointers in the right direction.

Really looking for a starting point here. I'm aware of the DETR class of models and how Microsoft trained its table-transformer models on DETR. Wondering if something similar can be done with Llama 2 and similar models?

P.S. cannot use GPT because of sensitive PII data.

[–] georgejrjrjr@alien.top 1 points 11 months ago (1 children)

I’ve wondered this, and hope you get better answers.

One thing you could do if it fits your use case: align GDELT entries with news stories in the realnews dataset on Hugging Face, then train a model to output the extracted info from the article.

Another is to have GPT-4 do some examples on lightly faked / anonymized data and then distill that into a model that does well on information extraction evals (which are a thing iirc).
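A minimal sketch of the "lightly faked / anonymized" step, assuming a regex pass is acceptable as a first filter. It is not enough on its own for real PII (names, addresses, and ID numbers need much more than this). The patterns and the `anonymize` helper are hypothetical and only illustrate the idea:

```python
import re

# Hypothetical patterns for illustration; real PII detection needs far
# broader coverage than emails and phone-like digit runs.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def anonymize(text: str) -> str:
    """Replace emails and phone-like digit runs with fixed placeholders."""
    text = EMAIL_RE.sub("<EMAIL>", text)
    text = PHONE_RE.sub("<PHONE>", text)
    return text


sample = "Contact jane.doe@example.com or +65 6123 4567 today."
masked = anonymize(sample)
print(masked)
```

The masked text is what would go to the teacher model; the original never leaves your machine.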

[–] sandys1@alien.top 1 points 11 months ago

What are the information extraction evals? Do you have a link?

[–] Iamisseibelial@alien.top 1 points 11 months ago (1 children)

If it's sensitive, why not use Claude to get a baseline of what you want, i.e. examples? Since they are SOC 2 / HIPAA, unless you're dealing with national security stuff you should be good to go there. Then get enough examples done to train a specialized model.

[–] sandys1@alien.top 1 points 11 months ago (1 children)

It has nothing to do with national security. It has to do with audit and compliance. SOC 2 and HIPAA are not the only compliance artifacts out there. There are multiple others (including cross-national ones like Singapore's PDPA, etc).

This is why OpenAI was FORCED to offer custom models as a service.

Again, I don't want this thread to devolve into a regulatory debate, but I have fought long, extended battles in court on these topics: these things are not possible.

[–] Iamisseibelial@alien.top 1 points 11 months ago

Ohh that's absolutely fair, especially when dealing with Singapore, SK or Japan. APPI and PIPA are a pain in the ass to deal with. That said, making fake versions of the data for examples is likely the best route to actually being able to train your own model then.
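For what it's worth, a rough sketch of what "fake versions of the data" could look like as fine-tuning pairs. Everything here is an assumption for illustration: the value pools, templates, field names, and the prompt/completion layout are made up, not a format any particular trainer requires.

```python
import json
import random

# Hypothetical value pools; every name, city, and email here is invented.
NAMES = ["Alex Tan", "Priya Nair", "Kenji Sato"]
CITIES = ["Singapore", "Seoul", "Osaka"]
TEMPLATES = [
    "{name} from {city} can be reached at {email}.",
    "Please update the record for {name} ({email}), now based in {city}.",
]


def make_example(rng: random.Random) -> dict:
    """Build one (free text, JSON target) pair from fake field values."""
    name = rng.choice(NAMES)
    city = rng.choice(CITIES)
    email = name.lower().replace(" ", ".") + "@example.com"
    fields = {"name": name, "city": city, "email": email}
    text = rng.choice(TEMPLATES).format(**fields)
    # The completion is the structured record the model should learn to emit.
    return {"prompt": text, "completion": json.dumps(fields)}


def make_dataset(n: int, seed: int = 0) -> list[dict]:
    """Generate n pairs; a fixed seed makes the dataset reproducible."""
    rng = random.Random(seed)
    return [make_example(rng) for _ in range(n)]
```

Writing each row out with `json.dumps(row)`, one per line, gives a JSONL file. In practice the templates would need to mirror the real documents' layout and noise closely, otherwise the finetuned model will not transfer well.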