LocalLLaMA


A community for discussing Llama, the family of large language models created by Meta AI.


What do you think about GPT-isms polluting datasets? Do you consider them a problem? If so, how big of a problem do you think it is? (alien.top)

submitted 2 years ago by OC2608@alien.top to c/localllama@poweruser.forum


It's no secret that many language models and fine-tunes are trained on datasets, many of which are generated with GPT models. The problem arises when "GPT-isms" end up in the dataset. I'm not only referring to typical expressions like "however, it's important to..." or "I understand your desire to...", but also to the structure of the model's responses. ChatGPT (and GPT models in general) tends to have a very predictable structure when it's in its "soulless assistant" mode, which makes it very easy to say "this is very GPT-like".

What do you think about this? Oh, and by the way, forgive my English.
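To make the phrase-level part of this concrete, here is a minimal sketch of how one might flag such expressions in a dataset. It's only an illustration: the pattern list holds just the two phrases quoted above, the names `GPTISM_PATTERNS`, `count_gptisms` and `filter_dataset` are made up for this example, and it does nothing about the structural GPT-isms, which are much harder to catch with simple matching.

```python
import re

# Hypothetical pattern list: only the two phrases quoted in the post.
# A real filter would need a much longer, hand-curated list.
GPTISM_PATTERNS = [
    r"however, it['’]?s important to",
    r"i understand your desire to",
]

# One case-insensitive regex that matches any of the patterns.
_GPTISM_RE = re.compile("|".join(GPTISM_PATTERNS), re.IGNORECASE)


def count_gptisms(text: str) -> int:
    """Return how many GPT-ism phrase matches appear in the text."""
    return len(_GPTISM_RE.findall(text))


def filter_dataset(samples, max_hits=0):
    """Keep only the samples with at most max_hits phrase matches."""
    return [s for s in samples if count_gptisms(s) <= max_hits]


if __name__ == "__main__":
    samples = [
        "However, it's important to remember that every story has a lesson.",
        "The dragon burned the village and nobody learned anything.",
    ]
    for kept in filter_dataset(samples):
        print(kept)
```

Dropping every sample that matches is the bluntest option; raising `max_hits` or only logging the counts would be gentler ways to see how much of a dataset is affected.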


[–] noeda@alien.top 1 points 2 years ago

I think the GPT-isms may be why my AI storywriting attempts tend to be overly positive and cliched. Not exactly a world-shattering problem, but it is annoying *shakes fist*.

If I had to name a possibly serious problem, it's that the biases OpenAI initially built into ChatGPT and their GPT models are now spreading to local models as well.

It's annoying because it feels like all models respond to questions in a similar way. Some are just a bit smarter than others or tuned to respond a bit differently.

If GPT-like data spreads around the Internet as well, it might become difficult to keep it out of training data unless you only train on old data.