this post was submitted on 12 Nov 2023

Machine Learning

I'm looking for some advice regarding a project idea I have. I would like to predict the Big Five personality traits of authors based on an analysis of their writing samples. However, would I need some authors to have taken the Big Five personality assessment, so that I have a training set with those results, in order to do a project like this? Or is there a way to "guess" what certain writing patterns would correlate with? What would be a potential strategy for orienting an ML project like this?

[–] sshh12@alien.top 1 points 10 months ago (2 children)

As you mentioned, one definitely tricky part is the dataset. In an ideal world, you'll have a supervised dataset of (document, personality type) pairs and you can train a model on these (just like u/Veggies-are-okay mentioned).

Assuming you don't have this data, a couple options:

  • Make the data. Some quick Google searches show that many celebrities do have known Big Five profiles. You could manually curate those profiles together with text written by these celebrities to build the pairs.
  • Use synthetic data. Try asking an LLM (like ChatGPT) to write a text on a random topic as if the author had a $RANDOM Big Five profile, then use the results as your training pairs.
  • Try clustering. Potentially, similar personality types have similar embeddings. Take a dataset of writings, embed them using something like BERT, label (or best-effort guess) a few, and then predict personalities based on proximity to a piece of known Big Five text in the embedding space. You could extend this to training a model that asks "do text A and text B display the same Big Five profile?", which could be an easier problem to get samples for, and then run this model against a set of known Big Five examples and your unknown example.
  • Use a proxy. There might be datasets/models out there that predict heuristics that could be combined to estimate the Big Five. For example, maybe a sentiment score is correlated with agreeableness. You might also be able to create word/phrase banks such that using certain phrases is potentially indicative of a leaning on a Big Five trait ("has_neurotic_phrases" is then a feature in your model).
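
To make the clustering/proximity idea a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: `embed` is a toy word-count stand-in for a real embedding model such as BERT, the labeled texts and their scores are made up, and `predict_traits` is a hypothetical helper that just averages the scores of the nearest labeled texts.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a real embedding (e.g. BERT): a bag of word counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict_traits(text, labeled, k=2):
    """Average the trait scores of the k labeled texts most similar to `text`."""
    query = embed(text)
    ranked = sorted(
        labeled,
        key=lambda pair: cosine(query, embed(pair[0])),
        reverse=True,
    )
    neighbors = ranked[:k]
    traits = neighbors[0][1].keys()
    return {
        t: sum(scores[t] for _, scores in neighbors) / len(neighbors)
        for t in traits
    }


# Hypothetical hand-labeled examples; the scores in [0, 1] are invented.
labeled = [
    ("i worry about everything and cannot sleep",
     {"neuroticism": 0.9, "extraversion": 0.2}),
    ("i worry that the plan will fail again",
     {"neuroticism": 0.8, "extraversion": 0.3}),
    ("we love parties and meeting new people",
     {"neuroticism": 0.2, "extraversion": 0.9}),
]

prediction = predict_traits("i worry about the party", labeled, k=2)
print(prediction)
```

Swapping `embed` for a real sentence-embedding model would leave the rest of the logic unchanged; whether proximity in that space actually tracks personality is exactly the open question this approach would test.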
[–] gettotea@alien.top 1 points 10 months ago

Really great ideas. The synthetic data one is worth going deeper into.

[–] Veggies-are-okay@alien.top 1 points 10 months ago

Will back up the ChatGPT advice. It’s really amazing how much value LLMs can have in creating synthetic data.
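
For anyone who wants to try the synthetic-data route, a rough sketch of the prompt-generation side might look like this. It only builds (prompt, label) pairs and does not call any LLM API; the topic list, the low/high levels, the prompt wording, and the helper names are all assumptions for illustration.

```python
import random

TRAITS = [
    "openness",
    "conscientiousness",
    "extraversion",
    "agreeableness",
    "neuroticism",
]
# Hypothetical topics; any list of neutral writing prompts would do.
TOPICS = ["a rainy commute", "learning to cook", "moving to a new city"]


def random_profile(rng):
    """Pick a random low/high level for each Big Five trait."""
    return {trait: rng.choice(["low", "high"]) for trait in TRAITS}


def build_prompt(profile, topic):
    """Turn a profile and a topic into an instruction for an LLM."""
    description = ", ".join(f"{level} {trait}" for trait, level in profile.items())
    return (
        f"Write a short first-person text about {topic} as if written by "
        f"someone with {description}. Do not name the traits explicitly."
    )


def make_samples(n, seed=0):
    """Build n (prompt, label) pairs; the LLM output would replace the prompt."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        profile = random_profile(rng)
        topic = rng.choice(TOPICS)
        samples.append({"prompt": build_prompt(profile, topic), "label": profile})
    return samples


samples = make_samples(3)
for sample in samples:
    print(sample["prompt"])
```

Each prompt would be sent to the LLM of your choice, and the generated text paired with `label` becomes one training example. It is probably worth checking a handful of outputs by hand, since the model may exaggerate traits in ways real authors do not.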