this post was submitted on 12 Nov 2023

Machine Learning

I'm looking for some advice on a project idea. I would like to predict authors' Big Five personality traits from an analysis of their writing samples. Would I need a training set built from authors who have actually taken a Big Five personality assessment? Or is there a way to "guess" which traits certain writing patterns correlate with? What would be a sensible strategy for setting up an ML project like this?

top 4 comments
[–] Veggies-are-okay@alien.top 1 points 10 months ago

Since it’s writing style, it’s unstructured data (as opposed to tabular), so a neural network is a good fit. Because you’re looking at text, you have two options:

  1. theoretical: rnn -> lstm -> transformer

More so if you’re into the inner workings. Recurrent neural networks (RNNs) bring in the concept of recurrence, LSTMs (long short-term memory) handle longer-range dependencies (but are a little more complicated), and finally transformers use attention instead of recurrence in their encoder/decoder stacks, which is why they have largely replaced LSTMs for text.

  2. Hugging Face! For simple classification from text this is gonna be real easy and pretty effective:

https://huggingface.co/bert-base-cased

The big thing here is how are you going to fine-tune it? You’ll need some classification outcomes to attach to your samples. Because the traits aren’t mutually exclusive, you might want to make a few binary classifiers (yes/no for a specific trait), one per trait. The link has some examples of fine-tuning too.
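To make the "one yes/no classifier per trait" idea concrete, here is a minimal sketch of how numeric trait scores could be turned into binary labels. The median split, the 1-5 scale, the function name `binarize_scores`, and the sample scores are all assumptions made for illustration; none of them come from the thread or from any particular dataset.

```python
from statistics import median

TRAITS = ["openness", "conscientiousness", "extraversion",
          "agreeableness", "neuroticism"]

def binarize_scores(samples):
    """Turn per-author trait scores into one yes/no label per trait.

    `samples` is a list of dicts mapping trait name -> numeric score.
    A trait gets label 1 when the score is strictly above that trait's
    median across all samples (an assumed threshold), otherwise 0.
    """
    thresholds = {t: median(s[t] for s in samples) for t in TRAITS}
    labels = [{t: int(s[t] > thresholds[t]) for t in TRAITS} for s in samples]
    return labels, thresholds

# Made-up scores on an assumed 1-5 scale, for illustration only.
scores = [
    {"openness": 4.2, "conscientiousness": 3.1, "extraversion": 2.0,
     "agreeableness": 4.5, "neuroticism": 1.8},
    {"openness": 2.5, "conscientiousness": 4.0, "extraversion": 3.5,
     "agreeableness": 3.0, "neuroticism": 3.9},
    {"openness": 3.6, "conscientiousness": 2.2, "extraversion": 4.1,
     "agreeableness": 3.8, "neuroticism": 2.7},
]

labels, thresholds = binarize_scores(scores)
for row in labels:
    print(row)
```

Each of the five label columns could then be the target of its own binary classifier, or all five could be fed to a single multi-label model.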

Hope this gets you off to a decent start!

[–] sshh12@alien.top 1 points 10 months ago (2 children)

Definitely one tricky part as you mentioned is the dataset. In an ideal world, you'll have a supervised dataset of (document, personality type) pairs and you can train a model on these (just like u/Veggies-are-okay mentioned).

Assuming you don't have this data, a couple options:

  • Make the data. Some quick Google searches show that many celebrities do have known Big-5s. You could manually curate Big-5s and text written by these celebrities to build these pairs.
  • Use synthetic data. Try asking an LLM (like ChatGPT) to write a text on a random topic as if they were $RANDOM big-5 then just use these results as your training pairs.
  • Try clustering. Potentially similar personality types have similar embeddings. Take a dataset of writings, embed them using something like BERT, label/best-effort-guess a few and then predict personalities based on the proximity to a piece of known big-5 text in the embedding space. You could extend this to training a model that asks "do text A and text B display the same big-5", which could potentially be an easier problem to get samples for, and then run this model against a set of known big-5s and your unknown example.
  • Use a proxy. There might be datasets/models out there that predict heuristics that could be combined to find big 5. Like maybe a sentiment score is correlated with agreeableness. You might also be able to create word/phrase banks such that using certain phrases is indicative of a leaning on big-5 ("has_neurotic_phrases" is then a feature in your model).
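Two of the options above, proximity in embedding space and phrase-bank features, can be sketched in a few lines. This is only an illustration: the phrase bank is invented, the 3-dimensional vectors are toy stand-ins for real embeddings (such as ones produced by BERT), and `nearest_label` and `cosine_similarity` are hypothetical helper names.

```python
import math

# Hypothetical phrase bank; a real one would need to be built and validated.
NEUROTIC_PHRASES = ("i worry", "can't stop thinking", "what if")

def has_neurotic_phrases(text):
    """Return 1 if the text contains any phrase from the bank, else 0."""
    lowered = text.lower()
    return int(any(phrase in lowered for phrase in NEUROTIC_PHRASES))

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_label(query, labeled):
    """Return the label of the labeled embedding most similar to `query`."""
    _, best_label = max(
        labeled, key=lambda pair: cosine_similarity(query, pair[0])
    )
    return best_label

# Toy vectors standing in for embeddings of texts with known labels.
labeled = [
    ([0.9, 0.1, 0.0], "high_extraversion"),
    ([0.1, 0.8, 0.3], "low_extraversion"),
]

print(nearest_label([0.8, 0.2, 0.1], labeled))
print(has_neurotic_phrases("What if I fail the exam?"))
```

In practice the nearest-neighbour step would run over real embeddings of many labeled texts, and the phrase flag would be one feature among several rather than a predictor on its own.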
[–] Veggies-are-okay@alien.top 1 points 10 months ago

Will back up the ChatGPT advice. It’s really amazing how much value LLMs can have in creating synthetic data.
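For anyone who wants to try the synthetic-data route, here is a rough sketch of generating (prompt, label) rows to send to an LLM. The prompt wording, the low/high levels, the topics, and the function names are all made up for illustration, and no LLM is actually called here; the model's reply to each prompt would become the training text, paired with the sampled profile as its label.

```python
import random

TRAITS = ["openness", "conscientiousness", "extraversion",
          "agreeableness", "neuroticism"]

def make_prompt(topic, profile):
    """Build an instruction asking for text written with a given profile."""
    description = ", ".join(
        f"{level} {trait}" for trait, level in profile.items()
    )
    return f"Write a short text about {topic} as an author with {description}."

def sample_training_prompts(topics, n, seed=0):
    """Sample n random trait profiles and topics, with a fixed seed."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        profile = {trait: rng.choice(["low", "high"]) for trait in TRAITS}
        topic = rng.choice(topics)
        rows.append({"prompt": make_prompt(topic, profile), "labels": profile})
    return rows

rows = sample_training_prompts(["gardening", "commuting"], n=3)
for row in rows:
    print(row["prompt"])
```

One thing to keep in mind with this approach is that the generated text reflects how the LLM imagines each trait, which may not match how real people with those traits write.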

[–] gettotea@alien.top 1 points 10 months ago

Really great ideas. The synthetic data one is worth going deeper into.