Machine Learning

[Project] Big 5 Personality Project Question (alien.top)

submitted 2 years ago by OpenJuggernaut8556@alien.top to c/machinelearning@academy.garden

I'm looking for some advice regarding a project idea I have. I would like to predict the Big Five personality traits of authors based on an analysis of their writing samples. However, would I need to have some authors take the Big Five personality assessment so that I have a training set with those results? Or is there a way to "guess" what certain writing patterns would correlate with? What would be a good strategy for setting up an ML project like this?

Veggies-are-okay@alien.top 1 point 2 years ago

Since it’s writing style, it’s unstructured data (as opposed to tabular), so a neural network is usually the strongest option. Because you’re looking at text, you have two options:

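1. Theoretical: RNN -> LSTM -> transformer

More so if you’re into the inner workings. Recurrent neural networks (RNNs) bring in the concept of recurrence: the network carries a hidden state from one token to the next. LSTMs (long short-term memory) add gating so the network can hold on to information over longer sequences, at the cost of being a little more complicated. Finally, transformers drop recurrence in favor of self-attention, which lets every token look at every other token directly. The original design pairs an encoder with a decoder (BERT uses only the encoder), and in practice transformers have largely replaced LSTMs for text.

2. Hugging Face! For simple classification from text this is going to be really easy and pretty effective: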

https://huggingface.co/bert-base-cased

The big question here is how you are going to fine-tune it. You’ll need some classification outcomes to attach to your samples. Because the traits aren’t mutually exclusive, you might want to make a few binary classifiers (yes/no for a specific trait). The link has some examples of fine-tuning too.
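To make the "yes/no per trait" idea concrete, here is a minimal sketch of how the labels could be prepared. The trait names, the 0-to-1 score scale, and the 0.5 threshold are all assumptions for illustration; real labels would come from whatever assessment results you manage to collect.

```python
# Hypothetical trait list and scoring scale (scores assumed to lie in [0, 1]).
TRAITS = [
    "openness",
    "conscientiousness",
    "extraversion",
    "agreeableness",
    "neuroticism",
]


def to_binary_labels(scores, threshold=0.5):
    """Turn per-trait scores into one yes/no (1/0) label per trait."""
    return {trait: int(scores[trait] >= threshold) for trait in TRAITS}


def to_multi_hot(scores, threshold=0.5):
    """Same labels as a fixed-order vector, one slot per trait."""
    labels = to_binary_labels(scores, threshold)
    return [labels[trait] for trait in TRAITS]


# Made-up scores for one author, purely for illustration.
sample = {
    "openness": 0.82,
    "conscientiousness": 0.40,
    "extraversion": 0.55,
    "agreeableness": 0.30,
    "neuroticism": 0.71,
}

print(to_binary_labels(sample))
print(to_multi_hot(sample))
```

Each dictionary entry could serve as the target for its own binary classifier. Alternatively, the five-element vector can be the target of a single multi-label model; Hugging Face sequence-classification models accept `problem_type="multi_label_classification"` for that setup, though whether one model or five works better here is something you would have to test.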

Hope this gets you off to a decent start!

sshh12@alien.top 1 point 2 years ago

One tricky part, as you mentioned, is definitely the dataset. In an ideal world, you'll have a supervised dataset of (document, personality type) pairs and you can train a model on these (just like u/Veggies-are-okay mentioned).

Assuming you don't have this data, here are a few options:

Make the data. Some quick Google searches suggest that many celebrities have known Big Five profiles. You could manually curate those profiles together with text written by the same celebrities to build these pairs.
Use synthetic data. Try asking an LLM (like ChatGPT) to write a text on a random topic as if the author had a randomly chosen Big Five profile, then use the results as your training pairs.
Try clustering. Similar personality types may have similar embeddings. Take a dataset of writings, embed them using something like BERT, label (or best-effort guess) a few, and then predict personalities for the rest based on their proximity in embedding space to a text with a known Big Five profile. You could extend this to training a model that asks "do text A and text B display the same Big Five profile?", which could be an easier problem to get samples for, and then run this model against a set of texts with known profiles and your unknown example.
Use a proxy. There might be datasets or models out there that predict signals which could be combined to estimate the Big Five. For example, a sentiment score might be correlated with agreeableness. You might also be able to create word/phrase banks such that using certain phrases is indicative of a leaning on one trait ("has_neurotic_phrases" is then a feature in your model).
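The proximity idea from the clustering option can be sketched roughly as follows. This is only an illustration: the 3-dimensional vectors are toy stand-ins for real text embeddings (for example from BERT), and the labels and function names are made up for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def predict_by_nearest(query, labeled):
    """Return (label, similarity) of the labeled embedding closest to query."""
    best_label, best_sim = None, -2.0  # cosine similarity is never below -1
    for embedding, label in labeled:
        sim = cosine_similarity(query, embedding)
        if sim > best_sim:
            best_label, best_sim = label, sim
    return best_label, best_sim


# Toy vectors standing in for embeddings of texts with known profiles.
labeled = [
    ([0.9, 0.1, 0.0], "high_extraversion"),
    ([0.1, 0.9, 0.1], "high_neuroticism"),
    ([0.0, 0.2, 0.9], "high_openness"),
]

# Toy vector standing in for the embedding of an unlabeled text.
query = [0.8, 0.2, 0.1]

label, sim = predict_by_nearest(query, labeled)
print(label, round(sim, 3))
```

With real embeddings you would probably look at several nearest neighbors rather than one, and check on held-out labeled texts whether proximity tracks personality at all rather than topic or genre.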

Veggies-are-okay@alien.top 1 point 2 years ago

Will back up the ChatGPT advice. It’s really amazing how much value LLMs can have in creating synthetic data.
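As a rough sketch of the synthetic-data route, the snippet below only builds the prompts and the labels that go with them; it does not call any LLM. The topic list, the low/high trait levels, and the prompt wording are all assumptions for illustration.

```python
import random

# Hypothetical trait list, trait levels, and topics.
TRAITS = [
    "openness",
    "conscientiousness",
    "extraversion",
    "agreeableness",
    "neuroticism",
]
LEVELS = ["low", "high"]
TOPICS = ["a weekend trip", "a disagreement at work", "learning a new skill"]


def random_profile(rng):
    """Pick a random low/high level for every trait."""
    return {trait: rng.choice(LEVELS) for trait in TRAITS}


def build_prompt(profile, topic):
    """Build the instruction that would be sent to an LLM."""
    description = ", ".join(f"{level} {trait}" for trait, level in profile.items())
    return (
        f"Write a short first-person text about {topic}. "
        f"Write it in the voice of someone with {description}. "
        "Do not mention personality traits explicitly."
    )


def make_requests(n, seed=0):
    """Create n (prompt, labels) records; the LLM reply becomes the document."""
    rng = random.Random(seed)  # seeded so the dataset plan is reproducible
    requests = []
    for _ in range(n):
        profile = random_profile(rng)
        topic = rng.choice(TOPICS)
        requests.append({"prompt": build_prompt(profile, topic), "labels": profile})
    return requests


for request in make_requests(2):
    print(request["labels"])
    print(request["prompt"])
```

Each prompt would be sent to the LLM of your choice, and the returned text paired with the stored labels. Keep in mind that these labels record what the model was asked to imitate, not a measured personality, so a small set of real assessed samples is still useful for checking the result.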

gettotea@alien.top 1 point 2 years ago

Really great ideas. The synthetic data one is worth going deeper into.