Have you already tried using pre-trained models?
First off, it’s kind of a funny task, given that people like to complain these days that some treat politics like a sport.
What data do you have access to? Just the tweet text, or is there other metadata like username, timestamp, bio, profile picture, etc.?
I added sample data to the post body; it's basically this:
Data fields
- TweetId - an anonymous id unique to a given tweet
- Label - the associated label which is either Sports or Politics
- TweetText - the text in a tweet
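With those three fields the data loads as a plain CSV. A minimal sketch of reading it, using a hypothetical two-row sample (the actual competition file and its contents aren't shown here):

```python
import csv
import io

# Hypothetical in-memory sample matching the fields above:
# TweetId, Label, TweetText.
sample = io.StringIO(
    "TweetId,Label,TweetText\n"
    "1,Sports,Great goal in the final minute!\n"
    "2,Politics,The senate votes on the bill tomorrow.\n"
)

rows = list(csv.DictReader(sample))
texts = [r["TweetText"] for r in rows]   # model input
labels = [r["Label"] for r in rows]      # target: Sports or Politics
print(labels)
```

For a real file you'd pass a path instead of the `StringIO` object; the column names are the only assumption.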
Who set the score of 0.97 as a goal? Are you sure it is attainable given the data? Have others posted kernels/notebooks that reach these scores? In most cases it's not a lack of modelling on your side but rather a lack of signal in the data.
That's easy: model stacking/ensembling. Just about every winning Kaggle team uses it.
Right now you have only one model, built with a single technique.
There are several other classic approaches to NLP classification tasks: Naive Bayes, SVMs, CBOW, etc.
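To make that concrete, here's a sketch of two of those classic baselines (Naive Bayes and a linear SVM) on TF-IDF features, assuming scikit-learn and a tiny made-up corpus standing in for the real tweets:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-in corpus (hypothetical examples, not the competition data).
texts = [
    "the striker scored a hat trick",
    "the senate passed the budget bill",
    "coach benched the goalkeeper",
    "the president signed the new law",
]
labels = ["Sports", "Politics", "Sports", "Politics"]

# Same TF-IDF features, two different classifiers.
for clf in (MultinomialNB(), LinearSVC()):
    model = make_pipeline(TfidfVectorizer(), clf)
    model.fit(texts, labels)
    print(type(clf).__name__, model.predict(["the goalkeeper made a save"]))
```

Each of these trains in seconds on tweet-sized text, so trying several is cheap.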
The idea behind model stacking: train several models, each using a different method, then train a meta-model that takes each individual model's output as its features.
This often improves your score noticeably; it's a big part of how people win Kaggle competitions.
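The recipe above maps directly onto scikit-learn's `StackingClassifier`. A minimal sketch, again on a tiny hypothetical corpus (the real training set would go in its place):

```python
from sklearn.ensemble import StackingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny hypothetical corpus; swap in the competition data here.
texts = [
    "home team wins the championship game",
    "parliament debates the tax reform",
    "record transfer fee for the striker",
    "the minister resigned after the vote",
    "fans celebrate the playoff victory",
    "new sanctions announced by the government",
]
labels = ["Sports", "Politics", "Sports", "Politics", "Sports", "Politics"]

# Each base model gets its own text pipeline; the meta-model
# (logistic regression) is trained on their cross-validated outputs.
stack = StackingClassifier(
    estimators=[
        ("nb", make_pipeline(TfidfVectorizer(), MultinomialNB())),
        ("svm", make_pipeline(TfidfVectorizer(), LinearSVC())),
    ],
    final_estimator=LogisticRegression(),
    cv=3,  # small cv only because the toy corpus is tiny
)
stack.fit(texts, labels)
print(stack.predict(["the striker scored again"]))
```

In a real run you'd add more diverse base models (the whole point is that their errors differ) and pick `cv` based on the dataset size.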
Have you tried training on additional data? There's a lot of sports and politics text out there. If the competition isn't already built on the 20 Newsgroups dataset, it's worth checking out as extra training data.
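A sketch of pulling the relevant 20 Newsgroups categories via scikit-learn and mapping them onto the two competition labels. The category choice and the label mapping are my assumptions; the loader downloads and caches the data on first use, so it's wrapped in a function here:

```python
from sklearn.datasets import fetch_20newsgroups

# 20 Newsgroups categories that roughly match the two tweet labels
# (my choice; adjust to taste).
SPORTS = ["rec.sport.baseball", "rec.sport.hockey"]
POLITICS = ["talk.politics.guns", "talk.politics.mideast", "talk.politics.misc"]

def load_extra_training_text():
    """Fetch matching newsgroup posts as extra (texts, labels) training data.

    Strips headers/footers/quotes so a model learns from the body text
    rather than newsgroup-specific metadata. Downloads on first call.
    """
    train = fetch_20newsgroups(
        subset="train",
        categories=SPORTS + POLITICS,
        remove=("headers", "footers", "quotes"),
    )
    texts = train.data
    # Map each newsgroup name back to the competition's two classes.
    labels = [
        "Sports" if train.target_names[t] in SPORTS else "Politics"
        for t in train.target
    ]
    return texts, labels
```

One caveat: newsgroup posts are much longer and older than tweets, so treat this as pre-training or augmentation data and validate on the competition's own tweets.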