This post documents an application of the OSEMN data science process, along with some basic data science techniques (e.g. NLP and machine learning), to build a clickbait classifier in Python. I hope it serves as a good example for those who are starting to learn data science. Here is the repository link.
The objective of this project was to identify clickbait and non-clickbait articles based on their headline text, using Natural Language Processing techniques and various Machine Learning algorithms.
But first off, what is clickbait?
Wikipedia defines clickbait as: a text or a thumbnail link that is designed to attract attention and to entice users to follow that link and read, view, or listen to the linked piece of online content, with a defining characteristic of being deceptive, typically sensationalized or misleading.
These articles are not only annoying, but can also lead users to unwanted websites, expose their devices to viruses, and deliver inaccurate information.
The data science approach used in this project was the OSEMN model, summarized nicely in the figure above. For further reading, check out this link.
Step 1: Obtain
- clickbait headlines from ‘BuzzFeed’, ‘Upworthy’, ‘ViralNova’, ‘Thatscoop’, ‘Scoopwhoop’ and ‘ViralStories’ (16,000)
- non-clickbait headlines from ‘WikiNews’, ‘New York Times’, ‘The Guardian’, and ‘The Hindu’ (16,000)
Web scraping from clickbait websites (16,707)
- clickhole.com (4,526)
- worldtruth.tv (12,181)
APIs from major press companies (16,660)
- The New York Times (11,460)
- The Guardian (5,200)
Total of 65,367 headlines: 32,707 clickbait and 32,660 non-clickbait
Step 2: Scrub
- Checked for null values
- Tokenized the data
- Checked for encoding issues or strange words
- Removed stop words
- Separated numeric and non-numeric characters
- Lemmatized the list of non-numeric words
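The scrubbing steps above can be sketched with the standard library alone. This is a minimal illustration, not the project's actual code: the function name `scrub_headline`, the tiny `STOP_WORDS` set, and the optional `lemmatize` callable are all assumptions made for the example. In practice, a fuller stop-word list and a real lemmatizer (for instance from NLTK) would likely be plugged in.

```python
import re

# Tiny illustrative stop-word list (an assumption; a real project would
# typically use a fuller list, such as the English stop words shipped with NLTK).
STOP_WORDS = {"a", "an", "the", "to", "of", "in", "from", "your", "you", "is", "and"}


def scrub_headline(headline, lemmatize=None):
    """Return (words, numbers) extracted from one headline.

    `lemmatize` is an optional callable mapping a word to its lemma;
    when it is omitted, words are returned unchanged.
    """
    # Null check: treat a missing headline as empty.
    if headline is None:
        return [], []
    # Lowercase, then tokenize into words (optionally with an apostrophe) or digit runs.
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?|\d+", headline.lower())
    # Separate numeric tokens from non-numeric ones, dropping stop words.
    numbers = [t for t in tokens if t.isdigit()]
    words = [t for t in tokens if not t.isdigit() and t not in STOP_WORDS]
    if lemmatize is not None:
        words = [lemmatize(w) for w in words]
    return words, numbers


words, numbers = scrub_headline("32 Cute Things To Distract From Your Awkward Thanksgiving")
print(words)    # ['cute', 'things', 'distract', 'awkward', 'thanksgiving']
print(numbers)  # ['32']
```

Keeping the numbers in their own list makes it easy to later treat "contains a number" as a separate signal from the vocabulary.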
Step 3: Explore
Word clouds were generated from the lemmatized lists of words (clickbait and non-clickbait, respectively)
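A word cloud sizes each word by how often it occurs, so the underlying data is just a frequency table. The sketch below builds one with `collections.Counter`. The token lists and the helper name `word_frequencies` are hypothetical, and the post does not say which plotting library was used.

```python
from collections import Counter

# Hypothetical scrubbed token lists, one list per clickbait headline.
clickbait_tokens = [
    ["cute", "thing", "distract", "awkward", "thanksgiving"],
    ["weird", "thing", "butterfree"],
    ["coffee", "make", "poop"],
]


def word_frequencies(token_lists):
    """Count how often each word appears across all headlines."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(tokens)
    return counts


freqs = word_frequencies(clickbait_tokens)
print(freqs.most_common(1))  # [('thing', 2)]
```

A frequency dictionary like `freqs` is the kind of input that word cloud tools accept. The `wordcloud` package, for example, offers `WordCloud().generate_from_frequencies(...)`.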
Step 4: Model
- Train-test split: 75:25
- The F1-score was chosen as the representative metric for model evaluation because, as the harmonic mean of precision and recall, it is high only when both are high, so it penalizes false positives and false negatives alike.
- Train F1-score: 99.8939%
- Test F1-score: 90.0270%
- Train F1-score: 99.9959%
- Test F1-score: 86.3288%
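For reference, the F1-score reported throughout this step can be computed by hand from a model's true positives, false positives, and false negatives. The function name and the counts below are made up purely to illustrate the arithmetic.

```python
def f1_from_counts(tp, fp, fn):
    """F1 for the positive class: harmonic mean of precision and recall."""
    if tp == 0:
        return 0.0  # no correct positive predictions, so F1 is zero
    precision = tp / (tp + fp)  # share of predicted clickbait that really is clickbait
    recall = tp / (tp + fn)     # share of real clickbait that was caught
    return 2 * precision * recall / (precision + recall)


# Made-up counts: 90 clickbait headlines caught, 10 false alarms, 30 missed.
score = f1_from_counts(tp=90, fp=10, fn=30)
print(round(score, 4))  # 0.8182
```

Here precision is 0.90 and recall is 0.75. The harmonic mean lands closer to the weaker of the two than a plain average (0.825) would.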
SVM (Support Vector Machines):
- Train F1-score: 99.9980%
- Test F1-score: 90.2046%
- Train F1-score: 99.9980%
- Test F1-score: 90.1803%
KNN (K-Nearest Neighbors):
- Train F1-score: 91.1718%
- Test F1-score: 84.7322%
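The modelling workflow in this step (75:25 split, fit, compare train and test F1-scores) can be sketched with scikit-learn as follows. Several things here are assumptions rather than details from the post: the TF-IDF vectorizer, the use of `LinearSVC` as the SVM, the random seed, and the toy data, which mixes headlines quoted in this post with made-up ones.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-in data; the project itself used about 65,000 headlines.
clickbait = [
    "32 Cute Things To Distract From Your Awkward Thanksgiving",
    "Does Coffee Make You Poop",
    "Here's One Really Weird Thing About Butterfree",
    "10 Things You Will Not Believe About Cats",
    "This One Trick Will Change Your Morning",
    "21 Photos That Will Make You Smile",
    "You Will Never Guess What Happened Next",
    "17 Weird Things Only Coffee Lovers Understand",
]
news = [
    "Albanian girl murdered in tangle of crime",
    "Fed Calls Gain in Household Wealth a Mirage",
    "G8 leaders announce $50 billion aid increase",
    "Talks on trade and climate change continue",
    "Parliament votes on new economic policy",
    "Central bank holds interest rates steady",
    "Court rules on election finance dispute",
    "Ministers meet to discuss border security",
]
headlines = clickbait + news
labels = [1] * len(clickbait) + [0] * len(news)  # 1 = clickbait, 0 = non-clickbait

# 75:25 split, stratified so both classes appear in the test set.
X_train, X_test, y_train, y_test = train_test_split(
    headlines, labels, test_size=0.25, stratify=labels, random_state=42
)

# Vectorize the text and fit a linear SVM in a single pipeline.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(X_train, y_train)

train_f1 = f1_score(y_train, model.predict(X_train))
test_f1 = f1_score(y_test, model.predict(X_test))
print(f"Train F1-score: {train_f1:.4f}")
print(f"Test F1-score: {test_f1:.4f}")
```

Reporting both scores, as the results above do, makes overfitting visible: a train score near 100% next to a noticeably lower test score suggests the model has partly memorized the training headlines.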
Step 5: Interpret
Below is the feature importances graph, generated from the Random Forest model:
The main objective of clickbait headlines is to grab people’s attention. Frequently used forms of text include:
- numerical values (e.g. 32 Cute Things To Distract From Your Awkward Thanksgiving)
- informal language (e.g. Does Coffee Make You Poop)
- vague language to trigger curiosity (e.g. Here’s One Really Weird Thing About Butterfree)
Non-clickbait headlines also try to grab people’s attention, but are still primarily focused on delivering news information. Their characteristics include:
- specific language that accurately delivers the main message (e.g. Albanian girl murdered in tangle of crime)
- formal language and occasional use of jargon (e.g. Blair: G8 leaders announce $50 billion aid increase; talks on trade and climate change)
- a heavy focus on safety, economic, and political issues (e.g. Fed Calls Gain in Household Wealth a Mirage)
Conclusion and Future Work
It is possible to accurately predict whether an article is clickbait just by looking at its headline text, with test F1-scores of up to about 90%, because clickbait and non-clickbait headlines differ noticeably.
Model performance may be improved with:
- Additional data
- Further feature engineering (e.g. length of headline, number of capitalized letters, whether the headline starts with a number)
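The three hand-crafted features suggested above are cheap to compute. Here is a minimal sketch; the function name `headline_features` and the dictionary keys are hypothetical, and counting length in characters rather than words is an assumption.

```python
def headline_features(headline):
    """Compute simple surface features from the raw (unscrubbed) headline."""
    text = headline.strip()
    return {
        # Length in characters (word count would be a reasonable alternative).
        "length": len(text),
        # Number of uppercase letters anywhere in the headline.
        "n_capital_letters": sum(1 for ch in text if ch.isupper()),
        # True when the first character is a digit, as in listicle titles.
        "starts_with_number": text[:1].isdigit(),
    }


example = "32 Cute Things To Distract From Your Awkward Thanksgiving"
features = headline_features(example)
print(features)
```

These features have to be computed on the raw headline, before lowercasing and stop-word removal, since scrubbing discards capitalization and word order.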