We are building an algorithm that automatically flags incoming news stories as “fake” or “real” based on the story’s information, such as its title, contents, authors, etc. The dataset can be found at https://www.macalester.edu/~ajohns24/data/buzzfeed.csv. This dataset includes the following information on 182 articles from 2016: title, text (content), authors, source, URL address, and type (whether the article is real or fake).
In our algorithm, we made 15 predictors from the 5 columns in the dataset to predict the type (fake/real) of the article. Note that our selection of predictors is limited by the availability and quality of the dataset. Comparing fake news and real news in the dataset, we assumed the following characteristics may differ:
Article length - number of words
Crazy punctuation - use of exclamation marks in the text
Overall score - average negativity/positivity, between -5 and 5
Our function finds words with angry/sad/happy connotations in the article. We calculate the percentage of words with these sentiments among the total number of words in the article. Similarly, our function scores words with a positive/negative tone and measures the overall negativity/positivity of the article as the average AFINN score of its words, which falls between -5 and 5. More information can be found at https://www.tidytextmining.com/sentiment.html.
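As a small illustration of how these lexicons behave, the sketch below scores a single made-up sentence with the syuzhet package (illustrative only; it is not part of our analysis).
library(syuzhet)
example <- "This shocking scandal is a terrible, terrible disaster!"
get_nrc_sentiment(example) # counts of anger, joy, sadness, ... detected in the sentence
get_sentiment(get_tokens(example), method = "afinn") # per-word AFINN scores between -5 and 5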
To demonstrate and summarize the definitions of our new predictors, we chose one real sample article and one fake sample article.
Fake sample article - http://100percentfedup.com/new-disturbing-video-shows-hillarys-campaign-likely-faked-audience-nc-rally/
Title length - [WHOA! NEW DISTURBING VIDEO Shows HILLARY’S Campaign Likely FAKED Her Audience At NC Rally * 100percentfedUp.com] We expect that fake articles will have long, attention-grabbing titles.
Article Length - We expect that a fake article will be short relative to a real article.
Crazy Punctuation (title and article) - We expect that a fake article may use a lot of punctuation, such as exclamation points and question marks, to draw emphasis.
Sentiment Analysis (title and article) - We expect that a fake article will have higher percentages of angry and sad words and a smaller percentage of happy words. We expect the overall scores of fake articles to be negative.
URL Length - We expect that a fake article will have a longer URL relative to real articles.
Author or No? - We expect that a fake article will have fewer real authors than a real article.
Real sample article - http://abcn.ws/2cTj7ap
Title length - [Young Girl’s Emotional Council Speech Laments ‘Shame’ of Fatal Charlotte Shooting]. We expect there to be variation in title length among the real articles.
Article Length - We expect that a real article will be longer relative to the fake articles.
Crazy Punctuation (title and article) - We expect that real articles will use less attention-grabbing punctuation, such as exclamation points, relative to the fake articles.
Sentiment Analysis (title and article) - We expect that real articles will have lower percentages of angry and sad words. We do not know whether the percentages of happy words will differ much from fake articles. We expect that a real article will have a positive overall score.
URL Length - We expect that a real article will have a shorter URL relative to fake articles.
Author or No? - We expect that real articles will list more authors relative to fake articles.
Our estimation is largely limited by the availability and quality of the dataset we are using. It is not plausible to accurately detect fake news with a model based solely on 5 variables about 182 articles. As mentioned above, we created 15 predictor variables based on the information provided in the dataset, so some of the variables are highly correlated or redundant. Text analysis is also not a perfect measure of fakeness since it only reads the article as a group of separate words and thus completely ignores context.
all_data <- read.csv("https://www.macalester.edu/~ajohns24/data/buzzfeed.csv")
library(tidytext) # for text mining helpers
library(dplyr) # for data wrangling
library(syuzhet) # for sentiment lexicons (NRC, AFINN)
library(ggplot2) # for visualization
library(caret) # for machine learning
First, we made a series of functions to create the predictor variables in our model from the dataset. The sentiment functions read a text, split it into words, and compute the share of words with angry/happy/sad connotations as well as an overall positivity/negativity score. The other functions measure the length of a text, compute the proportion of fully capitalized words, and detect crazy punctuation by counting the words that contain an exclamation mark.
# Share of words in the string tagged as "anger" in the NRC lexicon
getAnger <- function(string) {
  stringArr <- strsplit(tolower(string), "\\s+")[[1]]
  get_nrc_sentiment(string)$anger / length(stringArr)
}
# Share of words tagged as "joy"
getJoy <- function(string) {
  stringArr <- strsplit(tolower(string), "\\s+")[[1]]
  get_nrc_sentiment(string)$joy / length(stringArr)
}
# Share of words tagged as "sadness"
getSadness <- function(string) {
  stringArr <- strsplit(tolower(string), "\\s+")[[1]]
  get_nrc_sentiment(string)$sadness / length(stringArr)
}
# Mean AFINN sentiment score per word (each word is scored between -5 and 5)
getNegPos <- function(string) {
  poa_word_v <- get_tokens(tolower(string), pattern = "\\W")
  syuzhet_vector <- get_sentiment(poa_word_v, method = "afinn")
  mean(syuzhet_vector)
}
# Proportion of fully capitalized words (ignoring one-letter words and words with digits)
allCaps <- function(string) {
  stringArr <- strsplit(string, "\\s+")[[1]]
  stringTab <- data.frame(words = stringArr, stringsAsFactors = FALSE)
  capsWords <- stringTab %>%
    filter(words == toupper(words)) %>%
    filter(nchar(words) > 1) %>%
    filter(!(stringr::str_detect(words, "\\d")))
  dim(capsWords)[1] / dim(stringTab)[1]
}
# Number of whitespace-separated words
lenWords <- function(string) {
  length(strsplit(string, "\\s+")[[1]])
}
# Number of words containing an exclamation mark
crazyPunc <- function(string) {
  stringArr <- strsplit(string, "\\s+")[[1]]
  stringTab <- data.frame(words = stringArr, stringsAsFactors = FALSE)
  exclamation <- stringTab %>% filter(stringr::str_detect(words, "!"))
  dim(exclamation)[1]
}
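As a quick sanity check (illustrative only; the sentence is made up and this call is not part of the original analysis), the helper functions can be applied to a single string:
sample_text <- "BREAKING!! This is an absolutely TERRIBLE, sad disaster!"
lenWords(sample_text) # number of whitespace-separated words
crazyPunc(sample_text) # number of words containing an exclamation mark
allCaps(sample_text) # share of all-caps words such as "BREAKING!!"
getAnger(sample_text) # share of words tagged "anger" in the NRC lexicon
getNegPos(sample_text) # mean AFINN score per word (negative here, since the sentence has only negative words)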
Next, using these functions, we made variables from the dataset. The processed dataset, \(new_data\), contains 21 variables, 15 of which are numerical values created by this process.
# The title column was imported with a garbled name (see the note below),
# so we copy the first column into a clean "title" column
all_data <- all_data %>%
  mutate(title = all_data[[1]])
# Overall sentiment scores for the title and text
new_data <- all_data %>%
  mutate(titleSent = Vectorize(getNegPos)(title)) %>%
  mutate(textSent = Vectorize(getNegPos)(text))
# Convert factor columns to character so the text helpers can handle them
new_data <- new_data %>% mutate_if(is.factor, as.character)
# Shares of happy/sad/angry words (Vectorize applies the helpers row by row)
new_data <- new_data %>%
  mutate(titleHappy = Vectorize(getJoy)(title)) %>%
  mutate(titleSad = Vectorize(getSadness)(title)) %>%
  mutate(titleAngry = Vectorize(getAnger)(title))
new_data <- new_data %>%
  mutate(textHappy = Vectorize(getJoy)(text)) %>%
  mutate(textSad = Vectorize(getSadness)(text)) %>%
  mutate(textAngry = Vectorize(getAnger)(text))
# Proportion of all-caps words, lengths (words for title/text, characters for the URL), and crazy punctuation counts
new_data <- new_data %>%
  mutate(capsTitle = Vectorize(allCaps)(title)) %>%
  mutate(capsText = Vectorize(allCaps)(text))
new_data <- new_data %>%
  mutate(lenTitle = Vectorize(lenWords)(title)) %>%
  mutate(lenText = Vectorize(lenWords)(text))
new_data <- new_data %>% mutate(lenUrl = nchar(url))
new_data <- new_data %>%
  mutate(crazyTitle = Vectorize(crazyPunc)(title)) %>%
  mutate(crazyText = Vectorize(crazyPunc)(text))
Here is the list of variables in \(new_data\). titleSent and textSent are positive if the title/text contains more positive words and negative if it contains more negative words. titleHappy/titleSad/titleAngry, textHappy/textSad/textAngry, capsTitle, and capsText all represent proportions of such words, so they are nonnegative numbers. lenTitle and lenText count words, lenUrl counts characters, and crazyTitle and crazyText count exclamation marks, so all of these are nonnegative integers.
** The article’s title in the dataset was oddly imported as “?..title”. We made a new variable “title” from this column to use it in the functions.
names(new_data)
## [1] "title" "text" "url" "authors" "source"
## [6] "type" "titleSent" "textSent" "titleHappy" "titleSad"
## [11] "titleAngry" "textHappy" "textSad" "textAngry" "capsTitle"
## [16] "capsText" "lenTitle" "lenText" "lenUrl" "crazyTitle"
## [21] "crazyText"
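A quick way to eyeball the ranges described above is to summarize the constructed columns; this is an optional check, and its output is omitted here:
summary(new_data %>% select(titleSent:crazyText)) # ranges of the 15 constructed predictors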
As the first part of our analysis, we need to select the variables to include in the model. To do this, we used LASSO. Although we only used LASSO for numerical outcome variables in class, we can apply it to a binary outcome by fitting a penalized logistic regression, which models the log odds of the type variable. The argument family=“binomial” in the model does this for us.
set.seed(33)
lambda_grid <- seq(0, 1, length = 100)
# Perform LASSO
lasso_model <- train(
as.factor(type) ~ titleSent + textSent + titleHappy + titleSad + titleAngry + textHappy + textSad + textAngry + capsTitle + capsText + lenTitle + lenText + lenUrl + crazyTitle + crazyText,
data = new_data,
method = "glmnet",
family = "binomial",
tuneGrid = data.frame(alpha = 1, lambda = lambda_grid),
trControl = trainControl(method = "cv", number = 10, selectionFunction = "best"),
metric = "Accuracy",
na.action = na.omit
)
coef(lasso_model$finalModel, 0.001)
## 16 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 1.831747e+00
## titleSent .
## textSent 7.362394e+00
## titleHappy .
## titleSad .
## titleAngry .
## textHappy 3.260594e+01
## textSad -7.418562e+00
## textAngry 1.185829e+01
## capsTitle .
## capsText -1.139635e+01
## lenTitle .
## lenText 1.558999e-04
## lenUrl -2.676775e-02
## crazyTitle .
## crazyText -4.838821e-01
coef(lasso_model$finalModel, 0.09)
## 16 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 1.07915012
## titleSent .
## textSent .
## titleHappy .
## titleSad .
## titleAngry .
## textHappy .
## textSad .
## textAngry .
## capsTitle .
## capsText .
## lenTitle .
## lenText .
## lenUrl -0.01546806
## crazyTitle .
## crazyText .
coef(lasso_model$finalModel, lasso_model$bestTune$lambda) #best
## 16 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 1.80729688
## titleSent .
## textSent 5.33163067
## titleHappy .
## titleSad .
## titleAngry .
## textHappy 7.52533357
## textSad .
## textAngry .
## capsTitle .
## capsText -4.99610466
## lenTitle .
## lenText .
## lenUrl -0.02331884
## crazyTitle .
## crazyText -0.28125098
lasso_model$bestTune$lambda
## [1] 0.02020202
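To make the penalized coefficients at this lambda easier to read, they can be exponentiated into odds ratios; this is an optional sketch, not part of the original output:
# exp(beta) is the multiplicative change in the odds of "real" for a one-unit
# increase in the corresponding predictor
exp(as.matrix(coef(lasso_model$finalModel, lasso_model$bestTune$lambda)))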
Our final model includes the following 5 variables: textSent, textHappy, capsText, lenUrl, and crazyText.
Since our model is a logistic regression, it predicts the log odds of the article being real news. (This is not stated in the output table, but fake is treated as 0 and real as 1.) A positive coefficient means a higher likelihood of the article being real news; a negative coefficient means a higher likelihood of fake news. For example, a higher overall text sentiment score is associated with a higher likelihood of real news, while a longer URL and more crazy punctuation in the text are associated with a higher likelihood of fake news. We can confirm these relationships from the mean statistics by type as well: in the table below, fake articles have a lower text sentiment score, longer URLs, and a higher crazyText score.
new_data %>%
group_by(type) %>%
summarize_all(funs(mean))
## # A tibble: 2 x 21
## type title text url authors source titleSent textSent titleHappy
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 fake NA NA NA NA NA 0 -0.0198 0
## 2 real NA NA NA NA NA 0 -0.00419 0
## # … with 12 more variables: titleSad <dbl>, titleAngry <dbl>,
## # textHappy <dbl>, textSad <dbl>, textAngry <dbl>, capsTitle <dbl>,
## # capsText <dbl>, lenTitle <dbl>, lenText <dbl>, lenUrl <dbl>,
## # crazyTitle <dbl>, crazyText <dbl>
levels(as.factor(new_data$type))
## [1] "fake" "real"
We also performed LASSO using \(oneSE\) as the selection function, but the result kept only one variable, lenUrl, as a predictor, so we chose to use best instead.
lasso_model_se <- train(
as.factor(type) ~ titleSent + textSent + titleHappy + titleSad + titleAngry + textHappy + textSad + textAngry + capsTitle + capsText + lenTitle + lenText + lenUrl + crazyTitle + crazyText,
data = new_data,
method = "glmnet",
family = "binomial",
tuneGrid = data.frame(alpha = 1, lambda = lambda_grid),
trControl = trainControl(method = "cv", number = 10, selectionFunction = "oneSE"),
metric = "Accuracy",
na.action = na.omit
)
coef(lasso_model_se$finalModel, lasso_model_se$bestTune$lambda) #oneSE
## 16 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 0.133655066
## titleSent .
## textSent .
## titleHappy .
## titleSad .
## titleAngry .
## textHappy .
## textSad .
## textAngry .
## capsTitle .
## capsText .
## lenTitle .
## lenText .
## lenUrl -0.001914513
## crazyTitle .
## crazyText .
Here are the plots of the most important predictors in the model and the Accuracy stats for different values of lambda. They tell us that textSent and capsText are the most important predictors.
# Plot coefficients for each LASSO
plot(lasso_model$finalModel, xvar = "lambda", label = TRUE, col = rainbow(20))
# Codebook for which variables the numbers correspond to
rownames(lasso_model$finalModel$beta)
## [1] "titleSent" "textSent" "titleHappy" "titleSad" "titleAngry"
## [6] "textHappy" "textSad" "textAngry" "capsTitle" "capsText"
## [11] "lenTitle" "lenText" "lenUrl" "crazyTitle" "crazyText"
lasso_model$bestTune
## alpha lambda
## 3 1 0.02020202
plot(lasso_model)
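As an optional cross-check on these plots (not part of the original output), caret's varImp() ranks the predictors, which for glmnet is essentially by the magnitude of their coefficients at the selected lambda:
varImp(lasso_model) # variable importance ranking for the glmnet fit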
# Perform logistic regression
logistic_model_1 <- train(
as.factor(type) ~ textSent + textHappy + capsText + lenUrl + crazyText,
data = new_data,
method = "glm",
family = "binomial",
trControl = trainControl(method = "cv", number = 10),
metric = "Accuracy",
na.action = na.omit
)
logistic_model_2 <- train(
as.factor(type) ~ textSent + textHappy + capsText + lenUrl + textSad + textAngry + lenText + crazyText,
data = new_data,
method = "glm",
family = "binomial",
trControl = trainControl(method = "cv", number = 10),
metric = "Accuracy",
na.action = na.omit
)
Here we compare our final model with another logistic regression model that includes all the predictors except those made from the title. We can see that our model has a higher CV accuracy, indicating that we should choose model 1 over model 2, which seems to be overfitting.
logistic_model_1$results
## parameter Accuracy Kappa AccuracySD KappaSD
## 1 none 0.7788889 0.5577778 0.0884945 0.176989
logistic_model_2$results
## parameter Accuracy Kappa AccuracySD KappaSD
## 1 none 0.7634503 0.5266562 0.03894857 0.0777487
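caret's resamples() utility offers another side-by-side view of the fold-level accuracies; a possible check (not run in the original analysis), assuming both models were fit with 10-fold CV as above:
comparison <- resamples(list(model_1 = logistic_model_1, model_2 = logistic_model_2))
summary(comparison)$statistics$Accuracy # fold-level accuracy summaries for both models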
# Model summary table
summary(logistic_model_1)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1108 -0.8072 0.1333 0.7168 2.3612
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.017963 0.470666 4.287 1.81e-05 ***
## textSent 6.561786 4.604596 1.425 0.1541
## textHappy 34.789023 34.001048 1.023 0.3062
## capsText -12.404935 8.254485 -1.503 0.1329
## lenUrl -0.027406 0.004599 -5.959 2.53e-09 ***
## crazyText -0.478182 0.275971 -1.733 0.0831 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 252.31 on 181 degrees of freedom
## Residual deviance: 188.27 on 176 degrees of freedom
## AIC: 200.27
##
## Number of Fisher Scoring iterations: 5
# Coefficients
coef(logistic_model_1$finalModel)
## (Intercept) textSent textHappy capsText lenUrl
## 2.01796328 6.56178597 34.78902281 -12.40493460 -0.02740631
## crazyText
## -0.47818218
# CV accuracy metrics
logistic_model_1$results
## parameter Accuracy Kappa AccuracySD KappaSD
## 1 none 0.7788889 0.5577778 0.0884945 0.176989
logistic_model_1$resample
## Accuracy Kappa Resample
## 1 0.6666667 0.3333333 Fold01
## 2 0.8333333 0.6666667 Fold02
## 3 0.7222222 0.4444444 Fold03
## 4 0.9000000 0.8000000 Fold04
## 5 0.7777778 0.5555556 Fold05
## 6 0.9444444 0.8888889 Fold06
## 7 0.7222222 0.4444444 Fold07
## 8 0.7222222 0.4444444 Fold08
## 9 0.7222222 0.4444444 Fold09
## 10 0.7777778 0.5555556 Fold10
# In-sample classifications from the final model on the complete-case training data
predict_data <- na.omit(logistic_model_1$trainingData)
classifications <- predict(logistic_model_1, newdata = predict_data, type = "raw")
We use a confusion matrix to describe the performance of our final classification model (the logistic regression fit on the LASSO-selected variables). The overall accuracy of our model is 0.786, meaning 78.6% of the articles are classified correctly. The sensitivity measures the fraction of positive cases (here, fake articles) that are correctly classified; the sensitivity of our model is 0.824, or 82.4%. The specificity measures the fraction of negative cases (real articles) that are correctly classified; the specificity of our model is 0.747, or 74.7%.
confusionMatrix(
data = classifications,
reference = as.factor(predict_data$.outcome),
positive = "fake"
)
## Confusion Matrix and Statistics
##
## Reference
## Prediction fake real
## fake 75 23
## real 16 68
##
## Accuracy : 0.7857
## 95% CI : (0.7189, 0.843)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : 1.875e-15
##
## Kappa : 0.5714
##
## Mcnemar's Test P-Value : 0.3367
##
## Sensitivity : 0.8242
## Specificity : 0.7473
## Pos Pred Value : 0.7653
## Neg Pred Value : 0.8095
## Prevalence : 0.5000
## Detection Rate : 0.4121
## Detection Prevalence : 0.5385
## Balanced Accuracy : 0.7857
##
## 'Positive' Class : fake
##
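These headline metrics can be verified by hand from the counts in the table above; the arithmetic below is just a check, with “fake” as the positive class:
TP <- 75; FN <- 16; FP <- 23; TN <- 68 # counts read off the printed confusion matrix
(TP + TN) / (TP + TN + FP + FN) # accuracy, about 0.786
TP / (TP + FN) # sensitivity, about 0.824
TN / (TN + FP) # specificity, about 0.747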
An ROC curve plots the sensitivity (y-axis) against 1 - specificity (x-axis) for every possible probability cut-off between 0 and 1. This gives a fuller picture than the single confusion matrix above. To describe the performance of our model, we examine the area under the curve (AUC), which estimates the probability that the classifier assigns a higher predicted probability of being real to a randomly chosen real article (y = 1) than to a randomly chosen fake article (y = 0). The AUC for our model is 0.807.
library(pROC)
# model 1 = black
predict_data_1 <- na.omit(logistic_model_1$trainingData)
roc(response = predict_data_1$.outcome, pred = fitted(logistic_model_1), plot = TRUE, legacy.axes = TRUE)
##
## Call:
## roc.default(response = predict_data_1$.outcome, predictor = fitted(logistic_model_1), plot = TRUE, legacy.axes = TRUE)
##
## Data: fitted(logistic_model_1) in 91 controls (predict_data_1$.outcome fake) < 91 cases (predict_data_1$.outcome real).
## Area under the curve: 0.8069
We built an algorithm that automatically flags incoming news stories as “fake” or “real” based on the following information: title, contents, authors, source, and the URL of the article. Using 182 articles from 2016 labeled “fake” or “real”, we first made 15 predictor variables from this information to build the model. In this process, we used functions that read the texts and detect the sentiment of the words in them. Next, we used LASSO to select the variables to include in the model. Here are the coefficients of the selected variables in our final logistic regression model, which was fit on the LASSO-selected predictors.
textSent 6.561786
textHappy 34.789023
capsText -12.404935
lenUrl -0.027406
crazyText -0.478182
We found that our predictor variables for text sentiment (textSent) and for the percentage of happy words in the article (textHappy) both had positive coefficients. This means that real articles had more positive words, based on our sentiment score, and higher percentages of happy words than fake articles did. On the contrary, we found that our predictor variable for the proportion of capitalized words in the text (capsText) had a negative sign. This means that fake articles used more capitalization than real articles. Additionally, we found that our predictor variable for the length of the URL (lenUrl) was negative. This indicates that real articles typically had shorter URLs than fake articles. Lastly, we found that our predictor variable for the number of exclamation points used (crazyText) was negative. This indicates that real articles had fewer exclamation points than fake articles did.
All of these coefficients agreed with the model and with what we initially predicted. Starting with the word sentiments, we expected the fake articles to be more negative in nature than the real articles, which is consistent with our coefficients for happy and positive sentiment. One potential drawback of our model is that these two predictors are correlated. We decided to include both of them since their values are not necessarily similar for the same article: the overall sentiment score can be negative, while the happy sentiment is the percentage of happy words in the text and is therefore always zero or positive. Additionally, we expected the fake articles to contain more exclamation points and capital letters to catch the reader’s attention and make their messages seem more believable. This is also reflected in the coefficients for capsText and crazyText.
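A quick check one could run (not part of the original analysis) to see how strongly these two sentiment-based predictors move together:
cor(new_data$textSent, new_data$textHappy, use = "complete.obs") # correlation between the two sentiment predictors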
In order to evaluate our model, we then compared it with another model including all the predictors except those made from the title. We excluded the title-based predictors because most of their values were zero for the majority of the articles, due to the limited amount of information we can extract from titles. Our results showed that the accuracy of our model was almost the same as or slightly higher than that of the alternative model. Since our model includes fewer variables, we concluded that it is better suited for the analysis. In spite of the limited information provided in the dataset, our final model performed better than we originally expected, with 78.6% overall accuracy, 82.4% sensitivity, and 74.7% specificity.