Volume 17 Number 2
March 2020
Cite as: Harita Reddy, Namratha Raj, Manali Gala and Annappa Basava. Text-mining-based Fake News Detection Using Ensemble Methods. International Journal of Automation and Computing, vol. 17, no. 2, pp. 210-221, 2020. doi: 10.1007/s11633-019-1216-5

Text-mining-based Fake News Detection Using Ensemble Methods

Author Biography:
  • Harita Reddy received the B. Tech. degree in computer science and engineering from National Institute of Technology Karnataka, India in 2019. She is currently working as a software engineer at Uber, India. Her research interests include data mining, machine learning and social network analysis. E-mail: harita.nitk@gmail.com (Corresponding author) ORCID iD: 0000-0002-3314-7880

    Namratha Raj received the B. Tech. degree in computer science and engineering from National Institute of Technology Karnataka, India in 2019. Her research interests include data science, machine learning, natural language processing and bioinformatics. E-mail: namratha.mraj@gmail.com ORCID iD: 0000-0002-2114-1553

    Manali Gala received the B. Tech. degree in computer science and engineering from National Institute of Technology Karnataka, India in 2019. She is currently an analyst at Goldman Sachs, India. Her research interests include machine learning and data analysis. E-mail: manaligala7@gmail.com ORCID iD: 0000-0002-5982-9062

    Annappa Basava received the B. Eng. degree in computer science and engineering from the Govt. B.D.T. College of Engineering, Davangere, affiliated to Mysore University, India in 1991, and received the M. Tech. and Ph. D. degrees in computer science and engineering from National Institute of Technology Karnataka, India in 2003 and 2012, respectively. Currently, he is a professor in the Department of Computer Science and Engineering, National Institute of Technology Karnataka, India. He has published more than 100 research papers in international conferences and journals. He has more than 20 years of experience in teaching and research. He was the Organizing Chair of the International Conference on Advanced Computing 2013, and he serves on the Technical Program Committee of many international conferences and as a reviewer for journals. Currently, he is the Chair of the India Council of the IEEE Computer Society, and he was the Chair of the IEEE Mangalore Subsection during 2018. He was the Secretary of the IEI Mangaluru Local Centre. He is a Fellow of the Institution of Engineers (India) and a senior member of IEEE and ACM. Four research scholars completed their Ph. D. under his supervision and 7 scholars are currently enrolled for research under his supervision. His research interests include cloud computing, big data analytics, distributed computing, software engineering and process mining. E-mail: annappa@ieee.org ORCID iD: 0000-0002-4049-3677

  • Received: 2019-06-13
  • Accepted: 2019-12-11
  • Published Online: 2020-02-18



Abstract: Social media is a platform for expressing one's views and opinions freely and has made communication easier than ever before. However, it also creates an opportunity for people to spread fake news intentionally. The ease of access to a variety of news sources on the web likewise exposes people to fake news, which they may come to believe. This makes it important to detect and flag such content on social media. At the current rate of news generation on social media, it is difficult to differentiate between genuine news and hoaxes without knowing the source of the news. This paper discusses approaches to the detection of fake news using only the features of the text of the news, without using any other related metadata. We observe that a combination of stylometric features and text-based word vector representations through ensemble methods can predict fake news with an accuracy of up to 95.49%.

    • The ease of access to the world wide web (WWW) has made it possible for people in every corner of the world to get real-time global news. With the advent of social media, rapid dissemination of news is possible; content can be shared with friends or followers, and thus information diffusion takes place on social networks[1]. However, this ease of access to social media also leads to the prevalence of fake news, which is written in such a way that people are misled into believing the false information it presents[2]. Sources of fake news include people or bots that deliberately manipulate information for political agendas, and gossip stories on entertainment-related websites[2]. In the 2016 US presidential elections, fake news was found to have attracted greater engagement from social media users than the news published by conventional news sources[3]. Bovet and Makse[4] observed that 25% of the 30 million tweets containing links to news sources during the five months preceding the election date were either highly biased or untrue. False information has been found to diffuse faster, especially for politics-related news, and it evokes feelings like disgust and fear, as reflected in the replies given by readers[5]. There have been many instances where readers believe such news without verifying the authenticity of the news content from trustworthy sources[6, 7].

      This paper focuses on improving state-of-the-art techniques for identifying fake news on social media by using stylometric (linguistic) features and word vector representations of the textual content. The stylometric features and the various word vector features are then combined by applying ensemble methods: bagging, boosting and voting. No information related to the users or the media content in the news articles has been used, which is an advantage of our method because it does not require any other metadata and protects user privacy by using only the features of the text.

    • Many techniques have been proposed to identify fake news, including data mining and social network analysis methods. Shu et al.[2] classify fake news detection models into news content models and social context models. Conroy et al.[8] propose operational guidelines for designing a system for the verification of news. The authors report promising results with a hybrid approach that combines linguistic cues and machine learning with network-based behavioral data.

      Natural language processing techniques for the detection of fake news have been evaluated by Gilda[9]. Term frequency-inverse document frequency (TF-IDF)[10] of bi-grams and probabilistic context-free grammar (PCFG) detection were used with various models including stochastic gradient descent and gradient boosting. TF-IDF of bi-grams with a stochastic gradient descent model identified fake news with an accuracy of 77.2%. However, a purely vector-based approach cannot be used to analyze specific features and train the classifiers, as the vectors are specific to the particular training dataset.

      Ruchansky et al.[11] proposed a model with three modules: capture, score, and integrate. The first module is based on the response of the users and text present in the piece of news; it uses a recurrent neural network (RNN) to capture the temporal pattern of user activity on a given article. The second module learns the characteristics of the source based on the behavior of users, and in the third module, the previous two modules are integrated to classify an article as fake or not. This work combines text, response and source user information. This model detects fake news from the Twitter dataset with an accuracy of 89.2% and from Weibo with an accuracy of 95.3%.

      Buntain and Golbeck[12] used structural, content-based, user and temporal features to design a system to detect fake news in popular Twitter threads. The content-based features include polarity, subjectivity and disagreement. Their system′s applicability is limited to highly re-tweeted threads of Twitter conversations, and in real-life, most tweets are rarely re-tweeted. Their high performing model applied on the BuzzFeed dataset achieved an accuracy of 65.29%.

      A combination of textual and user features was used by Krishnan and Chen[13]. User features included the number of friends, number of followers, friend-to-follower ratio and whether the user has a verified URL or not. Textual features included tweet length, word count, number of question marks, number of exclamation marks, number of URLs, number of capital letters, number of hashtags, etc. With an accuracy of 80.68% on the Hurricane Sandy dataset, they obtained a high recall without compromising too much on precision. However, their selected features are only applicable to social networks that have a concept of friends, followers and user verification.

      Jin et al.[14] made one of the significant attempts to use images for verification of news by using visual features like clarity and coherence score, and statistical features of images like count and image ratio. Using these image features, they achieved the highest verification accuracy of 83.6%. This accuracy was boosted by more than 7% compared with other approaches that use non-image features only.

      Deep learning approaches, which have gained ground in the past few years, have also been used in fake news detection. Yang et al.[15] used both text and image information to train a model named the text and image information based convolutional neural network (TI-CNN). They used sentiment and lexical diversity for text. For images, they observed that real news had more images of faces whereas fake news had more irrelevant images. In their model, they used two parallel CNNs to extract latent features from both textual and visual information and achieved the highest precision of 0.92 and recall of 0.9227. CNNs, however, require a large dataset, and using them to analyze both text and images tends to be computationally expensive.

      If the source user's attributes and network information are available, that information can also be used to give a better judgment of the reliability of the news. However, it is sometimes not possible to obtain all this metadata about the user and user connections, especially due to user privacy concerns. Though writers of fake news try to frame the text in such a way that it appears genuine, fake news can be detected by observing some generic textual features. This work uses stylometric features of the text, i.e., features based on the style of writing, as well as word vector representations of the text for classifying the news.

      In Section 3, we explain our proposed methodology, which includes information about the dataset used, data pre-processing steps, feature extraction, feature selection and classification. The detailed results, along with discussions, are presented in the results and discussions section, which also includes a comparison with other research. Finally, we conclude our work and suggest some future directions.

    • Two datasets have been combined for evaluating the proposed methodology. The FakeNewsNet[16] dataset is the result of a data collection project for fake news research at Arizona State University, with labeling done on the basis of fact-checking websites like PolitiFact. The McIntire dataset[17] is hosted by George McIntire and contains a balanced collection of fake and real news. The dataset mainly includes political news ranging from left-wing to right-wing sources in relation to the 2016 US elections. Political news is deliberately manipulated to spread political propaganda, which is often done through social media bots, with this practice being prevalent during elections[2]. By observing the fake news samples in the dataset, it can be noticed that many political articles are intended to portray candidates from a certain political wing in a negative light, which can help shape people's minds for electoral gains.

    • After combining the two datasets, the final combined dataset has only 2 columns, one containing the text and the other containing the label. The training set contains 5405 news articles and the test set contains 1352 news articles. The news articles are mostly based on US politics. The training set has a balanced distribution with 2696 real news samples (49.9%) and 2709 fake news samples (50.1%), as depicted in Fig. 1.

      Figure 1.  Data distribution of real and fake news articles

    • Stylometry is the study of the linguistic features of a piece of text, usually used to verify the authenticity or authorship of text based on the linguistic style of writing. Linguistic features can be useful because writers change their style when producing deceptive content in order to hide their original writing style. Three feature sets were used as a reference for extracting stylometric features in the proposed methodology. Only the relevant features are extracted and used from the text-based dataset. The feature sets are:

      Feature set 1: It is a minimal feature set used for authorship attribution in the work by Brennan and Greenstadt[18]. The features include the number of unique words, complexity, Gunning-Fog index[19], character counts with and without whitespaces, average syllables per word, sentence count, average sentence length, and Flesch-Kincaid readability score. The Gunning-Fog index (GI) is a popular formula for estimating the number of years of education an individual needs to understand a piece of text on a first reading. The Flesch-Kincaid readability score (FS) is another readability test, developed by Flesch and Kincaid, that evaluates the difficulty of reading a piece of text. The higher the Flesch-Kincaid score, the easier the text is to read.

      $ GI = 0.4\left[\dfrac{\#words}{\#sentences} + 100\,\dfrac{\#difficult\;words}{\#words}\right] \quad\quad (1) $

      $ FS = 206.835 - 1.015\,\dfrac{\#words}{\#sentences} - 84.6\,\dfrac{\#syllables}{\#words}. \quad\quad (2) $
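      As an illustration, both scores can be computed directly from token counts. The sketch below is a minimal Python approximation of Eqs. (1) and (2); it assumes a crude vowel-group syllable counter and the common "3 or more syllables" convention for difficult words, whereas readability libraries use more careful heuristics.

```python
import re

def count_syllables(word):
    # Crude heuristic: each run of consecutive vowels counts as one syllable.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability_scores(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    # "Difficult" words for Gunning-Fog: 3 or more syllables (common convention).
    difficult = sum(1 for w in words if count_syllables(w) >= 3)
    gi = 0.4 * (len(words) / len(sentences) + 100 * difficult / len(words))  # Eq. (1)
    fs = 206.835 - 1.015 * len(words) / len(sentences) \
         - 84.6 * syllables / len(words)                                     # Eq. (2)
    return gi, fs
```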

      Feature set 2: This feature set is based on the lying detection dataset[20-22] and it includes features that were known to be quite effective in lie detection (Table 1).

      Category | Features
      Quantity | #syllables, #words, #sentences
      Vocabulary | #big words, #syllables per word
      Grammar | #short sentences, #long sentences, Flesch-Kincaid score, #words per sentence, sentence complexity, #conjunctions
      Uncertainty | #certainty words, #tentative words, #modal verbs
      Specificity | adjective and adverb rates, #affective terms
      Verbal non-immediacy | self-references, #first, second and third person pronouns

      Table 1.  Feature set 2 (Based on lying detection dataset)

      We consider a word to be a big word if it has 6 or more characters, and a sentence to be short if it contains 10 or fewer words. Tentative words include words like “likely” and “probably” that express uncertainty.
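      A minimal sketch of how a few of these counts could be extracted is given below; the tentative-word list is a small hypothetical sample, not the full lexicon used in the paper.

```python
import re

# Small hypothetical sample of tentative words, not the paper's full lexicon.
TENTATIVE = {"likely", "probably", "perhaps", "maybe", "possibly"}

def lie_detection_counts(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = [w.lower() for w in re.findall(r"[A-Za-z']+", text)]
    return {
        "big_words": sum(1 for w in words if len(w) >= 6),        # 6+ characters
        "short_sentences": sum(1 for s in sentences
                               if len(s.split()) <= 10),          # 10 or fewer words
        "tentative_words": sum(1 for w in words if w in TENTATIVE),
    }
```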

      Feature set 3: Zheng et al.[23] introduced the writeprints feature set for authorship attribution in short documents. The selection of features from writeprints is quite exhaustive and gives 73 attributes. The features selected from the writeprints set are listed in Table 2.

      Category | Features
      Character | #characters, % of digits, % of letters, % of uppercase letters, % of whitespace, letter frequencies, special character frequency
      Word | #words, % of short words (less than 4 chars), % of characters in words, avg. sentence length (chars), avg. sentence length (words), #different words, once-occurring words′ frequency, twice-occurring words′ frequency, different-length words′ frequency, Yule′s K measure
      Syntactic | punctuation and function word frequency
      Structural | #lines, #sentences, #paragraphs, #sentences per paragraph, #characters per paragraph, #words per paragraph, has a greeting, has quoted content, has URL
      Content | frequency of content-specific keywords

      Table 2.  Feature set 3 (Based on writeprints features)

      Function words are words that contribute to the syntax of a sentence rather than its meaning. Greeting words include “hello”, “good afternoon”, “good evening” and “good morning”. Yule′s K measure[24] is a commonly used technique for evaluating the vocabulary difficulty of texts through the measurement of their lexical richness.

    • The raw sequence of text data cannot be fed directly to the classifier. It has to be converted into vectors of numbers of a fixed size. Vector space models are often used to represent text documents in the form of vectors. Scikit-learn[25] vectorization methods as well as the Gensim[26] Word2Vec and FastText (FT) models have been used to convert words to vectors.

      Simple bag-of-words (BOW) count vector: Using the scikit-learn package, the whole dataset is tokenized and the frequency of occurrence of each of these tokens in every document is calculated. The output is a matrix M where every row represents a text document and every column represents a token (Fig. 2). $ M[i,j] $ is the count of token $ j $ in document $ i $, i.e., the number of times token $ j $ occurs in document $ i $. This is one of the simplest models for vectorization of documents.
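      For illustration, a count matrix of this kind can be built with scikit-learn's CountVectorizer; the two example documents below are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the senator denied the claim", "the claim went viral"]  # hypothetical documents
vectorizer = CountVectorizer()
M = vectorizer.fit_transform(docs)         # sparse matrix: rows = documents, columns = tokens
print(vectorizer.get_feature_names_out())  # the token vocabulary
print(M.toarray())                         # M[i, j] = count of token j in document i
```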

      BOW TF-IDF vector: Using scikit-learn[25], the whole dataset is tokenized and the TF-IDF metric of each of these tokens in every document is calculated. It gives a matrix as the output (Fig. 2), where every row gives the TF-IDF metric of every word present in that document (row); if the word is not present, the value is taken as 0. The TF-IDF of a token is calculated using the following two equations:

      Figure 2.  Count or TF-IDF matrix M

      $ TF(t) = \dfrac{n_{td}}{n_d} \quad\quad (3) $

      $ IDF(t) = \log\dfrac{D}{n_{t}} \quad\quad (4) $

      where $ n_{td} $ is the number of times the token $ t $ appears in a document $ d $, $ n_d $ is the total number of tokens in document $ d $, $ D $ is the total number of documents and $ n_t $ is the number of documents containing the token $ t $. These two matrices suffer from the problem of sparsity, with most of their entries being 0.
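      The TF-IDF matrix can be built analogously with scikit-learn's TfidfVectorizer. Note that this is only a sketch: scikit-learn's default formula smooths the IDF term and L2-normalizes each row, so its values differ slightly from the plain Eqs. (3) and (4).

```python
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()            # default: smoothed IDF, L2-normalized rows
M_tfidf = tfidf.fit_transform(docs)  # reusing `docs` from the count-vector sketch
```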

      Continuous bag of words (CBOW): It is a well known vector space model that generates dense vector representations for text. The CBOW architecture[27] predicts the target word from the context words given as input to a neural network. We can obtain the numerical vector form of each token in the text through this neural network architecture.

      Skip-gram (SG): In the skip-gram architecture[27], the current word is given as an input to a neural network to predict the context words. CBOW and skip-gram are implemented using Word2Vec and FastText in the Gensim[26] toolkit. In Word2Vec[27], a word is encoded into a vector using a relation between words and its surrounding words using a neural net, whose hidden layer encodes the vector. Proposed as an extension to Word2Vec for proper representation of rare words, FastText[28] breaks down words into several n-grams (sub-words) and combines the values of all these n-grams to give a single vector for a word.
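      A sketch of training these embeddings with Gensim is shown below, assuming the Gensim 4.x API and a hypothetical tokenized_docs list of token lists built from the news corpus.

```python
from gensim.models import Word2Vec, FastText

# tokenized_docs: hypothetical list of token lists, one per news piece.
cbow_w2v = Word2Vec(sentences=tokenized_docs, vector_size=100, window=5, sg=0)  # CBOW
sg_w2v = Word2Vec(sentences=tokenized_docs, vector_size=100, window=5, sg=1)    # skip-gram
cbow_ft = FastText(sentences=tokenized_docs, vector_size=100, window=5, sg=0)   # sub-word n-grams

vec = sg_w2v.wv["election"]  # 100-dimensional vector for a vocabulary word
```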

      After getting vectors for every token in the vocabulary, two methods are followed.

      Method 1: In this method, each token of a news piece is replaced by the mean of its embedding vector, and these mean values are concatenated in the order in which the tokens appear in the text document before being fed to the classification model. The size of the input vector equals the number of tokens in the longest news piece; shorter vectors are padded with zeros at the end.

      Method 2: A matrix of dimension ($ no\_of\_data\_samples \times vocabulary\_size $) is created. The mean of the j-th word′s embedding vector is placed in cell $ ij $ if that word is present in the i-th news piece, and 0 otherwise (similar to the count-vector and TF-IDF matrices given above). In this method, infrequently used words are pruned and the vocabulary size is capped at 10 000. Each row, being a vector, is given as input to the classifiers. The dataset is pruned to reduce the runtime and to extract only the main excerpt from each document.
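      A sketch of method 2 under these definitions might look as follows; vocab (the pruned vocabulary) and model (a trained Gensim embedding) are hypothetical names.

```python
import numpy as np

def method2_matrix(tokenized_docs, vocab, model):
    # Cell (i, j) = mean of word j's embedding if word j occurs in document i, else 0.
    word_index = {w: j for j, w in enumerate(vocab)}
    X = np.zeros((len(tokenized_docs), len(vocab)))
    for i, doc in enumerate(tokenized_docs):
        for w in set(doc):
            if w in word_index and w in model.wv:
                X[i, word_index[w]] = model.wv[w].mean()
    return X
```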

    • Feature selection is useful for selecting only those features that have an impact on the determination of whether a piece of news is fake or not. Feature selection is applied both on stylometric and word-vector features to reduce the dimensions of the dataset.

    • Recursive feature elimination (RFE) has been used for selecting the most important features from the stylometric feature sets. It iteratively removes the weakest (least important) features until the number of features in the dataset is reduced to a specified value.
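      A sketch with scikit-learn's RFE is shown below; X_stylometric and y are hypothetical names for the stylometric feature matrix and the labels, and the random forest estimator follows the paper's classifier choice.

```python
from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestClassifier

# Keep the 50 most important stylometric features.
rfe = RFE(estimator=RandomForestClassifier(), n_features_to_select=50)
X_reduced = rfe.fit_transform(X_stylometric, y)
```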

    • Some words were removed from the word vector space (vocabulary) using the WordNet lemmatizer and the Porter stemmer. Both stemming and lemmatization are used to reduce words to their root form, but stemming is a crude method that does not take into consideration the part of speech (POS) or the context of the words[29]. The chi-square test was used for feature selection to reduce the time complexity issues that arise when dealing with large vector spaces. Using this method, the top 25 000 important words were selected. The lemmatizer, stemmer and chi-square test were also combined to reduce the word vocabulary further and ensure good performance.
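      A sketch of this reduction pipeline is given below, assuming NLTK for lemmatization and stemming and scikit-learn's SelectKBest for the chi-square test; tokenized_docs, X_counts and y are hypothetical names.

```python
from nltk.stem import WordNetLemmatizer, PorterStemmer
from sklearn.feature_selection import SelectKBest, chi2

lemmatizer, stemmer = WordNetLemmatizer(), PorterStemmer()
normalized_docs = [" ".join(stemmer.stem(lemmatizer.lemmatize(w)) for w in doc)
                   for doc in tokenized_docs]  # lemmatize, then stem each token

# Keep the top 25 000 tokens by chi-square score; chi2 needs non-negative
# features, which count and TF-IDF matrices satisfy.
selector = SelectKBest(chi2, k=25000)
X_selected = selector.fit_transform(X_counts, y)
```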

      After every dimensionality reduction step, count-vectors, TF-IDF vectors and other word vectors are obtained and classification is performed.

    • We have used random forest (RF)[30], naive Bayes (NB)[31] (Gaussian and multinomial), support vector machine (SVM)[32, 33], k-nearest neighbors (KNN), logistic regression (LR), bagging[34] with the general bagging classifier and the extra trees classifier[35], and boosting with AdaBoost[36] and stochastic gradient boosting[37] for classification.

    • After applying classification methods directly on the stylometric and word vector features separately, we work on the combination of the two types of features. We combine the features using ensemble methods: bagging, boosting and voting.

      Bagging: In the bagging methodology, $ n $ random subsets (with replacement) are selected from the training set. A machine learning model is trained on each of these subsets. For a given test example, all the classifiers are used to make a prediction, and the final prediction is the mode of all the predictions.

      In this work, stylometric and word vector features are combined. In each of the $ n $ random subsets, a training example comprises both stylometric and word vector features. For all the training examples within a subset, the stylometric features are accumulated into the list $ stylometric\_subset $ and the word vector features are accumulated into the list $ word\_vectors\_subset $. A random forest classifier is trained on $ stylometric\_subset $ and a logistic regression classifier is trained on $ word\_vectors\_subset $. These two trained classifiers are applied to the stylometric features and word vector features, respectively, of all the examples in the testing set. The two predictions from the random forest classifier and the logistic regression classifier for each testing example are stored. This procedure is repeated for all the training subsets.

      We obtain 2n predictions for each testing example. The final prediction for a particular testing example is the mode of all the 2n predictions for the example.
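      A sketch of this procedure is given below; X_sty/X_wv (training features), X_sty_test/X_wv_test (test features) and y (labels) are hypothetical names, and the subset size m follows the value of 700 used in the experiments later.

```python
import numpy as np
from scipy import stats
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

def ensemble_bagging(X_sty, X_wv, y, X_sty_test, X_wv_test, n=15, m=700):
    rng = np.random.default_rng(0)
    preds = []
    for _ in range(n):
        idx = rng.choice(len(y), size=m, replace=True)  # random subset with replacement
        rf = RandomForestClassifier().fit(X_sty[idx], y[idx])
        lr = LogisticRegression(max_iter=1000).fit(X_wv[idx], y[idx])
        preds.append(rf.predict(X_sty_test))  # RF votes from stylometric features
        preds.append(lr.predict(X_wv_test))   # LR votes from word vectors
    # Final label = mode of the 2n predictions for each test sample.
    return stats.mode(np.array(preds), axis=0).mode.ravel()
```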

      Boosting: In this method, the vector of stylometric features is combined with the word vector representation of texts to give one vector for each sample. Then, the boosting algorithm is applied for training and testing on these vectors.
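      As a sketch, the concatenation and boosting step could look like this, reusing the hypothetical names from the bagging sketch above:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

X_train = np.hstack([X_sty, X_wv])  # one combined vector per sample
booster = GradientBoostingClassifier().fit(X_train, y)
predictions = booster.predict(np.hstack([X_sty_test, X_wv_test]))
```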

      Voting: In this method, the vector of stylometric features is combined with the word vector representation of texts to give one vector for each sample. Three classifiers are trained on all the training samples and the final prediction for testing is simply the vote of the predictions from the three classifiers.
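      A sketch with scikit-learn's VotingClassifier is shown below, using the LR, RF and AdaBoost combination that performs best in the voting experiments reported later (Table 16); X_combined and y are hypothetical names for the concatenated features and labels.

```python
from sklearn.ensemble import VotingClassifier, RandomForestClassifier, AdaBoostClassifier
from sklearn.linear_model import LogisticRegression

voter = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("rf", RandomForestClassifier()),
                ("ada", AdaBoostClassifier())],
    voting="hard",  # final label = majority vote of the three predicted labels
)
voter.fit(X_combined, y)
```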

    • We measure the performance of the classifiers for both stylometric and word vector features using accuracy, precision, recall and F-score. The classifiers are trained on the training set (5 405 articles) and tested on the test set (1 352 articles). We report the accuracy, precision, recall and F-score obtained by the classification model on the test set.

    • The results obtained on applying naive Bayes and random forest classifier on stylometric features (three feature sets) without performing feature selection are tabulated in Table 3.

      Feature set | Classifier | Acc. (%) | Prec. | Rec. | F1
      Set 1 | RF | 59 | 0.63 | 0.51 | 0.57
      Set 1 | NB | 62 | 0.59 | 0.96 | 0.73
      Set 2 | RF | 78 | 0.79 | 0.75 | 0.77
      Set 2 | NB | 59 | 0.67 | 0.34 | 0.45
      Set 3 | RF | 83 | 0.85 | 0.79 | 0.82
      Set 3 | NB | 69 | 0.71 | 0.63 | 0.67

      Table 3.  Results with RF and NB on the 3 feature sets

      Feature set 1 gives a very poor accuracy with both random forest and naive Bayes, indicating that the 9 features belonging to feature set 1 are not enough to detect the nature of the news. The writeprints-based feature set 3 has an exhaustive set of features and gives much better accuracy, especially with the random forest classifier. We thus continue further only with feature set 3. In the following sections, references to stylometric features mean feature set 3.

    • All the features are not equally important in determining the authenticity of a piece of news. The extra trees classifier[35] is used to analyse the importance of the stylometric features. It fits a number of randomized decision trees on subsamples of the dataset. Then, averaging is used to improve the predictive accuracy and combat the problem of overfitting. The forest obtained from the classifier is used to get the importance of all the features in the dataset. In feature set 3 (Table 2), the ten features with highest importance values are “has quoted content”, “has URL”, “% of uppercase letters”, “frequency of punctuation”, “frequency of words of length 15”, “% of whitespaces”, “frequency of words of length 14”, “average sentence length in words”, “frequency of words of length 12” and “frequency of words of length 11”. In Fig. 3, it is observed that real news has a very high average number of quotes compared to fake news. This might be because real news is substantiated with quotes, thus verifying its authenticity. On the other hand, fake news does not contain any evidence and hence lacks enough quotes.

      Figure 3.  Statistics for quoted content
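      For illustration, the importance ranking described above could be obtained as in the sketch below; X_sty, y and feature_names are hypothetical names for the stylometric feature matrix, the labels and the feature set 3 attribute names.

```python
from sklearn.ensemble import ExtraTreesClassifier

etc = ExtraTreesClassifier(n_estimators=100).fit(X_sty, y)
# Pair each feature name with its importance and keep the ten largest.
top10 = sorted(zip(feature_names, etc.feature_importances_),
               key=lambda pair: pair[1], reverse=True)[:10]
```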

      In the case of uppercase letters (Fig. 4), the average percentage of uppercase characters in fake news is much higher than that in real news. This is because fake news is more dramatic with copious use of uppercase letters to make it a click-bait for the readers.

      Figure 4.  Statistics for upper case letters

      The least important features, which add no value to the dataset, are the frequency distribution of words with a length of 21 characters, the total number of lines and the total number of paragraphs. This is because information about lines and paragraphs is lost while collecting and integrating information from various websites.

    • Recursive feature elimination is used for selecting the 50 most important features in the feature set 3 (Table 4). We observe that the maximum accuracy with the random forest classifier on feature set 3 is obtained on selecting 50 features.

      #Selected features | Acc. (%) | Prec. | Rec. | F1
      35 | 82 | 0.82 | 0.80 | 0.81
      40 | 82 | 0.84 | 0.78 | 0.81
      45 | 83 | 0.84 | 0.80 | 0.82
      50 | 84 | 0.87 | 0.79 | 0.82

      Table 4.  Accuracies for random forest obtained after feature selection using recursive feature elimination

      After selecting only the 50 most important features, classification methods are applied to the dataset. The results obtained with RF, NB, SVM, LR and KNN are shown in Table 5. It is observed that RF gives the best performance. SVM overfits the data: the accuracy obtained on the training set is almost 100%, but on the test set it is poor. The results of SVM did not improve despite varying the regularization parameter; SVM might be overfitting the highly noisy training data. Another disadvantage of the SVM is its high time complexity when training on large datasets. The results obtained by using ensemble methods are tabulated in Table 6. Note that GBC = general bagging classifier, ETC = extra trees classifier and GB = gradient boosting classifier.

      Classifier | Acc. (%) | Prec. | Rec. | F1
      RF | 82.5 | 0.838 | 0.80 | 0.819
      NB-Multinomial | 67 | 0.64 | 0.76 | 0.69
      NB-Gaussian | 70 | 0.71 | 0.66 | 0.68
      SVM | 53 | 0.51 | 1.0 | 0.676
      LR | 75.7 | 0.716 | 0.84 | 0.77
      KNN | 67 | 0.68 | 0.62 | 0.65

      Table 5.  Results with basic classifiers

      Ensemble model | Acc. (%) | Prec. | Rec. | F1
      GBC with KNN | 69.6 | 0.726 | 0.614 | 0.665
      GBC with RF | 84.3 | 0.84 | 0.84 | 0.84
      ETC | 83 | 0.83 | 0.82 | 0.82
      GB | 86 | 0.86 | 0.85 | 0.86
      AdaBoost | 85 | 0.85 | 0.83 | 0.84

      Table 6.  Bagging and boosting

      Ensemble methods are known to give better accuracy than their constituents and thus bagging and boosting have been used to improve the accuracy. Gradient boosting gives an accuracy of 86%.

      The performance of different classifiers on count, TF-IDF, CBOW and skip-gram vectors is evaluated. Table 7 shows the performance of the NB, RF and LR classifiers on the dataset without pruning the vocabulary. It can be inferred that the precision, recall and F1 scores of both the NB and LR classifiers are good. LR performs better than the other two classifiers on both count and TF-IDF vectors.

      Feature | Classifier | Acc. (%) | Prec. | Rec. | F1
      BOW count | NB | 87.50 | 0.90 | 0.83 | 0.87
      BOW count | RF | 80.32 | 0.85 | 0.73 | 0.79
      BOW count | LR | 92.08 | 0.92 | 0.92 | 0.92
      BOW TF-IDF | NB | 78.92 | 0.97 | 0.59 | 0.73
      BOW TF-IDF | RF | 80.55 | 0.84 | 0.74 | 0.79
      BOW TF-IDF | LR | 89.49 | 0.88 | 0.91 | 0.895

      Table 7.  Results without any reduction in vocabulary dimension

      Tables 8-10 give the performance of the three classifiers on the vocabulary with different pruning and dimension reduction methods. For the results in Table 8, the vocabulary dimension is reduced by using only the text pre-processing methods of lemmatization and stemming. LR performs better than the other two classifiers, especially with the TF-IDF vectors. In the case of Table 9, the chi-square test has also been used for dimensionality reduction along with stemming and lemmatization. In Table 10, only the chi-square test is used for vocabulary dimension reduction. It is observed that LR performs better on TF-IDF vectors than on count vectors after vocabulary dimension reduction. This difference in accuracy is most pronounced when all three methods are used for dimensionality reduction (Table 9).

      Feature | Classifier | Acc. (%) | Prec. | Rec. | F1
      BOW count | NB | 82.25 | 0.88 | 0.74 | 0.80
      BOW count | RF | 72.85 | 0.88 | 0.52 | 0.65
      BOW count | LR | 83.73 | 0.88 | 0.77 | 0.824
      BOW TF-IDF | NB | 78.99 | 0.97 | 0.59 | 0.74
      BOW TF-IDF | RF | 79.73 | 0.83 | 0.73 | 0.78
      BOW TF-IDF | LR | 89.72 | 0.88 | 0.913 | 0.897

      Table 8.  Results with reduction in vocabulary dimension using only lemmatization and stemming

      Feature | Classifier | Acc. (%) | Prec. | Rec. | F1
      BOW count | NB | 84.17 | 0.87 | 0.795 | 0.83
      BOW count | RF | 74.85 | 0.85 | 0.598 | 0.70
      BOW count | LR | 82.91 | 0.82 | 0.83 | 0.83
      BOW TF-IDF | NB | 82.84 | 0.96 | 0.68 | 0.796
      BOW TF-IDF | RF | 82.02 | 0.85 | 0.77 | 0.81
      BOW TF-IDF | LR | 89.13 | 0.87 | 0.91 | 0.89

      Table 9.  Results with reduction in vocabulary dimension using lemmatization, stemming, chi-square test

      Feature | Classifier | Acc. (%) | Prec. | Rec. | F1
      BOW count | NB | 84.98 | 0.88 | 0.80 | 0.84
      BOW count | RF | 75.81 | 0.89 | 0.58 | 0.70
      BOW count | LR | 85.05 | 0.90 | 0.78 | 0.84
      BOW TF-IDF | NB | 82.91 | 0.96 | 0.68 | 0.79
      BOW TF-IDF | RF | 81.88 | 0.85 | 0.77 | 0.81
      BOW TF-IDF | LR | 89.28 | 0.88 | 0.91 | 0.89

      Table 10.  Classifier results for mentioned features with reduction in vocabulary dimension using chi-square test

      Figs. 5 and 6 are plots for count vectors and TF-IDF respectively that show the accuracy of NB and LR in four cases: (1) without vocabulary dimension reduction, (2) dimension reduction using lemmatization and stemming, (3) dimension reduction using only the chi-square test for feature selection, and (4) dimension reduction using lemmatization, stemming and the chi-square test. Observing the accuracy of NB and LR in the case of count vectors (Fig. 5), it is concluded that both classifiers perform well when the vocabulary is not pruned. Logistic regression performs well in all the cases. Random forest′s performance is average, but over-fitting is found in all cases (pruned and not pruned), hence it is not compared with LR and NB in the plots (Figs. 5 and 6). The TF-IDF curve (Fig. 6) shows that LR performs better than NB in all four cases, with the highest accuracy in the feature-selection and dimension-reduction model. The performance of NB improves with the pruning of the vocabulary. From Tables 7-10, it is clear that the precision, recall and F1 scores of the LR classifier are good.

      Figure 5.  Performance of classifiers on count vector features

      Figure 6.  Performance of classifiers on TF-IDF vector features

      The classifiers do not perform well on word vector features when the method 1 structure for skip-gram and CBOW (described in Section 3.3.2) is used as input to the classifiers (Table 11).

      Feature | Classifier | Acc. (%) | Prec. | Rec. | F1
      CBOW-W2V | NB | 50.89 | 0.67 | 0.01 | 0.01
      CBOW-W2V | RF | 64.57 | 0.67 | 0.54 | 0.60
      CBOW-FT | NB | 50.89 | 0.67 | 0.01 | 0.01
      CBOW-FT | RF | 64.94 | 0.67 | 0.56 | 0.61
      SG-W2V | NB | 50.89 | 0.67 | 0.01 | 0.01
      SG-W2V | RF | 71.23 | 0.74 | 0.65 | 0.69
      SG-FT | NB | 50.89 | 0.67 | 0.01 | 0.01
      SG-FT | RF | 65.38 | 0.67 | 0.58 | 0.62

      Table 11.  Classifier results for CBOW, skip-gram (poor performance with method 1 vector structure as input)

      Fig. 7 shows the performance of the word embedding models Word2Vec and FastText. This plot is obtained using the results of Table 12, where method 2 (described in Section 3.3.2) is used to produce the classifier input. The CBOW-W2V, CBOW-FT, SG-W2V and SG-FT embedding methods perform very well with LR compared to all other methods. Overall, the results obtained with skip-gram and CBOW are better than those obtained using count and TF-IDF vectors. NB′s performance is average. Random forest was also applied to the obtained word vector values, but it has an average performance due to over-fitting on the training set.

      Figure 7.  Performance of classifiers on word to vector embedded features

      Feature | Classifier | Acc. (%) | Prec. | Rec. | F1
      CBOW-W2V | NB | 80.32 | 0.75 | 0.90 | 0.82
      CBOW-W2V | RF | 86.39 | 0.90 | 0.81 | 0.85
      CBOW-W2V | LR | 93.42 | 0.94 | 0.93 | 0.93
      CBOW-FT | NB | 76.63 | 0.70 | 0.92 | 0.80
      CBOW-FT | RF | 86.61 | 0.89 | 0.83 | 0.86
      CBOW-FT | LR | 92.3 | 0.93 | 0.92 | 0.92
      SG-W2V | NB | 75.07 | 0.68 | 0.93 | 0.79
      SG-W2V | RF | 86.39 | 0.90 | 0.81 | 0.85
      SG-W2V | LR | 90.09 | 0.88 | 0.92 | 0.90
      SG-FT | NB | 74.56 | 0.67 | 0.93 | 0.78
      SG-FT | RF | 86.17 | 0.88 | 0.83 | 0.86
      SG-FT | LR | 92.3 | 0.92 | 0.93 | 0.92

      Table 12.  Results for CBOW, skip-gram features (Good performance: Method 2 vector structure as input)

      The reduction of the vocabulary dimension also did not resolve the poor performance obtained with the method 1 vector structure.

    • Both stylometric features (feature set 3) and the word vector features are used to get the combined prediction. The features are combined using ensemble methods: bagging, boosting and voting.

      1) Bagging

      In the bagging methodology, $ n $ random subsets (with replacement) are selected from the training set. $ m $ is the number of samples in each subset. $ m $ is taken as 700 for all cases. In our experiment, we take $ n $ subsets of the training set, each containing 700 samples (news pieces). For each subset, we train a RF classifier on the stylometric features of the samples in the subset and LR classifier on the word-vector representation of the samples. Hence, for $ n $ subsets, we obtain a total of $ n $ RF classifiers and $ n $ LR classifiers. These classifiers are then used to predict the label for the test set samples, with each test set sample getting 2n predictions ($ n $ RF classifiers applied to the stylometric features and $ n $ LR classifiers applied to the word vector representations of the test set samples). The mode of the predictions is taken as the final predicted label for each test sample. The results of this methodology of classification are tabulated in Table 13, with varying values of $ n $. The details have been explained in Section 3.6.

      Word vectors | n | Acc. (%) | Prec. | Rec. | F1
      SG (W2V) | 10 | 86.76 | 0.87 | 0.86 | 0.86
      SG (W2V) | 15 | 87.5 | 0.87 | 0.87 | 0.87
      CBOW (FT) | 10 | 90.01 | 0.91 | 0.88 | 0.89
      CBOW (FT) | 15 | 90.90 | 0.92 | 0.89 | 0.91
      CBOW (W2V) | 10 | 90.24 | 0.91 | 0.88 | 0.90
      CBOW (W2V) | 15 | 91.20 | 0.93 | 0.89 | 0.91
      SG (FT) | 10 | 88.83 | 0.89 | 0.88 | 0.89
      SG (FT) | 15 | 90.09 | 0.90 | 0.90 | 0.90

      Table 13.  Results of bagging on both feature set 3 and word vectors

      Skip-gram and CBOW (both Word2Vec and FastText) have been considered as word vector representations for applying bagging. The highest accuracy of 91.20% is observed in the case of CBOW (W2V), with good precision, recall and F-score.

      2) Boosting

      Gradient boosting (Grad) and AdaBoost (Ada) are used to combine the feature set 3 and word vector based features, as shown in Table 14. It is observed that the gradient boosting classifier works the best with CBOW(W2V) and AdaBoost works well with all vector models.

      Model | Word vector model | Acc. (%) | Prec. | Rec. | F1
      Grad | CBOW (FT) | 94.82 | 0.95 | 0.95 | 0.95
      Grad | CBOW (W2V) | 95.49 | 0.95 | 0.95 | 0.95
      Grad | SG (FT) | 95.12 | 0.95 | 0.95 | 0.95
      Grad | SG (W2V) | 95.12 | 0.95 | 0.95 | 0.95
      Ada | CBOW (FT) | 94.53 | 0.95 | 0.94 | 0.94
      Ada | CBOW (W2V) | 94.53 | 0.95 | 0.94 | 0.94
      Ada | SG (FT) | 94.53 | 0.95 | 0.94 | 0.94
      Ada | SG (W2V) | 94.53 | 0.95 | 0.94 | 0.94

      Table 14.  Using boosting on feature set 3 + word vectors

      3) Voting

      All the voting methods used are “hard” voting as soft voting results are average (Tables 15 and 16).

      Classifiers | Weights | Acc. (%) | Prec. | Rec. | F1
      NB+LR+Ada | (1, 1, 1) | 80.84 | 0.77 | 0.88 | 0.82
      NB+LR+Ada | (2, 2, 1) | 80.84 | 0.77 | 0.88 | 0.82
      Bag+LR+Ada | (1, 1, 1) | 84.02 | 0.82 | 0.87 | 0.84
      Bag+LR+Ada | (2, 2, 1) | 84.026 | 0.82 | 0.87 | 0.84
      LR+RF+Ada | (1, 1, 1) | 90.09 | 0.899 | 0.899 | 0.899
      LR+RF+Ada | (2, 2, 1) | 90.38 | 0.89 | 0.92 | 0.903

      Table 15.  Results for voting on feature set 3 + TF-IDF (post feature selection)

      Classifiers | WV | Acc. (%) | Prec. | Rec. | F1
      NB+LR+RF | CBOW (W2V) | 82.40 | 0.77 | 0.92 | 0.84
      NB+LR+RF | CBOW (FT) | 83.21 | 0.77 | 0.93 | 0.84
      NB+LR+RF | SG (W2V) | 78.03 | 0.72 | 0.92 | 0.80
      NB+LR+RF | SG (FT) | 80.70 | 0.74 | 0.94 | 0.83
      NB+LR+Ada | CBOW (W2V) | 84.62 | 0.78 | 0.96 | 0.86
      NB+LR+Ada | CBOW (FT) | 85.28 | 0.79 | 0.96 | 0.86
      NB+LR+Ada | SG (W2V) | 80.25 | 0.73 | 0.95 | 0.826
      NB+LR+Ada | SG (FT) | 82.17 | 0.74 | 0.97 | 0.84
      LR+RF+Ada | CBOW (W2V) | 90.83 | 0.90 | 0.91 | 0.91
      LR+RF+Ada | CBOW (FT) | 90.90 | 0.90 | 0.92 | 0.91
      LR+RF+Ada | SG (W2V) | 91.20 | 0.90 | 0.92 | 0.91
      LR+RF+Ada | SG (FT) | 91.94 | 0.91 | 0.93 | 0.92

      Table 16.  Results for voting on feature set 3 + wordvector features (WV)

      The ensemble of logistic regression, random forest and AdaBoost is applied to the combination of writeprints-based stylometric features and skip-gram (FastText) vectors to obtain the voting results. This ensemble performed best compared to ensembles of other classifiers with respect to voting.

      The maximum accuracy obtained by using bagging is 91.20%, and voting on the combination of skip-gram (FT) and writeprints-based features gives 91.94%, with precision, recall and F1-score above 90%. Boosting using the gradient boosting algorithm on the combination of CBOW (Word2Vec) and stylometric features gives an accuracy of 95.49%, with a precision, recall and F-score of 95%. On studying the fake news samples that are mislabeled as real news in the test set, we note that those misclassified samples have a greater average amount of quoted content than usual. On manually inspecting some of those misclassified texts, we note that in some places quotes have been used to emphasize certain words or phrases.

      Table 17 gives a comparison of the proposed method with other text-based methods. A hybrid CNN and RNN[38] identifies fake news with an accuracy of 82% on a dataset based on tweets during five major events: Charlie Hebdo, Sydney Siege, Ottawa Shooting, Ferguson Shootings and the Germanwings crash. However, deep learning models need large datasets. TF-IDF of bi-grams with stochastic gradient descent identifies pieces of fake news in the dataset published by Signal Media with an accuracy of 77.2%[9]. Gogate et al.[39] achieved an accuracy of 84% in unimodal deception detection using CNNs; their work uses audio, textual and visual cues, but we compare only with the result obtained on text. On the PolitiFact-based news of FakeNewsNet, Shu et al.[16, 40] achieved an accuracy of 69% with Social Article Fusion. In another work that uses a combination of publisher-news relationships and user-news interactions, PolitiFact gives an accuracy of around 88%[41]. Paschalides et al.[42] developed a browser plugin using linguistic, stylistic, complexity and psychological features of news that gives an accuracy of 72% on the PolitiFact data.

      Work | Method | Acc. (%)
      Hybrid CNN and RNN[38] | Automatic feature identification using LSTM and CNN | 82
      TF-IDF bigrams[9] | Stochastic gradient descent | 77.2
      CNNs on text[39] | Unimodal deception detection through text with CNNs | 84
      Proposed method | Boosting on stylometric and word vector features | 95.49

      Table 17.  Comparison with other works (only text based)

      A positive research outcome has been obtained in the proposed work. By using boosting methodology on a combination of word vectors and stylometric features, a precision of 95% and recall of 95% are obtained. This implies that 95% of the fake news is detected successfully (high recall) with only textual analysis. The advantage of the proposed methodology is that only textual features are used. Features related to user information may not be available in some cases and media-based features increase the time complexity of processing.

    • The writeprints-based feature set is exhaustive for stylometric fake news detection but needs to be combined with word vectors to achieve better accuracy. The most important stylometric features for differentiating fake and real news include the amount of quoted content and uppercase letters. Ensemble methods including random forests, stochastic gradient boosting and the extra trees classifier worked best on the stylometric features. Among vectorized representations of text, TF-IDF vectors and skip-gram Word2Vec features give good results with a lower run time. Logistic regression performs well for all types of word vector features, while the performance of the naive Bayes classifier fluctuates. Individually, word vectors give much better accuracy than classification models applied to stylometric features, underlining the importance of the implicit information present in the vectorized representation of text. Though word vector representations of text give better accuracies, we are able to achieve a good accuracy of 86% for stylometric features with the gradient boosting classifier. We also see that some stylometric features, including the presence of quoted content and uppercase letters, are significant for differentiating real and fake news, highlighting the importance of the style in which news is written. We observe that random forest gives increasing accuracy as we go from feature set 1 to feature set 3. The writeprints feature set, which combines structural, syntax-based, lexical and content-specific features for attributing authorship in short documents, is the most exhaustive set of stylometric features[23]. These features have been used in the literature for authorship attribution of online content, achieving accuracies between 70% and 95%. They work well for fake news detection and can capture the author′s style of writing without knowing anything about the source of the news, thus compensating for the decision not to use any metadata related to the news.

      One important thing we notice in our work is the power of ensemble methods. The use of ensemble methods (bagging, boosting and voting) has helped to improve the results significantly with both stylometric and word vector features. As we observe in the case of feature set 3 (stylometric features), random forest gives an accuracy 6.8% higher than the next best classifier, which is logistic regression. Gradient boosting and AdaBoost further outperform the random forest classifier. Even with a modestly sized training set, boosting achieves a large improvement because of its focus on news samples that are difficult to classify. This is consistent with the fact that boosting has improved predictive performance on several benchmark datasets across different fields, as it iteratively fits and gives more weight to samples that are misclassified[43, 44]. An accuracy of 95.49% is obtained using the boosting method on the combination of both stylometric and word vector features. This also highlights that using only textual features can give good accuracy in the identification of fake news without relying on user-related information, which is often private.

      Our work is mainly based on the political domain and a few other fields because of the minimal availability of curated and labelled datasets. Future research can focus on the creation of more diverse datasets and their analysis with reduced time complexity, making the approach suitable for real-time applications.

      Online training could be used to incorporate real time data. Images could be taken into consideration along with the features considered here in order to improve overall results, preferably using pretrained models. Newer datasets covering more domains can be used for better generalization.
