The key conceptual difference between traditional statistics and data science-driven approaches is that data scientists do not commit to a single machine learning algorithm for their research problem. Rather, they apply several algorithms and identify which one leads to the best outcomes with their data. Zhang and Counts (2015), for example, tested logistic regression, decision trees, random forests and support vector machines to identify which characteristics of online discussion predict policy changes. Yu et al. (2008) examined different ways of using support vector machines and naïve Bayes models to classify members of Congress into their parties. Both of these are supervised learning problems: in Zhang and Counts (2015), the dependent variable is whether a policy change occurred, while for Yu et al. (2008), it is party affiliation (Republican or Democrat). Both problems, and indeed any supervised machine learning problem, can be tackled with any supervised machine learning approach (such as naïve Bayes). Therefore, data analysis is to some degree algorithm agnostic.
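This algorithm-agnostic workflow can be sketched in a few lines with scikit-learn: fit several candidate classifiers on the same labelled data and compare their results. The synthetic data set and the particular model settings below are illustrative assumptions, not from the studies cited above.

```python
# A minimal sketch of the algorithm-agnostic workflow: apply several
# supervised classifiers to the same (synthetic, illustrative) data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

# Synthetic labelled data standing in for a real research data set.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(random_state=0),
    "support vector machine": SVC(),
    "naive Bayes": GaussianNB(),
}

# Fit each model and record how well it classifies the data.
scores = {name: model.fit(X, y).score(X, y) for name, model in models.items()}
for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

In practice the comparison would use the evaluation safeguards discussed later in this section rather than scoring on the same data the models were fit to; the sketch only shows the "try several algorithms" pattern itself.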
This means data analysis must include practices for evaluating the quality of the models. A central concern is naturally how correct the classifications are. Social scientists use R² to understand how well a model explains the data. In quantitative content analysis, Cohen's kappa or Cronbach's alpha are used to measure intercoder reliability. The machine learning community similarly has measures that evaluate the quality of a classification, such as accuracy, the confusion matrix, the F1 score or the mean absolute error (see Table 3.1 for details). All of them describe how well the classification compares with real data, the ground truth.
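These measures are all straightforward to compute once predicted labels and ground truth labels are side by side. A small sketch, using hand-made example labels (not data from any study) and scikit-learn's metric functions, including Cohen's kappa for comparison with content-analysis practice:

```python
# Common classification quality measures, computed on toy example labels.
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             f1_score, cohen_kappa_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # ground truth (e.g. hand-coded labels)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # labels produced by the classifier

print("accuracy:", accuracy_score(y_true, y_pred))          # 0.75: share of correct labels
print("F1 score:", f1_score(y_true, y_pred))                # 0.75: balances precision and recall
print("Cohen's kappa:", cohen_kappa_score(y_true, y_pred))  # 0.5: chance-corrected agreement
print(confusion_matrix(y_true, y_pred))                     # rows: truth, columns: prediction
```

The confusion matrix here is [[3, 1], [1, 3]]: three true negatives and three true positives on the diagonal, with one error of each kind off the diagonal.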
However, this approach to data analysis is problematic because it can lead to overfitting: the classification system may focus on features that are prominent in the data set at hand but do not generalise to a larger population or to other data sets. Machine learning researchers use two practices to address this challenge. First, they conduct a train-test split with their labelled example data. They separate part of the data for model development purposes only (the training data). The rest of the data is set aside to be used only when the model is ready (the test data). When the model, built on the training data, is ready, it is used to classify the test data. The values predicted by the model can then be compared with the ground truth values in the test data, since the test data was also labelled. A similar approach has already influenced how statistical analysis is done in some disciplines, as it is an effective way to provide some replication of the results without the need to collect 'additional' data. Second, researchers use k-fold cross-validation with their training data. In cross-validation, the data analysis process is run several times, each time with a different smaller partition of the training data held out for evaluation. This means that even during model development each model is evaluated with a similar train-test split approach, so the fit of candidate models to the training data can be better assessed during model development.
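The two practices can be sketched together: hold out a test set first, run k-fold cross-validation within the training portion while developing the model, and only touch the test data at the very end. The synthetic data, the 80/20 split and the choice of five folds below are illustrative assumptions.

```python
# A minimal sketch of the train-test split and k-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic labelled data standing in for a real research data set.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# 1) Train-test split: set aside 20% of the labelled data as test data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# 2) k-fold cross-validation during model development, on training data only:
#    the training data is partitioned into five folds, and the model is
#    repeatedly fit on four folds and evaluated on the held-out fifth.
model = GaussianNB()
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
print("cross-validation accuracy:", cv_scores.mean())

# Only once the model is final: fit on all training data and evaluate
# against the untouched test data (the ground truth comparison).
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```

The key discipline is that the test data plays no role in model development; the cross-validation scores guide choices between candidate models, and the single final test score estimates generalisation.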