There is increasing interest in approaching textual data sets through algorithmic data analysis tools (Grimmer and Stewart, 2013; Lucas et al., 2015). For such goals, one must transform the text data into a quantitative format. A simple approach is to treat different words as `variables' and count the number of times each word occurs in the text (see Figure 3.2). In Table 3.2b, `a' is a variable that occurred once, `Alice' is a variable that occurred twice and `rabbit' did not occur at all in the example sentence. The sentence has features like `a', `Alice' and `without'. Thus, the sentence is now presented in a quantitative format.
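This word-counting step can be sketched in a few lines of Python. The sentence below is a hypothetical stand-in (the actual example sentence behind Table 3.2b is not reproduced here); it is chosen so that `a' occurs once, `Alice' twice and `rabbit' not at all, matching the counts discussed above.

```python
from collections import Counter

# Hypothetical example sentence (a stand-in for the one counted in Table 3.2b)
sentence = "without thinking Alice asked a question and Alice smiled"

# Split on whitespace and count how often each word (variable) occurs
counts = Counter(sentence.split())

print(counts["Alice"])   # 2
print(counts["a"])       # 1
print(counts["rabbit"])  # 0 (a Counter reports zero for absent words)
```

A `Counter` is convenient here precisely because it returns 0 for words that never occur, which is how absent variables such as `rabbit' are represented.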
The example process reveals some issues with this naïve approach. While `a' is present in the sentence, it does not usually provide much value for analysis (nor do words like `an' or `the'). Therefore, we have variables that we assume to be only noise in the analysis. `Had' and `having' refer to the same phenomenon, but they are written differently because of grammatical rules (similarly, `picture' and `pictures'). Often when working with textual data, one must first pre-process it to remove common non-meaningful words (like `a', `and', `the') and transform words to a common form or stem (`had' and `having' would both become `have'). Similarly, we removed punctuation (?, “, ” and , in this case), assuming it did not add value for the analysis. Pre-processing is often seen as a mechanical process that is easily applied to the data set. However, pre-processing choices can change the outcomes of algorithmic data analysis (e.g. Denny and Spirling, 2018); we return to this topic in Chapter 10.1.
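The pre-processing steps described above can be sketched as follows. The stop-word list and the stem mapping are deliberately tiny toy assumptions for illustration; real analyses would use a fuller stop-word list and a proper stemmer or lemmatiser.

```python
import re

# Toy stop-word list and stem map (assumptions for illustration only,
# not a full linguistic resource)
STOP_WORDS = {"a", "an", "and", "the"}
STEMS = {"had": "have", "having": "have", "pictures": "picture"}

def preprocess(text):
    # Lowercase and keep only letter sequences, assuming punctuation
    # adds no analytical value
    tokens = re.findall(r"[a-z]+", text.lower())
    # Drop stop words and map inflected forms to a common stem
    return [STEMS.get(t, t) for t in tokens if t not in STOP_WORDS]

print(preprocess("Having a picture? Alice had the pictures."))
# ['have', 'picture', 'alice', 'have', 'picture']
```

Note how the choices baked into `STOP_WORDS`, `STEMS` and the tokenising pattern are exactly the kinds of pre-processing decisions that, as noted above, can change the outcomes of the analysis.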
The above-mentioned process is known as a bag of words, highlighting that we are not aware of word order or context in this analysis. This means that the quantification of `I love her' is almost identical to that of `I love her not'. There are also more advanced methods, known as word embeddings, that seek to account for word order in the quantification of words. That is, word embeddings examine words not as singular occurrences but in the context of nearby words. Alternatively, words can be analysed in the context of the sentence, automatically recognising subjects, objects and verbs in that sentence. Therefore, it is possible to avoid breaking context and word order if they seem critical for answering the research question. However, quite often a bag-of-words approach is sufficient for the analysis.
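The loss of word order can be demonstrated directly: any reordering of a sentence yields an identical bag of words, and the two sentences from the example above differ only in a single count.

```python
from collections import Counter

# Reorderings produce identical bag-of-words representations
bag_a = Counter("I love her".split())
bag_b = Counter("her love I".split())
print(bag_a == bag_b)  # True

# 'I love her not' differs from 'I love her' only by the token 'not'
bag_c = Counter("I love her not".split())
print(bag_c - bag_a)  # Counter({'not': 1})
```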
Often the analysis is not focused on a single text but on a collection of texts. Each text is then a document, and the word counts for each document (similar to Table 3.2b) can be stacked into a matrix. In a document-term matrix, each document is presented in its own row and each word in its own column. Choosing what counts as a document is often more craft than science. Documents correspond to the unit of analysis in qualitative analysis, but depending on the research question, different approaches are used. For example, if the aim is to characterise a Twitter user, one might combine all their tweets into a single document. If the focus is on the temporal development of discussion on Twitter, each tweet could instead be analysed as its own document.
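Stacking per-document word counts into a document-term matrix can be sketched as below. The two documents are hypothetical; in practice a library such as scikit-learn's `CountVectorizer` would typically build this matrix (as a sparse array) for you.

```python
from collections import Counter

# Hypothetical document collection; each string is one document
documents = [
    "alice follows the rabbit",
    "the rabbit checks the watch",
]

# Vocabulary: one column per distinct word, sorted for a stable order
vocabulary = sorted({word for doc in documents for word in doc.split()})

# One row per document, one count per vocabulary word
matrix = [[Counter(doc.split())[word] for word in vocabulary]
          for doc in documents]

print(vocabulary)  # ['alice', 'checks', 'follows', 'rabbit', 'the', 'watch']
print(matrix)      # [[1, 0, 1, 1, 1, 0], [0, 1, 0, 1, 2, 1]]
```

Note that the choice of what goes into `documents` (whole users, single tweets, and so on) is precisely the document-definition decision discussed above; the matrix-building step itself stays the same.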