Beyond algorithmic black boxes, software is occasionally found not to do what it was intended to do. Such faults are known as bugs. For example, any time your code does not do what you hoped to achieve, you have created a bug. Bugs are not only a junior programmer's problem; they also occur in business settings, with consequences ranging from minor to serious. Since the 1980s, software bugs have even caused the loss of lives (Leveson and Turner, 1993), and academic research is not immune to them either (Eklund et al., 2016). Table 10.2 shows what kinds of bugs one might create.
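To make the idea concrete, the following minimal sketch shows a hypothetical off-by-one bug; the function and data are invented purely for illustration and do not come from Table 10.2.

```python
# Hypothetical example: a single comparison operator turns working code into a bug.
def count_long_words(words, min_length=5):
    """Count words with at least `min_length` characters."""
    count = 0
    for word in words:
        # Bug: '>' silently excludes words of exactly min_length characters;
        # the intended condition, matching the docstring, is '>='.
        if len(word) > min_length:
            count += 1
    return count

print(count_long_words(["maple", "fir", "sycamore"]))  # prints 1, although 2 was intended
```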
The existence of bugs and errors creates challenges for all software-intensive sciences. For Symons and Horner (2014), the main challenge is the complexity of the software: programs can contain many if-conditions and other branching points, and the number of potential execution paths grows with each of them. Estimating the errors therefore becomes tricky, as it is challenging to test all potential paths correctly. In practice, it is unclear whether exhaustive testing is necessary to determine that the software functions correctly, or whether other kinds of tests can support software testing as well. We discussed clarity (Section 8.4.1) and automated testing tools (Section ), which are industry best practices for finding such issues.
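The sketch below illustrates this point about branching: with k independent if-conditions a program can have up to 2^k execution paths, and an automated test suite can only ever cover some of them explicitly. The function and assertions are hypothetical and written in Python purely for illustration.

```python
# A minimal sketch of why branches multiply test cases:
# with k independent if-conditions there can be up to 2**k execution paths.
def categorise(age, income, is_student):
    label = "standard"
    if age < 25:
        label = "young"
    if income < 20000:
        label += "-low-income"
    if is_student:
        label += "-student"
    return label

# Three independent conditions already give up to 2**3 = 8 paths to check.
# Automated tests (here plain assertions) make a few of them explicit:
assert categorise(30, 50000, False) == "standard"
assert categorise(22, 15000, True) == "young-low-income-student"
```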
Such mistakes can also occur beyond the software. It is not rare to work with broken data (Pink et al., 2018), caused by missing records or faulty research instruments. These lead to incomplete data, which, like mistakes in software programs, limits the validity of the analysis results. For example, Twitter application programming interfaces (APIs) used for data collection do not provide representative samples of the whole content (Morstatter et al., 2014). Data dumps, even those coming from organisations that store and manage the data, can be broken and require checking and examination to understand what kinds of mistakes they may contain (Pink et al., 2018). The challenge is not always that data sets are missing content; sometimes there is extra content. For example, a data dump can easily include duplicate entries caused by mistakes in the data collection process. Similarly, there can be content that we would consider noise. For example, in online conversation data that I helped to examine, we found a lot of discussion about trains, which was unexpected. After exploring this further, we observed that the online conversation platform had bots that posted content to the discussions. One of these bots, called Lunch Train, posted regular messages about the team's lunch plans, mimicking train schedules. Similarly, in political communication research, some scholars are concerned about the degree to which they might end up studying only bots and how that affects the interpretation of the results.
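A hedged sketch of the kind of checks this implies is given below: it looks for duplicate entries and removes posts from a known bot account. The column names, the bot name and the pandas-based approach are assumptions made for illustration, not a description of the actual data set discussed above.

```python
import pandas as pd

# Hypothetical conversation data with a duplicated record and a bot author.
messages = pd.DataFrame({
    "message_id": [1, 2, 2, 3],
    "author": ["alice", "lunch-train-bot", "lunch-train-bot", "bob"],
    "text": ["hello", "Lunch Train departs at 11:30",
             "Lunch Train departs at 11:30", "see you"],
})

# Duplicate entries introduced by the collection process.
duplicates = messages[messages.duplicated(subset="message_id", keep=False)]
print(f"{len(duplicates)} rows share a message_id with another row")

# Remove known automated accounts before analysing the conversation.
known_bots = {"lunch-train-bot"}
humans_only = (messages[~messages["author"].isin(known_bots)]
               .drop_duplicates("message_id"))
print(humans_only)
```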
Because of challenges like these, practitioners spend a significant amount of time understanding their data. A rule of thumb in industry settings is that about 80% of the time goes into collecting, cleaning, managing and pre-processing the data, and about 20% into conducting the analysis. Data can have issues ranging from missing records or missing values to content that should not be there at all. 'Rubbish in, rubbish out' is a principle highlighting that any method (computational or non-computational) will produce nonsense results if the input is nonsense. Running any fancy analysis on rubbish, such as badly broken data, will produce rubbish as well. However, identifying and resolving the issues that have broken the data is a time-consuming task. That said, it is less time-consuming than arguing with peer reviewers or colleagues about the problems the data may have and their implications for the results, at least in my experience. Therefore, when considering computational social science methods, one needs not only to understand the context but also to check that the data truly are what they should be.
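As a final illustration, the sketch below shows the kind of basic sanity checks meant by checking that the data truly are what they should be: counting missing values and flagging implausible values before any analysis. The column names and the plausible age range are illustrative assumptions.

```python
import pandas as pd

# Hypothetical survey data with a missing value and an implausible value.
survey = pd.DataFrame({
    "respondent_id": [1, 2, 3, 4],
    "age": [34, None, 180, 29],
    "country": ["FI", "SE", None, "NO"],
})

# Missing values per column.
print(survey.isna().sum())

# Values outside a plausible range ("rubbish" that would silently distort results).
implausible = survey[(survey["age"] < 0) | (survey["age"] > 120)]
print(implausible)
```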