Computers are powerful calculation machines, capable of doing billions of computations per second. However, there are limits to what they can do. Sometimes computation takes a lot of time, and some tasks are even theoretically impossible to solve with any computer. To illustrate, when the number of items on a list grows, it takes more time to go through them. This puts limits on what can be solved using computers.
Sometimes the reason is the raw amount of data. However, it can be - and based on my experience, it often is - more related to how the problem is approached. Sometimes code is inefficient and requires rethinking the solution strategy. For example, I was once working to identify similar images from a data set of 10,000 images. I originally thought that the task was simple; running pairwise comparisons of all the images is easy to do. I tested my code, similar to Code Example 5.16a, with a bunch of images, and it seemed to work well. I let a computer run the code for two days, but as it had not finished, I got worried. As I was comparing each of the 10,000 images with every other image, I was conducting 10,000 × 10,000 comparisons, a total of 100 million comparisons. On closer analysis of the task, I noticed that each image comparison took about 10 seconds. Therefore, the analysis would have taken about 31 years to complete, give or take a few days.
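The brute-force strategy can be sketched as follows. This is not the published Code Example 5.16a but a minimal sketch of the same idea; `compare_images()` and the `images` list are hypothetical placeholders for the actual comparison logic and data.

```python
def compare_images(image1, image2):
    """Placeholder for a slow pairwise image comparison (roughly ten seconds per call)."""
    return image1 == image2  # stand-in: a real version would compare the image contents

images = ["img_0001.jpg", "img_0002.jpg", "img_0003.jpg"]  # stand-in for 10,000 image files

similar_pairs = []
for image1 in images:        # every image ...
    for image2 in images:    # ... is compared with every image, including itself
        if compare_images(image1, image2):
            similar_pairs.append((image1, image2))
```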
This does not mean that the task was impossible (we later published results from the analysis in Hokka and Nelimarkka, 2020). It only meant that I was approaching the problem in the wrong way. I needed to re-think my strategy (see Code Example 5.16b). Instead of comparing each image pairwise, using about 10 seconds per comparison, I would summarise the images through a digital fingerprint, a textual representation of the image and its content. Computing a fingerprint similarly takes about 10 seconds per image, so the 10,000 images took about 28 hours in total. However, as I stored the image fingerprints, I did not need to compare each image with every other image (which was slow). Rather, I compared textual representations with other textual representations, which is fast, and using dictionaries makes it even more efficient. The computations were finished in about a day, and we could proceed with the analysis.
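The fingerprint-based strategy might look roughly like the sketch below. Again, this is not the published Code Example 5.16b; `fingerprint()` is a hypothetical stand-in for whatever digital fingerprinting method is actually used, and the essential point is that each image is processed only once and the stored fingerprints are compared through a dictionary.

```python
def fingerprint(image):
    """Placeholder for a slow (roughly ten seconds) digital fingerprinting step."""
    return image.lower()  # stand-in: a real version might compute e.g. a perceptual hash

images = ["img_0001.jpg", "img_0002.jpg", "img_0003.jpg"]  # stand-in for 10,000 image files

# First for-loop: fingerprint each image once and group images by fingerprint in a dictionary.
images_by_fingerprint = {}
for image in images:
    fp = fingerprint(image)
    images_by_fingerprint.setdefault(fp, []).append(image)

# Second for-loop: images that share a fingerprint are candidates for being similar.
similar_groups = []
for group in images_by_fingerprint.values():
    if len(group) > 1:
        similar_groups.append(group)
```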
To understand these kinds of factors, computer scientists study algorithmic complexity. They analyse algorithms to understand how much time and memory each algorithm requires and how increasing the data size impacts them. For example, Code Example 5.16a would have processed 100 images in about a day. With 100,000 images, it would have taken about 3,000 years. Data size is clearly an important factor to consider - and if the data are small enough, even an inefficient solution might work. One day is not that long!
Algorithmic complexity is approached by examining orders of growth: asking how much the time or memory an algorithm uses increases when the data size increases. In the best case, the algorithm does not depend on the size of the data. For example, picking a value from a list based on its index or from a dictionary based on its key does not depend on the size of the list or dictionary. These are called constant-time algorithms, but they rarely exist in more realistic programs. More commonly, increasing the data increases the running time. If we seek to find out whether a value is stored in a list, we might iterate through each value in the list and check if it is the value we are looking for. Every time we increase the list by one, we would need to do one more check. The number of steps the algorithm must take increases linearly with the size of the list. This is the usual case every time you have one for-loop in the program. If the program appears to be slow, it is possible to optimise the performance. Sometimes optimisation relates to choosing better data types. If we used a dictionary instead of a list, checking whether a potential key has a value is a constant-time operation. Other times we can make further assumptions and use these to improve the performance. If we can assume that the list is sorted (or we sort the list), we can use this knowledge to benefit the computation. Instead of going through all values, we could use a binary search strategy: cut the list in half and identify on which side of the half our value should be if it exists. As the list is ordered, on each step we can rule out half of the list as irrelevant for our analysis.
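To make these strategies concrete, the sketch below contrasts a linear search through a list, a binary search on a sorted list (using Python's standard-library bisect module) and a dictionary lookup; the values are illustrative only.

```python
from bisect import bisect_left

values = [3, 8, 15, 21, 42, 57, 63, 70]        # a sorted list
lookup = {value: True for value in values}      # the same data stored in a dictionary

def linear_contains(items, target):
    """Linear search: one check per item, so the work grows with the list length."""
    for item in items:
        if item == target:
            return True
    return False

def binary_contains(sorted_items, target):
    """Binary search: each step rules out half of the remaining list."""
    position = bisect_left(sorted_items, target)
    return position < len(sorted_items) and sorted_items[position] == target

# Dictionary lookup is a (roughly) constant-time operation regardless of the data size.
print(linear_contains(values, 42), binary_contains(values, 42), 42 in lookup)
```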
Both the constant order of growth [written as O(1) by computer scientists] and the linear order of growth [O(n), where n indicates the data size] behave rather nicely - it is hard to end up with 3,000 years of computation unless n is extremely high. What makes Code Example 5.16a difficult is that it does not grow linearly. If we have 10 images, we would conduct 10 × 10 = 100 image comparisons. With 11 images, the number of operations jumps to 121. The newly added image must be compared to the other images, leading to 10 additional comparisons (i.e. the new image will be considered as image2). Furthermore, all images must be compared with the new image (the new image is image1), which is 11 comparisons (as the image is also compared with itself). In total, adding one more image required 21 new comparisons. With 12 images it is already 144.
Code Example 5.16b grows linearly [O(n)]. When an image is added, the new image must be fingerprinted once (it goes through the fingerprinting for-loop once). There may be a modest increase in the second for-loop as well if a new fingerprint is added. In total, adding one image requires two additional computational operations. When increasing the number of images from 11 to 12, similarly only two additional computational operations are required. The growth is linear: every new image requires two additional operations. Instead, for Code Example 5.16a, the growth is quadratic, or O(n²). (Note how there are two for-loops nested. This is a clear indicator that the growth is not linear in this case.) Quadratic growth and higher polynomial growths like n³ can surprise with high values of n. A good way to approximate algorithmic complexities is to choose a few growing values of n (say, 10, 100, 1,000) and compute the number of operations each requires, as the sketch below does.
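A minimal sketch of this kind of back-of-the-envelope calculation, assuming roughly two operations per image for the linear approach and n × n operations for the pairwise approach:

```python
# Back-of-the-envelope comparison of the linear and quadratic approaches.
for n in (10, 100, 1_000, 10_000):
    linear_operations = 2 * n        # roughly two extra operations per added image
    quadratic_operations = n * n     # every image compared with every image
    print(f"n = {n:>6}: linear ~ {linear_operations:>8}, quadratic ~ {quadratic_operations:>12}")
```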
Sometimes an algorithm can grow even faster. The infamous example is the Travelling Salesman Problem: if there are n cities to visit, in which order should one visit them so that the distance travelled is shortest? The naïve approach to this problem is to examine all possible paths and store them. There are n options for the first city; from the chosen city, there are n − 1 options of where to go next, from which there are n − 2 other cities to travel to, and so on. The total number of potential paths is therefore n × (n − 1) × (n − 2) × … × 1, that is, n! (n factorial). With this approach, the number of paths grows fast. With five cities, there are 120 different potential paths, and with seven cities, the number of potential paths is already 5,040. Formally, this is an O(n!) algorithm. There are techniques to avoid the rapid increase of cases using heuristics, such as making assumptions and acknowledging that those assumptions mean that the program might not find the best solution.
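A naïve enumeration of every visiting order can be sketched in a few lines; the city names and distances below are made up, but the structure shows where the factorial growth comes from.

```python
from itertools import permutations

# Made-up pairwise distances between four cities.
distances = {
    ("A", "B"): 4, ("A", "C"): 7, ("A", "D"): 3,
    ("B", "C"): 2, ("B", "D"): 5, ("C", "D"): 6,
}

def distance(city1, city2):
    """Look up a distance regardless of the order of the two cities."""
    return distances.get((city1, city2), distances.get((city2, city1)))

def path_length(path):
    """Sum the distances between consecutive cities on a path."""
    return sum(distance(a, b) for a, b in zip(path, path[1:]))

cities = ["A", "B", "C", "D"]

# permutations() produces every possible visiting order: n! of them in total.
best_path = min(permutations(cities), key=path_length)
print(best_path, path_length(best_path))
```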
In the context of this book, we do not focus on how to develop such heuristics or to take algorithmic analysis further. The reason is that these issues are often not the bottleneck for computational social science projects. First, the size of the data, while large compared to traditional social science projects, can be small for computers. Second, researchers often have looser requirements for software performance. I might have used the inefficient version of image comparison for a few hundred images, even though this would take a few days while the efficient solution could process them in less than an hour. Why? Usually we are not in a hurry for the results. That said, it is good to know that computers' abilities have limits - both computing time and available memory constrain what can be done.
I would be cautious about seeking to make the code faster, that is, optimising it. First, it is unclear if it is truly needed. With the small datasets we often work with, it is rare that the limits of a computer's processing power constrain the research. (Often, instead of us writing our own highly efficient code, the difficult bits are handled by libraries others have already developed.) Second, when optimising code, one can create unforeseen errors by trying to be overly clever. However, sometimes there is a need to rethink the solution strategy. Quite often, this literally means that one must remove nested for-loops. As the example showed, small changes in the approach may help bring 31 years of computation down to one day. However, sometimes intuition is not sufficient. One can also profile the program to understand what causes bottlenecks in its execution. For example, writing files is a `slow' operation. Therefore, a for-loop that includes writing to a file can be slower than a for-loop storing items to a list and writing the list to a file in a single go. This example shows some of the difficulties that one may encounter when optimising code. So, if the program appears to take a long time, carefully think about what may be the root cause of the slowness, back this up by observing the code and identifying what causes the slowness and only then take action to reconsider the solution strategy.
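As an illustration of the file-writing example above, the two functions in the sketch below do the same work: the first opens and writes the file on every iteration, while the second collects the results in a list and writes them in a single go. The timing with time.perf_counter is a quick-and-dirty alternative to a proper profiler such as Python's built-in cProfile; the file name and data are made up.

```python
import time

items = [str(i) for i in range(10_000)]  # made-up data to write out

def write_inside_loop(filename):
    # Opening and writing the file on every iteration: many slow operations.
    for item in items:
        with open(filename, "a") as f:
            f.write(item + "\n")

def write_in_one_go(filename):
    # Collect the results in a list (fast) and write the file once (one slow operation).
    lines = []
    for item in items:
        lines.append(item + "\n")
    with open(filename, "w") as f:
        f.writelines(lines)

for writer in (write_inside_loop, write_in_one_go):
    start = time.perf_counter()
    writer("results.txt")
    print(writer.__name__, time.perf_counter() - start, "seconds")
```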