We have mentioned in all chapters that real programmers rarely implement everything from scratch. Rather, they build on top of existing work. Building on libraries speeds up development, and library code is usually better optimised and more thoroughly tested than code written by individuals. For example, now that you know how to compute means and variances, you could write that code yourself. However, this is not a smart use of time and could lead to nasty mistakes. Instead, you should use the ready-made implementations already available in the software. In Python, these are available in a specific library called statistics, while in the more statistically oriented R, they are integrated into the language itself. To get access to a library, we need to import it, that is, state that we want to use it. In Python, this is done using the command import and in R using library. (Sometimes we may need to install libraries as well; installation depends on your computing system.)
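As a minimal illustration of this idea, the following Python sketch (with hypothetical sample values) uses the built-in statistics library instead of hand-written mean and variance code:

```python
# A minimal sketch: the standard library's statistics module replaces
# hand-written mean and variance code. The data values are hypothetical.
import statistics

data = [2.5, 3.1, 2.8, 3.6, 2.9]

mean = statistics.mean(data)      # arithmetic mean
var = statistics.variance(data)   # sample variance (n - 1 denominator)
print(mean, var)
```

Because these functions are widely used and tested, they also handle edge cases, such as an empty sample, more carefully than a quick hand-written loop typically would.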
Code Example 8.5 shows how one would solve the problem of comparing two experimental samples using ready-made functions wherever possible. As there is no ready-made function for plotting the values, we still reuse the plotting function developed for Code Example 8.4 as is. The major benefit is reduced work: we can focus only on the essential, unique parts that add the most value to solving the problem.
The real value of libraries emerges when they are used for problems more complex than means and variances. For example, in network analysis we identified many different indicators, such as degree and betweenness. We saw the code to compute degrees from a network, and it was rather long. Both R and Python have libraries that specialise in network analysis. These libraries can compute an indicator for all nodes with a single line of code (see Code Example 8.6), allowing you to focus on the actual research problem.
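To make the one-line idea concrete, here is a sketch in Python using the networkx library (one of several network analysis libraries; the chapter's own code examples may use a different one) on a small hypothetical network:

```python
import networkx as nx

# Hypothetical small friendship network (nodes and ties invented for illustration)
g = nx.Graph([("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")])

degrees = dict(g.degree())                  # degree of every node in one call
betweenness = nx.betweenness_centrality(g)  # betweenness of every node in one call
print(degrees)
```

Each indicator that earlier took a page of hand-written code is now a single library call applied to all nodes at once.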
Libraries have their own vocabulary that defines what functions they provide, what input parameters each function takes and what outcomes they produce. As Code Example 8.6 shows, with these network analysis libraries the command degree was used to compute degrees: it took the data set, formatted as a network, as a parameter and computed degrees for all nodes. What a library can do is described in its documentation. The documentation lists the functions available, provides details on their required input variables (also called arguments or parameters) and describes the expected outcomes. Often it also includes a few code examples illustrating what the package does.
Code Example 8.7 shows the full documentation for the function degree used in Code Example 8.6. It shows that a change such as stating degree( g , mode = 'out', loops = FALSE ) would alter the outcome by counting only outgoing ties and ignoring ties connecting a node to itself. (The documentation uses the term vertices when referring to nodes and the term edges when referring to ties.) Similarly, if you do not want raw degree numbers but degrees scaled to vary between 0 and 1, the degree needs to be normalised. The function allows this via the parameter normalized: degree( g , mode = 'out', loops = FALSE, normalized = TRUE ). Some of the arguments have default values, indicated in the argument list with = default value. Therefore, it is not necessary to specify which nodes we are interested in; by default, the function uses all the nodes in the graph.
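The same ideas, direction, self-loops and normalisation, can be sketched in Python with networkx. Note that this is only an analogue of, not the same API as, the R function documented in Code Example 8.7, and the network is invented for illustration:

```python
import networkx as nx

# Hypothetical directed network with one self-loop on node "c"
g = nx.DiGraph([("a", "b"), ("a", "c"), ("b", "c"), ("c", "c")])

# Outgoing ties only; unlike loops = FALSE in the R function,
# networkx counts self-loops in the degree by default.
out_deg = dict(g.out_degree())

# Out-degree scaled by 1 / (n - 1), analogous to normalized = TRUE
normalised = nx.out_degree_centrality(g)
print(out_deg)
```

Reading a function's documentation like this tells you which of these behaviours are defaults and which must be requested explicitly.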
If our aim is to study what predicts the happiness of Swedes according to the World Value Surveys (see Figure 3.3), the first step is to choose the relevant machine learning methods we want to examine and, after that, the libraries we want to use. For Python, we could use scikit-learn, a popular framework for machine learning tasks. Its documentation for decision tree learning is organised to include elaborated examples and discussion, which is another way these libraries help developers. The small example there focuses on classification tasks. Code Example 8.8 illustrates how the library could be used. Beyond that, the documentation includes an overall description of cases where one should use decision trees, other functionalities of the decision tree module, observations on the computational complexity of these tasks and practical tips for using decision trees.
As Code Example 8.8 shows, the classifier takes two parameters as input: X and Y. X is a matrix of features, storing the values of each feature for all units. Y is a list of correct values, or labels. (The function could take many more parameters as well, as illustrated in the full application programming interface (API) documentation.) Using the command clf.fit, we train clf with the training data; clf is a classifier based on decision trees. We store the trained model in the variable clf and can use it to predict classifications for previously unseen data. If we want to use this library, we need to format the data so that they have the correct X and Y, as shown in Code Example 8.9.
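As a concrete sketch of this workflow, using toy data invented for illustration rather than the World Value Surveys data, the scikit-learn pattern described above looks roughly like this:

```python
from sklearn.tree import DecisionTreeClassifier

# Toy data invented for illustration: four units, two features each.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]  # feature matrix
Y = [0, 0, 1, 1]                      # correct labels (here, determined by the first feature)

clf = DecisionTreeClassifier(random_state=0)
clf = clf.fit(X, Y)           # train the classifier on the training data

print(clf.predict([[1, 0]]))  # classify a unit using the trained model
```

The research effort then goes into formatting the real survey data into this X and Y shape, as Code Example 8.9 shows, rather than into implementing the learning algorithm itself.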
These two examples show how to work with documentation when using a library. Beyond documentation, a common strategy is to use search engines to find further details about how to use a package or how to resolve errors it produces. There is a huge number of available packages, and they constantly evolve. For convenience, Table 8.3 lists some of them for the four different methods discussed in this chapter.