The data we are interested in are often stored not in the program code itself but in other, external files. These may include, for example, Microsoft Word or Excel documents, webpages or text files. At this stage, we focus on a simple data format known as comma-separated values, or CSV. Files of this type can be read easily by a computer, while Microsoft Word documents, Microsoft Excel documents and webpages require more extensive processing. CSV files are similar to spreadsheets: they have rows and columns. Each row is on its own line, and the columns, which store the values, are separated by commas.
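To make the row-and-column structure concrete, the following minimal Python sketch takes a single line of a CSV file and separates it into its column values. The sample line and the assumed column order (name, birth year, death year) are illustrative assumptions, not taken from an actual data file.

```python
# One line of a hypothetical CSV file.
# Assumed column order: name, birth year, death year.
line = "Trajan,53,117"

# Splitting on commas yields the column values as strings.
values = line.split(",")

name = values[0]
birth_year = int(values[1])  # convert the text "53" to the number 53
death_year = int(values[2])

print(name, birth_year, death_year)
```

Note that every value arrives as text, so numbers must be converted before they can be used in calculations.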
Computationally, the code reads each line separately and does something with the values stored in that line. After finishing that line, the code continues to the next line until all lines of the CSV file have been processed. As we are repeating something, reading files (with several lines) utilises the for-structure. Code Example 2.7 illustrates this idea. We go through a file of Roman emperors and extract the name, birth year and death year of each emperor. Each emperor is on their own line.
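As a rough sketch of the line-by-line idea, the Python code below loops over a small set of CSV lines and computes something from each one. It assumes three columns (name, birth year, death year) and uses an in-memory stand-in for the file so that it runs on its own. The file name mentioned in the comment is hypothetical.

```python
import csv
import io

# Stand-in for an external file so the sketch is self-contained.
# With a real file, this would be something like open("emperors.csv"),
# where the file name is a hypothetical example.
sample = io.StringIO(
    "Trajan,53,117\n"
    "Hadrian,76,138\n"
    "Marcus Aurelius,121,180\n"
)

ages = []
# csv.reader hands over one line at a time, already split at the commas.
for row in csv.reader(sample):
    name, birth, death = row
    age = int(death) - int(birth)  # approximate age at death
    ages.append((name, age))
    print(name, "lived about", age, "years")
```

The body of the loop is written once, but it is run once for every line of the data.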
However, the example about repeating allowed us only to repeat the exact same command again and again. This does not work for reading files because each line has different values (separated by commas). We must access the values of the specific line being examined, line by line. For this purpose we must use the iterator. The iterator is a variable whose value changes for each round, or iteration, of the for-loop. So, when reading the CSV file, the iterator first has the value of the first line in the CSV file, then the value of the second line in the CSV file and so on. In the case of CSV files, Python and R handle iterators in slightly different ways. In Python, the iterator stores values; that is, it stores the content of the line currently being processed. In R, instead, the iterator stores the current line number, and we must collect the content of that line separately. (In R, there is a shorthand for this: we can also access a whole column of the CSV file using the $ notation. For example, the command data$name <- data$V1 creates a variable called name that, for all rows, holds the value of the first column of the data file. This may be used in some data manipulation exercises in a neat manner.)
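The difference between the two iterator styles can be sketched in Python alone. The first loop lets the iterator hold the content of each line. The second mimics the line-number style described for R by iterating over positions and looking up the content separately. The sample lines and variable names are illustrative assumptions.

```python
# Two lines of hypothetical CSV data, already read into a list.
lines = ["Trajan,53,117", "Hadrian,76,138"]

# Style 1: the iterator holds the content of the current line.
names_by_value = []
for line in lines:
    names_by_value.append(line.split(",")[0])

# Style 2: the iterator holds a position (line number), and the content
# of that line is collected separately. Note that Python counts from 0,
# whereas R counts from 1.
names_by_index = []
for i in range(len(lines)):
    current_line = lines[i]
    names_by_index.append(current_line.split(",")[0])

print(names_by_value)
print(names_by_index)
```

Both loops collect the same names; they differ only in what the iterator stores on each round.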