Chapter 4 Data types and data structures

In this chapter, we will cover the following topics -

Data type
- Numbers
- Characters
- Logical
Data structure
- Vector
- Matrix
- Array
- Data frame
- List
Import and export data
- .csv
- .xlsx
Manipulating data

4.1 Useful data types in R

There are different data types used in R. We will touch upon a few, but heavily used, data types in this chapter. These are namely -

4.1.1 Numbers:

Numbers can be Integers, or can be floating numbers (decimal numbers). They are recognized as simple Numeric data type in R.

4.1.2 Character:

This type of data can be equivalent to categorical variable. In other programming language, this is equivalent to String or Text data type. Here in R, these can be nominal (just names) or ordinal (names with hidden order) variables.

4.1.3 Logical:

such as True or False. These can be the result of a boolean operation.

Let’s see an example of different data types -

participant.ID	Date	weight.in.kg	height.in.cm	smoker	Overall.health.condition
BMI1_001	01/11/2021	55.5	172	No	Intermediate
BMI1_002	02/11/2021	45.0	165	Yes	Good
BMI1_003	03/11/2021	75.0	150	No	Poor

Here in the table above, participant.ID, smoker, Overall.health.condition columns represent character data type. Of them, participant.ID and smoker columns contain nominal data, while the Overall.health.condition represents the ordinal data type.

The columns namely weight.in.kg and height.in.cm contain numeric data type.

A very very important information - Categorical data (nominal or ordinal) in R are also called Factors.

Overall.health.condition <- factor(c("Intermediate", "Good", "Poor", "Good", "Poor" ))
Overall.health.condition
#~~~~~~~~~~~~~~~~~~~~
Overall.health.condition <- factor(c("Intermediate", "Good", "Poor", "Good", "Poor"), 
                                   ordered = F)
Overall.health.condition
#~~~~~~~~~~~~~~~~~~~~
Overall.health.condition <- factor(c("Intermediate", "Good", "Poor", "Good", "Poor"), 
                                   ordered = T)
Overall.health.condition
#~~~~~~~~~~~~~~~~~~~~
Overall.health.condition <- factor(c("Intermediate", "Good", "Poor", "Good", "Poor"), 
                                   ordered = T, 
                                   levels = c("Poor", "Intermediate", "Good"))
Overall.health.condition

There are also complex data types, date data types etc, which we will not cover in this workshop.

4.2 Data structure

There are few useful pre-defined data structures that are heavily used in R. There are user defined data structures as well (there are no limits in this case). We will discuss a few of the former class -

4.2.1 Vector:

Let’s first think of a variable and value pair, like you have a variable x and it has a value of 2. Like what we write in R: x <- 2, which means, we assign the value 2 to x. This is equivalent to a scalar. You can put more than one value in a variable, and then it will be equivalent to a vector, which is storing more than one scalar values:

y <- c(1,2,3,4,5,6)

Here we have used the combine function c() to create a vector. Vector can hold different data-types individually, that means, one vector can hold only one type of data (numeric or character or logical etc). Just remember, if you want to hold multiple elements of same type (be it numeric values or characters), enclose them with the combine function (as above).

Each of the components of the vector is called its Element. You can call or refer the elements individually or as a group using their positional index. The positional index starts from 1, which you can refer to as one-based indexing. For example, if you want to call or retrieve the first element of vector y, you can type in y[1]. Similarly, you can refer to the 4th element by typing in y[4]. If you want to refer to from 2nd to the 4th elements of y, you type in y[c(2,3,4)] or you can use y[2:4].

A useful tip - for sequential numbers, you can use a short-cut using colon sign in between the minimum and maximum values (inclusive) of the range. Like, for 1 to 100, you type in 1:100.

4.2.1.1 Short exercise on Vector

Create a vector called my_vec and assign it with letters A b c D e. Then call third to fifth elements of the vector (on the Console). What happens if you call the sixth element of my_vec?

4.2.2 Matrix:

A matrix is a two-dimensional (row and column) array of a single data type. Matrices are created using matrix() function. Think of you having 20 numbers, from 1 to 20 and you want to create a matrix with 4 rows and 5 columns. So you type in:

myMatrix <- matrix(data=1:20,nrow = 4,ncol = 5)
myMatrix

##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    5    9   13   17
## [2,]    2    6   10   14   18
## [3,]    3    7   11   15   19
## [4,]    4    8   12   16   20

If you want to subset the matrix you can do it easily. Always remember, matrix is a m row by n column (m x n) structure which you define in R as [m,n] or [row, column]. If you want to choose the value in the first row and the first column (which also indicates the element numeric 1 in myMatrix), you type in myMatrix[1,1]. If you want to select first two rows and 2nd to 4th columns, you type in myMatrix[c(1,2), c(2,3,4)] or myMatrix[1:2, 2:4].

4.2.2.1 Short exercise on Matrix

Create a vector called english_alphabets with all 26 english alphabets.[Hint: letters returns all 26 english alphabets in lower-case]
Create a matrix with 13 rows and 2 columns that accommodates all those alphabets.

4.2.3 Array:

You can accommodate your data in more than 2 dimensions using an array. We will not cover it in this workshop.

4.2.4 Data frame:

For you, data frame is the main data type that will have a vast utility in this course. You will import data in this format and manipulate data and export data in this format. Therefore, it’s imperative that you understand it very carefully. A data frame is a two-dimensional array. It is more general than a matrix and can contain different data types in it at the same time. However, each column contains a single data type (of course). The function to create a data frame is data.frame() and an easy way to create a data frame (using same-size vectors) is -

myDataFrame <- data.frame(vector1, vector2, vector3))

Let’s see the same data frame again and observe carefully how it is done.

participant.ID=c("BMI1_001", "BMI1_002", "BMI1_003")
Date=c("01/11/2021", "01/11/2021", "01/11/2021")
weight.in.kg=c(55.5,45,75)
height.in.cm=c(172,165,150)
smoker=c("No", "Yes", "No")
Overall.health.condition=c("Intermediate", "Good", "Poor")

myDataFrame <- data.frame(participant.ID, Date, weight.in.kg, height.in.cm, smoker, Overall.health.condition)

myDataFrame

##   participant.ID       Date weight.in.kg height.in.cm smoker
## 1       BMI1_001 01/11/2021         55.5          172     No
## 2       BMI1_002 01/11/2021         45.0          165    Yes
## 3       BMI1_003 01/11/2021         75.0          150     No
##   Overall.health.condition
## 1             Intermediate
## 2                     Good
## 3                     Poor

Remember, each of the vectors defining each column in the data frame has to be of equal length. Now, if you want to subset the data frame as you did from matrix, you can use the same technique. For example, if you want to know the weight and height of first two patients, you type in myDataFrame[c(1,2),c(3,4)]. Interestingly in the case of data frame, you can use the column names to do that, like -

myDataFrame[c(1,2),c("weight.in.kg", "height.in.cm")]

Further more, there is another level of flexibility for selecting a column information by using dataframe$columnName notation as in our case myDataFrame$weight.in.kg.

Now, you can inquire what are the column names by utilizing colnames() function and for the row names - rownames(). Here, in this case, we see there is no row name, rather they are indexed with 1 2 3 as character.

rownames(myDataFrame)

## [1] "1" "2" "3"

we don’t have row names yet, but we can assign them with row1, row2 and row2 by typing in

rownames(myDataFrame) <-  c("row1", "row2", "row3")

For simplicity, we will change the column names as well to simply column1, column2 etc -

colnames(myDataFrame) <- c("column1","column2","column3","column4","column5","column6")

And, let’s see the current look of our data-frame -

myDataFrame

##       column1    column2 column3 column4 column5      column6
## row1 BMI1_001 01/11/2021    55.5     172      No Intermediate
## row2 BMI1_002 01/11/2021    45.0     165     Yes         Good
## row3 BMI1_003 01/11/2021    75.0     150      No         Poor

Now, if we want to subset using row and column names, we can type in -

subDataFrame <- myDataFrame[c("row1","row2"),c("column2", "column3")]
subDataFrame

##         column2 column3
## row1 01/11/2021    55.5
## row2 01/11/2021    45.0

Important nomenclature alert: In the realm of statistics, we refer a row as an observation and a column as a variable. In the realm of data analysts, they are records and fields, respectively. And, in the realm of Machine Learning, they are examples and attributes, respectively.

4.2.4.1 Short exercise on Data Frame

Re-create the same data-frame with a different name (of your choice) with the following vectors -

participant.ID=c("BMI1_001", "BMI1_002", "BMI1_003")
Date=c("01/11/2021", "01/11/2021", "01/11/2021")
weight.in.kg=c(55.5,45,75)
height.in.cm=c(172,165,150)
smoker=c("No", "Yes", "No")
Overall.health.condition=c("Intermediate", "Good", "Poor")

Subset the columns weight.in.kg and smoker with -

indexing and
column names

4.2.5 List:

In R, List is the (arguably) the most complex data structure which can hold different data types and data structures. Even, a list can hold another list. We will not cover this data structure in this workshop.

4.3 Import and export data

For this course, we will limit ourselves to import data from our personal computers. Before that, go to the following link https://doi.org/10.5281/zenodo.6452121, download an excel file called excell_iris_data.xlsx and save to your working directory. Do the same for the csv file as well.

The basic mode of reading a tabular dataset is the read.table() function. There are few others, like read.csv(), read.csv2(), read.ftable(), read.delim() etc. There are some basic parameter that you should be careful of -

file = /path/to/the/file
header = the default is set to TRUE, however, if the data doesn’t have a column name, set it to false.
sep = for read.table(), it is set to white space. For read.csv(), it is a comma. For read.csv2(), it’s set to a semi-colon. For read.delim() or read.delim2(), it’s is a tab "\t".
row.names = if there’s a column for row name (usually the first column in the raw data), specify with the column number and later get rid of that column using say, data[,column_number] <- NULL

Now we will read the csv file -

df_csv <- read.csv(file = "csv_iris_data.csv", row.names = 1)
head(df_csv)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Now that we have read our data-frame and checked few rows from the top of it, we want to export it to our computer -

write.csv(x = df_csv, file = "My_output_df.csv")

In the case of reading an excel file, either with .xls or .xlsx extension, we need to load a package called readxl.

# install.packages("readxl")
# library(readxl)
# df_excel <- read_excel('excell_iris_data.xlsx',sheet='Sheet1')
# head(df_excel)

Unfortunately, to write an excel file, we need another package -

# install.packages("xlsx")
# library(xlsx)
# write.xlsx(x=df_excel, "My_output_df.xlsx")

However, in some cases loading package xlsx in your R session may create an error message due to incompatibility with Java version on your computer (and that’s why I have commented out those two code snippet above). You may fix this problem my installing appropriate version of Java (there’s a solution on stackoverflow). Alternatively, you can load a fantastic package called rio and use its import() function to load your .xls or .xlsx files and export() function to write your output data in those formats as well -

# install.packages("rio")
library(rio)
df_excel <- import('excell_iris_data.xlsx',sheet='Sheet1')
head(df_excel)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Accordingly, for exporting (or writing) the file on your computer -

# install.packages("rio")
library(rio)
export(x=df_excel, "My_output_df.xlsx")

4.4 Manipulating data

First thing, I would do to check after loading a data-frame to the R environment is checking its dimension, look at the top and bottom of the dataset, look at the structure and summary of the data by following set of commands -

# check the dimension of the data
dim(df_csv)

## [1] 150   5

# check the head or tail
head(df_csv)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

tail(df_csv)

##     Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
## 145          6.7         3.3          5.7         2.5 virginica
## 146          6.7         3.0          5.2         2.3 virginica
## 147          6.3         2.5          5.0         1.9 virginica
## 148          6.5         3.0          5.2         2.0 virginica
## 149          6.2         3.4          5.4         2.3 virginica
## 150          5.9         3.0          5.1         1.8 virginica

# apply str() and summary() function 
str(df_csv)

## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : chr  "setosa" "setosa" "setosa" "setosa" ...

summary(df_csv)

##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##    Species         
##  Length:150        
##  Class :character  
##  Mode  :character  
##                    
##                    
##

We have covered how to change the row or column names in the section Data frame

Now we will see how to add new column or row to the data-frame. As we have seen that there are 150 rows in df_csv data-frame, therefore, we need a column that contains 150 data points. Let’s create a vector random_numbers that holds 150 random numbers -

set.seed(seed = 123)
random_numbers <- round(rnorm(150),2)
head(random_numbers)

## [1] -0.56 -0.23  1.56  0.07  0.13  1.72

length(random_numbers)

## [1] 150

Now, we will add this vector to the data-frame as a column -

new_df <- cbind(df_csv, random_numbers)
# alternatively,
#df_csv$random_number_column <- random_numbers

head(new_df)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species random_numbers
## 1          5.1         3.5          1.4         0.2  setosa          -0.56
## 2          4.9         3.0          1.4         0.2  setosa          -0.23
## 3          4.7         3.2          1.3         0.2  setosa           1.56
## 4          4.6         3.1          1.5         0.2  setosa           0.07
## 5          5.0         3.6          1.4         0.2  setosa           0.13
## 6          5.4         3.9          1.7         0.4  setosa           1.72

Similarly, to add a new row to the current data-frame, let’s create another vector of 6 matched data-type elements to match the six columns in the new_df and add to the end of the data-frame -

new_row <- c(mean(new_df$Sepal.Length), 
             mean(new_df$Sepal.Width),
             mean(new_df$Petal.Length),
             mean(new_df$Petal.Width),
             "random_sp",
             mean(new_df$random_numbers)
             )

new_df2 <- rbind(new_df, new_row)

tail(new_df2)

##         Sepal.Length      Sepal.Width Petal.Length      Petal.Width   Species
## 146              6.7                3          5.2              2.3 virginica
## 147              6.3              2.5            5              1.9 virginica
## 148              6.5                3          5.2                2 virginica
## 149              6.2              3.4          5.4              2.3 virginica
## 150              5.9                3          5.1              1.8 virginica
## 151 5.84333333333333 3.05733333333333        3.758 1.19933333333333 random_sp
##          random_numbers
## 146               -0.53
## 147               -1.46
## 148                0.69
## 149                 2.1
## 150               -1.29
## 151 -0.0244666666666667

Now, we will touch on another important topic today - dealing with missing data. Missing data in R are represented by the symbol NA (not available) or impossible values (i.e. dividing by zero) are represented by the symbol NaN (not a number). Here we will deal with only NA which is the most common in day-to-day R. If we have a vector x which contains one or more NA values, we use is.na() function to inquire about NA values-

x <- c(1,2,NA,3,4,NA,5,6)
is.na(x)

## [1] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE

any(is.na(x))

## [1] TRUE

The problem with the missing value(s) is that it hampers the simple operations, like - calculating the mean or sum etc -

sum(x)

## [1] NA

mean(x)

## [1] NA

However, if you include a magic argument na.rm=TRUE, then the problem can be overcome and those operations can be performed on rest of the valid elements of the vector x -

sum(x, na.rm = TRUE)

## [1] 21

mean(x, na.rm = TRUE)

## [1] 3.5

However, a clever way to handle the missing data points is to impute them, and a basic way of doing so is to take the mean of the rest of the data points, and replace the NA values with the mean -

x[is.na(x)] <- mean(x, na.rm = T)

Obviously, there are more complex ways to impute the missing data, we will not cover those in this workshop.

4.4.0.1 Short exercise on manipulating data

Please do the following steps -

Download the file named “csv_iris_data_NA.csv” from https://doi.org/10.5281/zenodo.6452121 and import it on your R environment.
Find out how many data points are missing.
Impute the missing data points with the median value of the respective columns. Round the values of the imputed data points to 2 decimal points using the function round()
What are the sums of each column (with numeric values)?