Chapter 4 Data types and data structures
In this chapter, we will cover the following topics:
- Data type
  - Numbers
  - Characters
  - Logical
- Data structure
  - Vector
  - Matrix
  - Array
  - Data frame
  - List
- Import and export data
  - .csv
  - .xlsx
- Manipulating data
4.1 Useful data types in R
R has several data types. In this chapter we touch on a few that are heavily used:
4.1.1 Numbers:
Numbers can be integers or floating-point (decimal) numbers. R recognizes them as the numeric data type.
4.1.2 Character:
This type of data holds text. In other programming languages, it is equivalent to the String or Text data type. In R, character data often represent a categorical variable, which can be nominal (just names) or ordinal (names with an implied order).
4.1.3 Logical:
Logical values are TRUE or FALSE (written in capitals in R). They are typically the result of a boolean operation.
Let’s see an example of different data types -
participant.ID | Date | weight.in.kg | height.in.cm | smoker | Overall.health.condition |
---|---|---|---|---|---|
BMI1_001 | 01/11/2021 | 55.5 | 172 | No | Intermediate |
BMI1_002 | 02/11/2021 | 45.0 | 165 | Yes | Good |
BMI1_003 | 03/11/2021 | 75.0 | 150 | No | Poor |
Here in the table above, participant.ID, smoker, Overall.health.condition columns represent character data type. Of them, participant.ID and smoker columns contain nominal data, while the Overall.health.condition represents the ordinal data type.
The columns namely weight.in.kg and height.in.cm contain numeric data type.
A very important point: categorical data (nominal or ordinal) in R are also called factors.
# Default: an unordered factor; levels are sorted alphabetically
Overall.health.condition <- factor(c("Intermediate", "Good", "Poor", "Good", "Poor"))
Overall.health.condition
# Explicitly unordered (same result as the default)
Overall.health.condition <- factor(c("Intermediate", "Good", "Poor", "Good", "Poor"),
                                   ordered = FALSE)
Overall.health.condition
# Ordered, but still with the alphabetical level order (Good < Intermediate < Poor)
Overall.health.condition <- factor(c("Intermediate", "Good", "Poor", "Good", "Poor"),
                                   ordered = TRUE)
Overall.health.condition
# Ordered with an explicit, meaningful level order (Poor < Intermediate < Good)
Overall.health.condition <- factor(c("Intermediate", "Good", "Poor", "Good", "Poor"),
                                   ordered = TRUE,
                                   levels = c("Poor", "Intermediate", "Good"))
Overall.health.condition
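To see what the `ordered` and `levels` arguments change in practice, the following minimal sketch inspects the last factor above. The variable name `health` is chosen here only for illustration.

```r
# Ordered factor with an explicit level order (Poor < Intermediate < Good)
health <- factor(c("Intermediate", "Good", "Poor", "Good", "Poor"),
                 ordered = TRUE,
                 levels = c("Poor", "Intermediate", "Good"))

levels(health)       # the level order as declared
as.integer(health)   # underlying integer codes: position of each value in levels
health < "Good"      # order comparisons work only for ordered factors
table(health)        # counts per level, reported in level order
```

Without the `levels` argument, R would sort the levels alphabetically, which would place "Good" below "Intermediate".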
There are also other data types, such as complex and date, which we will not cover in this workshop.
4.2 Data structure
There are a few useful predefined data structures that are heavily used in R. There are user-defined data structures as well, with no limit on what can be built. We will discuss a few of the former class:
4.2.1 Vector:
Let’s first think of a variable and value pair: you have a variable `x` and it has a value of `2`. In R we write `x <- 2`, which means we assign the value `2` to `x`. This is equivalent to a scalar.
You can put more than one value in a variable, and then it is equivalent to a vector, which stores more than one scalar value:
y <- c(1,2,3,4,5,6)
Here we have used the combine function `c()` to create a vector. A vector can hold any of the data types, but only one at a time: all elements of a vector must be of the same type (numeric, character, logical etc). Just remember, if you want to hold multiple elements of the same type (be it numeric values or characters), enclose them in the combine function (as above).
Each component of the vector is called an element. You can refer to the elements individually or as a group using their positional index. The positional index starts from `1`, which is known as one-based indexing. For example, if you want to retrieve the first element of vector `y`, you can type in `y[1]`. Similarly, you can refer to the 4th element by typing in `y[4]`. If you want to refer to the 2nd to 4th elements of `y`, you type in `y[c(2,3,4)]` or you can use `y[2:4]`.
A useful tip: for sequential numbers, you can use a shortcut by placing a colon between the minimum and maximum values (inclusive) of the range. For example, for 1 to 100, you type in `1:100`.
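The indexing rules above can be tried out with a short sketch like the following. The vector `mixed` is a hypothetical example added here to show what happens when types are mixed.

```r
y <- c(1, 2, 3, 4, 5, 6)

y[1]            # first element
y[4]            # fourth element
y[c(2, 3, 4)]   # second to fourth elements
y[2:4]          # same selection using the colon shortcut
length(1:100)   # 1:100 expands to 100 sequential integers

# Mixing types in c() does not fail; R silently converts to one common type
mixed <- c(1, "a", TRUE)
class(mixed)    # "character"
```

The last two lines illustrate why a vector holds only one data type: the number and the logical value are converted to text so that all elements match.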
4.2.2 Matrix:
A matrix is a two-dimensional (row and column) array of a single data type. Matrices are created using the `matrix()` function. Suppose you have 20 numbers, from 1 to 20, and you want to create a matrix with 4 rows and 5 columns. Calling `matrix()` on those numbers with 4 rows and 5 columns, and storing the result as `myMatrix`, gives:
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 5 9 13 17
## [2,] 2 6 10 14 18
## [3,] 3 7 11 15 19
## [4,] 4 8 12 16 20
If you want to subset the matrix, you can do it easily. Always remember, a matrix is an `m` row by `n` column `(m x n)` structure, which you index in R as `[m,n]` or `[row, column]`. If you want to choose the value in the first row and the first column (which is the number `1` in myMatrix), you type in `myMatrix[1,1]`. If you want to select the first two rows and the 2nd to 4th columns, you type in `myMatrix[c(1,2), c(2,3,4)]` or `myMatrix[1:2, 2:4]`.
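The matrix and the subsetting calls described in this section can be reproduced with the sketch below. The `matrix()` call is an assumption: it relies on the default column-by-column fill, which is consistent with the printed output above.

```r
# Fill the numbers 1 to 20 column by column into 4 rows and 5 columns
myMatrix <- matrix(1:20, nrow = 4, ncol = 5)
myMatrix

myMatrix[1, 1]       # value in the first row, first column
myMatrix[1:2, 2:4]   # first two rows, 2nd to 4th columns
myMatrix[, 3]        # leaving the row index empty selects the whole third column
dim(myMatrix)        # number of rows and columns
```

If you prefer the numbers to run along the rows instead, `matrix()` accepts `byrow = TRUE`.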
4.2.3 Array:
You can accommodate your data in more than 2 dimensions using an array. We will not cover it in this workshop.
4.2.4 Data frame:
For you, the data frame is the main data structure and will have vast utility in this course. You will import data in this format, manipulate data in it and export data in it. Therefore, it’s imperative that you understand it well. A data frame is a two-dimensional structure. It is more general than a matrix and can contain different data types at the same time. However, each column contains a single data type. The function to create a data frame is `data.frame()` and an easy way to create a data frame (using same-size vectors) is -
myDataFrame <- data.frame(vector1, vector2, vector3)
Let’s see the same data frame again and observe carefully how it is done.
participant.ID=c("BMI1_001", "BMI1_002", "BMI1_003")
Date=c("01/11/2021", "01/11/2021", "01/11/2021")
weight.in.kg=c(55.5,45,75)
height.in.cm=c(172,165,150)
smoker=c("No", "Yes", "No")
Overall.health.condition=c("Intermediate", "Good", "Poor")
myDataFrame <- data.frame(participant.ID, Date, weight.in.kg, height.in.cm, smoker, Overall.health.condition)
myDataFrame
## participant.ID Date weight.in.kg height.in.cm smoker
## 1 BMI1_001 01/11/2021 55.5 172 No
## 2 BMI1_002 01/11/2021 45.0 165 Yes
## 3 BMI1_003 01/11/2021 75.0 150 No
## Overall.health.condition
## 1 Intermediate
## 2 Good
## 3 Poor
Remember, the vectors defining the columns of the data frame all have to be of equal length. Now, if you want to subset the data frame as you did with the matrix, you can use the same technique. For example, if you want to know the weight and height of the first two participants, you type in `myDataFrame[c(1,2),c(3,4)]`. Interestingly, in the case of a data frame you can also use the column names to do that, like -
myDataFrame[c(1,2),c("weight.in.kg", "height.in.cm")]
Furthermore, there is another level of flexibility for selecting a single column: the `dataframe$columnName` notation, as in our case `myDataFrame$weight.in.kg`.
Now, you can ask for the column names with the `colnames()` function and for the row names with `rownames()`. In this case we have not set any row names, so they default to the indices 1, 2, 3, stored as character strings:
## [1] "1" "2" "3"
We don’t have meaningful row names yet, but we can set them to `row1`, `row2` and `row3` by assigning a character vector to `rownames(myDataFrame)`.
For simplicity, we will change the column names as well, to simply `column1`, `column2` etc, by assigning a character vector to `colnames(myDataFrame)`.
And, let’s see the current look of our data-frame -
## column1 column2 column3 column4 column5 column6
## row1 BMI1_001 01/11/2021 55.5 172 No Intermediate
## row2 BMI1_002 01/11/2021 45.0 165 Yes Good
## row3 BMI1_003 01/11/2021 75.0 150 No Poor
Now, if we want to subset using row and column names, we can type in -
## column2 column3
## row1 01/11/2021 55.5
## row2 01/11/2021 45.0
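The calls behind the outputs in this part of the section are not shown. A minimal sketch that would produce them, assuming the data frame built earlier and using `paste0()` as one convenient way to generate the new column names, is:

```r
# Rebuild the example data frame so this block runs on its own
myDataFrame <- data.frame(
  participant.ID = c("BMI1_001", "BMI1_002", "BMI1_003"),
  Date = c("01/11/2021", "01/11/2021", "01/11/2021"),
  weight.in.kg = c(55.5, 45, 75),
  height.in.cm = c(172, 165, 150),
  smoker = c("No", "Yes", "No"),
  Overall.health.condition = c("Intermediate", "Good", "Poor")
)

colnames(myDataFrame)   # current column names
rownames(myDataFrame)   # default row names: "1" "2" "3"

# Assign new row and column names
rownames(myDataFrame) <- c("row1", "row2", "row3")
colnames(myDataFrame) <- paste0("column", 1:6)
myDataFrame

# Subset by row and column names
myDataFrame[c("row1", "row2"), c("column2", "column3")]
```

Subsetting by name is usually more robust than subsetting by position, because it keeps working if columns are reordered.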
Important nomenclature alert: In the realm of statistics, we refer to a row as an observation and a column as a variable. Data analysts call them records and fields, respectively. And in the realm of Machine Learning, they are examples and attributes, respectively.
4.2.4.1 Short exercise on Data Frame
- Re-create the same data-frame with a different name (of your choice) with the following vectors -
participant.ID=c("BMI1_001", "BMI1_002", "BMI1_003")
Date=c("01/11/2021", "01/11/2021", "01/11/2021")
weight.in.kg=c(55.5,45,75)
height.in.cm=c(172,165,150)
smoker=c("No", "Yes", "No")
Overall.health.condition=c("Intermediate", "Good", "Poor")
- Subset the columns `weight.in.kg` and `smoker` with -
  - indexing and
  - column names
4.3 Import and export data
For this course, we will limit ourselves to importing data from our personal computers. Before that, go to the following link https://doi.org/10.5281/zenodo.6452121, download an Excel file called `excell_iris_data.xlsx` and save it to your working directory. Do the same for the csv file as well.
The basic way of reading a tabular dataset is the `read.table()` function. There are a few others, like read.csv(), read.csv2(), read.ftable(), read.delim() etc. There are some basic parameters that you should be careful with -
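- file = the path to the file, e.g. /path/to/the/file
- header = whether the first line holds the column names. The default is `TRUE` for `read.csv()` and `read.delim()`, but `FALSE` for `read.table()`. If the data doesn’t have column names, set it to `FALSE`.
- sep = the field separator. For `read.table()`, it is set to white space. For `read.csv()`, it is a comma. For `read.csv2()`, it’s set to a semicolon. For `read.delim()` or `read.delim2()`, it is a tab, `"\t"`.
- row.names = if there’s a column of row names (usually the first column in the raw data), specify its column number; that column is then used as the row names instead of being kept as a data column. If you read it in as an ordinary column instead, you can get rid of it later using, say, `data[,column_number] <- NULL`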
Now we will read the csv file -
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
Now that we have read our data-frame and checked few rows from the top of it, we want to export it to our computer -
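The reading and exporting calls are not shown in this section, so here is a minimal sketch of both steps. To keep it runnable without the download, it first writes R's built-in `iris` data to a temporary csv file as a stand-in; the file names `iris_example.csv` and `iris_export.csv` are hypothetical, and `df_csv` is the data frame name used later in this chapter.

```r
# Stand-in for the downloaded csv file: write the built-in iris data to a
# temporary file (the real downloaded file name is not assumed here)
csv_path <- file.path(tempdir(), "iris_example.csv")
write.csv(iris, csv_path, row.names = FALSE)

# Read the csv file; read.csv() uses header = TRUE and sep = "," by default
df_csv <- read.csv(csv_path)
head(df_csv)

# Export the data frame; row.names = FALSE avoids writing an extra index column
out_path <- file.path(tempdir(), "iris_export.csv")
write.csv(df_csv, out_path, row.names = FALSE)
file.exists(out_path)
```

With your own download, you would replace `csv_path` with the name of the csv file in your working directory.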
To read an Excel file, with either the .xls or the .xlsx extension, we need to load a package called `readxl`.
# install.packages("readxl")
# library(readxl)
# df_excel <- read_excel('excell_iris_data.xlsx',sheet='Sheet1')
# head(df_excel)
Unfortunately, to write an Excel file, we need another package, `xlsx`.
However, in some cases loading the package `xlsx` in your `R` session may produce an error message due to an incompatibility with the `Java` version on your computer (and that’s why I have commented out those two code snippets above). You may fix this problem by installing an appropriate version of `Java` (there’s a solution on stackoverflow). Alternatively, you can load a fantastic package called `rio` and use its `import()` function to load your .xls or .xlsx files and its `export()` function to write your output data in those formats as well -
# install.packages("rio")
library(rio)
df_excel <- import('excell_iris_data.xlsx',sheet='Sheet1')
head(df_excel)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
Accordingly, for exporting (or writing) the file on your computer -
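The export call itself is not shown, so the following is a sketch under assumptions: it uses the built-in `iris` data as a stand-in, the output file name `iris_export.xlsx` is hypothetical, and it only runs if `rio` (together with the Excel support it relies on) is installed.

```r
# Runs only if the rio package is installed, including its Excel backend
if (requireNamespace("rio", quietly = TRUE)) {
  xlsx_path <- file.path(tempdir(), "iris_export.xlsx")  # hypothetical file name
  rio::export(iris, xlsx_path)       # the format is inferred from the extension
  df_back <- rio::import(xlsx_path)  # read it back to check the round trip
  print(dim(df_back))
}
```

Because `export()` picks the format from the file extension, the same call with a `.csv` name would write a csv file instead.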
4.4 Manipulating data
The first thing I would do after loading a data-frame into the R environment is check its dimensions, look at the top and bottom of the dataset, and look at the structure and summary of the data, using the following set of commands:
## [1] 150 5
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 145 6.7 3.3 5.7 2.5 virginica
## 146 6.7 3.0 5.2 2.3 virginica
## 147 6.3 2.5 5.0 1.9 virginica
## 148 6.5 3.0 5.2 2.0 virginica
## 149 6.2 3.4 5.4 2.3 virginica
## 150 5.9 3.0 5.1 1.8 virginica
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : chr "setosa" "setosa" "setosa" "setosa" ...
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## Length:150
## Class :character
## Mode :character
##
##
##
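The five outputs above correspond, in order, to the base R functions `dim()`, `head()`, `tail()`, `str()` and `summary()`. A minimal sketch, using the built-in `iris` data with `Species` converted to character as a stand-in for the imported `df_csv`:

```r
# Stand-in for df_csv: iris with Species as character, as read.csv() returns it
df_csv <- iris
df_csv$Species <- as.character(df_csv$Species)

dim(df_csv)       # number of rows and columns
head(df_csv)      # first six rows
tail(df_csv)      # last six rows
str(df_csv)       # column types and a preview of the values
summary(df_csv)   # per-column summary statistics
```

Both `head()` and `tail()` accept a second argument `n` if you want to see more or fewer than six rows.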
We have covered how to change the row or column names in the section Data frame.
Now we will see how to add a new column or row to the data-frame. As we have seen, there are 150 rows in the `df_csv` data-frame; therefore, we need a column that contains 150 data points. Let’s create a vector `random_numbers` that holds 150 random numbers -
## [1] -0.56 -0.23 1.56 0.07 0.13 1.72
## [1] 150
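The code that generated `random_numbers` is not shown. One way to produce such a vector is sketched below; drawing from a standard normal distribution with `rnorm()`, rounding to 2 decimals and the seed value are all assumptions (a seed of 123 appears consistent with the first six values printed above, but the seed actually used is not stated).

```r
set.seed(123)                            # assumed seed, for reproducibility
random_numbers <- round(rnorm(150), 2)   # 150 standard-normal draws, 2 decimals
head(random_numbers)
length(random_numbers)
```

Setting a seed before drawing random numbers makes the result repeatable, which is useful when others need to reproduce your analysis.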
Now, we will add this vector to the data-frame as a column -
new_df <- cbind(df_csv, random_numbers)
# alternatively,
#df_csv$random_number_column <- random_numbers
head(new_df)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species random_numbers
## 1 5.1 3.5 1.4 0.2 setosa -0.56
## 2 4.9 3.0 1.4 0.2 setosa -0.23
## 3 4.7 3.2 1.3 0.2 setosa 1.56
## 4 4.6 3.1 1.5 0.2 setosa 0.07
## 5 5.0 3.6 1.4 0.2 setosa 0.13
## 6 5.4 3.9 1.7 0.4 setosa 1.72
Similarly, to add a new row to the current data-frame, let’s create another vector with 6 elements, one for each of the six columns in `new_df`, and add it to the end of the data-frame -
new_row <- c(mean(new_df$Sepal.Length),
mean(new_df$Sepal.Width),
mean(new_df$Petal.Length),
mean(new_df$Petal.Width),
"random_sp",
mean(new_df$random_numbers)
)
new_df2 <- rbind(new_df, new_row)
tail(new_df2)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 146 6.7 3 5.2 2.3 virginica
## 147 6.3 2.5 5 1.9 virginica
## 148 6.5 3 5.2 2 virginica
## 149 6.2 3.4 5.4 2.3 virginica
## 150 5.9 3 5.1 1.8 virginica
## 151 5.84333333333333 3.05733333333333 3.758 1.19933333333333 random_sp
## random_numbers
## 146 -0.53
## 147 -1.46
## 148 0.69
## 149 2.1
## 150 -1.29
## 151 -0.0244666666666667
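Notice that in the output above the numeric columns now print as 3 rather than 3.0, and the new row shows many decimals. This is because `c()` converted every element of `new_row` to character (a vector holds one type only), and `rbind()` then converted the numeric columns to character as well. A sketch of one alternative, which builds the new row as a one-row data frame so that each value keeps its type, is given below; the stand-in data and the seed are assumptions made only so the block runs on its own.

```r
# Stand-in for new_df: iris with character Species plus a numeric column
set.seed(123)  # assumed seed, only to make the sketch reproducible
new_df <- iris
new_df$Species <- as.character(new_df$Species)
new_df$random_numbers <- round(rnorm(150), 2)

# Build the new row as a one-row data frame so each value keeps its own type
new_row <- data.frame(
  Sepal.Length = mean(new_df$Sepal.Length),
  Sepal.Width = mean(new_df$Sepal.Width),
  Petal.Length = mean(new_df$Petal.Length),
  Petal.Width = mean(new_df$Petal.Width),
  Species = "random_sp",
  random_numbers = mean(new_df$random_numbers)
)

new_df2 <- rbind(new_df, new_row)
tail(new_df2)
str(new_df2)   # the numeric columns are still numeric
```

Keeping the columns numeric matters for later steps, since functions such as `mean()` or `sum()` do not work on character columns.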
Now, we will touch on another important topic: dealing with missing data. Missing data in R are represented by the symbol NA (not available), while undefined results (e.g. dividing zero by zero) are represented by the symbol NaN (not a number). Here we will deal only with NA, which is the most common in day-to-day R. If we have a vector `x` which contains one or more NA values, we use the `is.na()` function to inquire about the NA values -
## [1] FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE
## [1] TRUE
The problem with missing values is that they hamper simple operations, like calculating the mean or the sum -
## [1] NA
## [1] NA
However, if you include the magic argument `na.rm=TRUE`, then the problem is overcome and those operations are performed on the rest of the valid elements of the vector `x` -
## [1] 21
## [1] 3.5
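The vector `x` and the calls behind these outputs are not shown. The sketch below uses an assumed vector that is consistent with them: eight elements, missing values in positions 3 and 6, and six valid values that sum to 21.

```r
# Assumed example vector: eight elements, two of them missing
x <- c(1, 2, NA, 3, 4, NA, 5, 6)

is.na(x)                # TRUE where a value is missing
any(is.na(x))           # TRUE if at least one value is missing
sum(is.na(x))           # number of missing values

sum(x)                  # NA, because missing values propagate
mean(x)                 # NA
sum(x, na.rm = TRUE)    # 21, computed on the six valid elements
mean(x, na.rm = TRUE)   # 3.5
```

Counting with `sum(is.na(x))` works because `TRUE` is treated as 1 and `FALSE` as 0 in arithmetic.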
Another way to handle missing data points is to impute them. A basic way of doing so is to take the mean of the rest of the data points and replace the NA values with that mean -
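A minimal sketch of mean imputation follows. The example vector is an assumption (the original `x` is not shown), and `x_imputed` is a name introduced here so that the original vector stays untouched.

```r
x <- c(1, 2, NA, 3, 4, NA, 5, 6)   # assumed example vector

x_imputed <- x
x_imputed[is.na(x_imputed)] <- mean(x, na.rm = TRUE)  # fill NAs with the mean, 3.5
x_imputed
mean(x_imputed)   # the mean is unchanged by mean imputation
```

Working on a copy makes it easy to compare the data before and after imputation.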
There are more complex ways to impute missing data, but we will not cover those in this workshop.
4.4.0.1 Short exercise on manipulating data
Please do the following steps -
- Download the file named “csv_iris_data_NA.csv” from https://doi.org/10.5281/zenodo.6452121 and import it into your R environment.
- Find out how many data points are missing.
- Impute the missing data points with the median value of the respective columns. Round the values of the imputed data points to 2 decimal places using the function round().
- What are the sums of each column (with numeric values)?