Chapter 1 Introduction

This is the second and last workshop in this series for this year. In the first workshop, I introduced very basic R functions for handling data and generating basic plots. Some of you wondered how useful R would be in your biological research in the coming months. That might be due to my choice of very simple datasets that come with base R. I was consciously avoiding more complex real-life datasets (especially those related to molecular biology or omics) so that you (at least ~70% of you) would not be intimidated by them while encountering R syntax for the first time.

In the first half of today’s workshop, we will learn a more efficient way of handling and manipulating data using an R package called dplyr, and we will generate plots using another package called ggplot2. However, we will still be using built-in data from base R. Don’t be disheartened; soon we will shift our attention to real-life clinical data. We will be using a few datasets that were part of a study called METABRIC (Molecular Taxonomy of Breast Cancer International Consortium). These datasets characterise the genomic mutations (SNVs and CNAs) and gene expression profiles of over 2000 primary breast tumours. In addition, detailed clinical information for this study is available from cBioPortal alongside the experimental data, and we will integrate the two. You can follow the little download sign on that page or you can click here to download the dataset. Save the brca_metabric.tar.gz file somewhere on your computer and decompress it. We will import some of the files from there.

In this workshop, we are not planning to do any major data analysis; rather, we will stick to the realm of what is known by the fancy name of exploratory data analysis (EDA), formatting data and drawing some informative plots. We will learn a few important functions (or verbs) for data manipulation. We will find out which mutation type is the most prominent. We will also generate a word cloud from the most affected genes in the patient cohort.

We will look at the expression of the GATA3 transcription factor across PAM50-classified samples and across samples with different ER status. We will also look at the age distribution of the patients for some selected mutated genes. Lastly, we will explore the concept of co-occurrence of mutations among some cancer-related genes in the METABRIC cohort.