September 15, 2018

My first foray into coding came from my interest in data science. At the time, I thought my background was good enough to transition from chemistry to data science. As a postdoc for a chemistry professor, I used to analyze data using multivariate data analysis. I then took an online course on data analysis. Udacity’s Data Analyst Nanodegree projects include one that is a little bit similar to what I used to do: analyzing data on different wine varietals.

For the project on *exploratory data analysis* using R, I chose the wine quality data set since I had some experience working with wine data from my Freshman Research Initiative days at UT. Back then, I was using statistical software to analyze our research group’s data. We were making array sensors that could differentiate wine varietals. You can check out one piece of work based on that here.

In the process, I learned about statistical pattern recognition, particularly linear discriminant analysis and principal component analysis. But now, I am learning exploratory data analysis. Back when I was doing the analysis on our wine research data, we never did any exploratory analysis. (Shame, shame.)

The main thing I learned from the Udacity course was how to load and manipulate csv files and create plots for univariate, bivariate and multivariate analysis. I also learned how to determine which variables contribute the most to a particular output variable at this stage of the analysis (even before any machine learning is done).

The wine quality data set is actually two files, one for the red wines and one for the white wines. I initially just wanted to use the red wines data set but thought about adding the white wines data set towards the end. So I looked for a way to combine two data sets in R and found `rbind()`. It was pretty easy, but I figured that I needed to do some manipulation of the individual csv files after loading them in R using the `read.csv()` command. The individual files have the same headers, and the first column is X, the observation number, which I needed to get rid of prior to combining the two files. I also needed to create a new column, “type”, which identifies each observation as red or white wine.

```
# loading the files
redwine <- read.csv('wineQualityReds.csv')
whitewine <- read.csv('wineQualityWhites.csv')
# removing the 'X' column
redwine$X <- NULL
whitewine$X <- NULL
# creating the 'type' column; assigning "1" to red and "0" to white
redwine$type <- 1
whitewine$type <- 0
# combining the two data frames to make another data frame
wines <- rbind(redwine, whitewine)
```

The main goal I had in my exploratory data analysis was to determine which variables influence wine quality. There were 11 input variables given:

- fixed.acidity
- volatile.acidity
- citric.acid
- residual.sugar
- chlorides
- free.sulfur.dioxide
- total.sulfur.dioxide
- density
- pH
- sulphates
- alcohol
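One quick way to get a first feel for which inputs track quality is to rank them by their correlation with the output. The snippet below is only a minimal sketch on made-up data, not the analysis from my project: the `toy` data frame, its three columns, and the fabricated quality formula are all assumptions for illustration. It also assumes “quality” is still numeric at this point.

```r
# Toy stand-in for the combined wine data; all values are made up for illustration.
set.seed(42)
n <- 200
toy <- data.frame(
  alcohol          = rnorm(n, mean = 10.5, sd = 1),
  volatile.acidity = rnorm(n, mean = 0.5, sd = 0.15),
  pH               = rnorm(n, mean = 3.3, sd = 0.15)
)

# Fabricated quality score that depends on alcohol and volatile.acidity only
toy$quality <- round(5 + 0.8 * (toy$alcohol - 10.5) -
                       3 * (toy$volatile.acidity - 0.5) + rnorm(n, sd = 0.5))

# Correlation of each input with quality (quality must still be numeric here)
inputs <- setdiff(names(toy), "quality")
cors <- sapply(inputs, function(v) cor(toy[[v]], toy$quality))

# Rank inputs by absolute correlation, strongest first
ranked <- cors[order(abs(cors), decreasing = TRUE)]
print(round(ranked, 2))
```

Correlation only captures linear relationships, so a ranking like this is a starting point for deciding which plots to look at first rather than a final answer.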

The output variable is “quality”, which takes a value from 0 to 10, with 10 being the highest. “Quality” was an integer data type; the rest were numeric variables. I had to convert “quality” into a categorical variable before analyzing the data. This site explains how to create factor variables and ordered factor variables.
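To make the difference between a factor and an ordered factor concrete, here is a small sketch using base R's `factor()`. The `quality_int` vector is a made-up stand-in for the real quality column, so treat the values as hypothetical.

```r
# Toy quality scores; made-up values standing in for the real quality column
quality_int <- c(5L, 6L, 7L, 5L, 3L, 8L, 6L)

# Plain factor: categories with no inherent order
quality_fac <- factor(quality_int)

# Ordered factor: categories ranked from lowest to highest score
quality_ord <- factor(quality_int, ordered = TRUE)

levels(quality_ord)               # "3" "5" "6" "7" "8"
quality_ord[1] < quality_ord[2]   # TRUE, since 5 comes before 6 in the level order
table(quality_ord)                # counts per quality level
```

An ordered factor seems like the natural fit for a score like quality, since comparisons such as `<` only make sense when the levels have a ranking.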

There were three parts to the project. Univariate analysis mainly looks at histograms. Bivariate analysis allowed me to look at the relationships between two variables, and multivariate analysis was the most challenging part. I spent most of my time constructing plots by trial and error. The awesome thing about it is that the more plots you create, the better you understand the data.
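As a rough illustration of the three stages, the sketch below draws one plot of each kind with base R graphics. It is not the code from my project: the `toy` data frame is randomly generated, and the choice of base R plotting functions and of these particular variables is an assumption made to keep the example self-contained.

```r
# Toy data; all values are made up for illustration
set.seed(1)
n <- 150
toy <- data.frame(
  alcohol = rnorm(n, mean = 10.5, sd = 1),
  density = rnorm(n, mean = 0.995, sd = 0.002),
  type    = sample(c(0, 1), n, replace = TRUE)   # 1 = red, 0 = white
)
toy$quality <- factor(sample(4:8, n, replace = TRUE), ordered = TRUE)

# Univariate: distribution of a single variable
hist(toy$alcohol, main = "Alcohol", xlab = "alcohol")

# Bivariate: one numeric variable across the quality categories
boxplot(alcohol ~ quality, data = toy, xlab = "quality", ylab = "alcohol")

# Multivariate: two numeric variables, colored by wine type
plot(toy$alcohol, toy$density,
     col = ifelse(toy$type == 1, "darkred", "goldenrod"),
     pch = 19, xlab = "alcohol", ylab = "density")
legend("topright", legend = c("red", "white"),
       col = c("darkred", "goldenrod"), pch = 19)
```

Because the toy values are random, these plots show no real pattern; the point is only the shape of each kind of plot.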

To see my final project, here is the HTML output of my Rmd file. To see the corresponding R code, click here.