For a project on exploratory data analysis using R, I chose the wine quality data set since I had some experience working with wine data from my Freshman Research Initiative days at UT. Back then, I used several statistical software packages to analyze our research group’s data. We were making array sensors that can differentiate wine varietals. You can check out one piece of work based on that here. In the process, I learned about statistical pattern recognition, particularly linear discriminant analysis and principal component analysis. But now, I am learning exploratory data analysis. Back when I was analyzing our wine research data, we never did any exploratory analysis. So it’s better to learn this late than never.
The main things I learned from the Udacity course were how to load and manipulate CSV files and how to create plots for univariate, bivariate, and multivariate analysis. I also learned how to determine which variables contribute the most to a particular output variable at this stage of the analysis (even before any machine learning is done).
The wine quality data set is actually two files, one for the red wines and one for the white wines. I initially just wanted to use the red wines data set but thought about adding the white wines data set towards the end. So I looked for a way to combine two data sets in R and found rbind(). Combining was pretty easy, but I first needed to do some manipulation of the individual CSV files after loading them into R with read.csv(). The individual files have the same headers, and the first column is X, the observation number, which I needed to drop before combining the two files. I also needed to create a new column, “type”, which identifies each observation as red or white wine.
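The steps above can be sketched roughly as follows. This is a minimal illustration, not the exact code from the project: the two temporary files and their values are made-up stand-ins for the real red and white CSV files, and the names reds, whites, and wines are hypothetical.

```r
# Toy stand-ins for the two CSV files (file paths and values are made up).
red_path   <- tempfile(fileext = ".csv")
white_path <- tempfile(fileext = ".csv")
write.csv(data.frame(X = 1:3, alcohol = c(9.4, 9.8, 10.5), quality = c(5L, 5L, 6L)),
          red_path, row.names = FALSE)
write.csv(data.frame(X = 1:2, alcohol = c(8.8, 9.5), quality = c(6L, 6L)),
          white_path, row.names = FALSE)

# Load each file into its own data frame.
reds   <- read.csv(red_path)
whites <- read.csv(white_path)

# Drop X, the observation number, so row IDs from the two files do not clash.
reds$X   <- NULL
whites$X <- NULL

# Label each observation with its wine type before combining.
reds$type   <- "red"
whites$type <- "white"

# rbind() stacks the rows; both data frames must have the same column names.
wines <- rbind(reds, whites)
wines$type <- factor(wines$type)

str(wines)
```

Adding the “type” column before calling rbind() matters, because once the rows are stacked there is no other record of which file each observation came from.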
The main goal I had in my exploratory data analysis was to determine which variables influence wine quality. There were 11 input variables given:
The output variable is “quality”, which takes a value from 0 to 10, with 10 being the highest. “Quality” was stored as an integer; the rest were numeric variables. I had to convert “quality” into a categorical variable before analyzing the data. This site explains how to create factor variables and ordered factor variables.
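As a small sketch of that conversion, here is how an integer score can be turned into a factor and into an ordered factor in base R. The scores below are made up for illustration, and the variable names are hypothetical.

```r
# Hypothetical integer quality scores (not taken from the real data).
quality <- c(5L, 6L, 7L, 5L, 3L, 8L)

# Plain factor: scores become categories with no ranking between them.
quality_factor <- factor(quality)

# Ordered factor: categories keep their ranking, so comparisons work.
quality_ordered <- factor(quality, ordered = TRUE)

levels(quality_factor)
is.ordered(quality_ordered)

# Comparison is only meaningful on the ordered version (5 is lower than 6).
quality_ordered[1] < quality_ordered[2]
```

An ordered factor is a natural fit here, since a score of 7 really is better than a score of 5, and plotting functions will keep the categories in that order.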
There were three parts to the project. Univariate analysis mainly looks at histograms. Bivariate analysis allowed me to look at the relationships between two variables, and multivariate analysis was the most challenging part. I spent most of my time constructing plots by trial and error. The awesome thing is that the more plots you create, the better you understand the data.
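To give a feel for the three kinds of plots, here is a rough sketch using base R graphics. It is only an illustration: the data are randomly simulated rather than the real wine measurements, the column choices (alcohol, density) are assumptions made for the example, and the post does not say which plotting functions the project actually used.

```r
# Simulated stand-in data (random values, NOT the real wine data set).
set.seed(1)
n <- 60
wines <- data.frame(
  alcohol = round(runif(n, 8, 14), 1),
  density = round(runif(n, 0.990, 1.002), 4),
  quality = factor(sample(3:8, n, replace = TRUE), ordered = TRUE),
  type    = factor(sample(c("red", "white"), n, replace = TRUE))
)

# Univariate: distribution of a single variable.
hist(wines$alcohol, main = "Alcohol", xlab = "alcohol")

# Bivariate: one input variable against the quality categories.
boxplot(alcohol ~ quality, data = wines, xlab = "quality", ylab = "alcohol")

# Multivariate: two inputs, with point color encoding a third variable (type).
type_colors <- c("firebrick", "goldenrod")
plot(wines$alcohol, wines$density,
     col = type_colors[as.integer(wines$type)], pch = 19,
     xlab = "alcohol", ylab = "density")
legend("topright", legend = levels(wines$type), col = type_colors, pch = 19)
```

Each step adds one more variable to the picture, which is why the multivariate plots take the most trial and error: there are many more ways to map variables to axes, colors, and panels.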