In statistics, exploratory data analysis, or EDA, is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model may or may not be used, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task. Exploratory data analysis is a loosely defined term that involves using graphics and basic sample statistics, such as the mean, median, or standard deviation, to get a feeling for what information might be obtainable from your data set. EDA is a set of techniques that allows analysts to quickly look at data for trends, outliers, and patterns. The eventual goal of EDA is to obtain theories that can later be tested in the modeling step. Exploratory data analysis employs a variety of techniques, mostly graphical, to maximize insight into a data set: uncover underlying structure, extract important variables, detect outliers and anomalies, test underlying assumptions, develop parsimonious models, and determine optimal factor settings. Three popular data analysis approaches are classical, exploratory data analysis, and Bayesian. These three approaches are similar in that they all start with a general science or engineering problem, and all yield science or engineering conclusions. The difference is the sequence and focus of the intermediate steps. For classical analysis, data collection is followed by the imposition of a model (normality or linearity, for example), and the analysis, estimation, and testing that follow are focused on the parameters of that model. For EDA, data collection is not followed by the imposition of a model; rather, it is followed immediately by analysis, with the goal of inferring what model would be appropriate. Unlike the classical approach, the exploratory data analysis approach does not impose deterministic or probabilistic models on the data.
On the contrary, the EDA approach allows the data to suggest admissible models that best fit the data. Finally, in a Bayesian analysis, the analyst attempts to answer research questions about unknown parameters using probability statements based on prior data. They may bring their own domain knowledge and expertise to the analysis as new information is obtained. So the purpose of Bayesian analysis is to determine posterior probabilities based on prior probabilities and new information. The posterior probability is the probability an event will happen after all evidence or background information has been taken into account. The prior probability is the probability an event will happen before you take any new evidence into account. EDA techniques are generally graphical. They include scatter plots, box plots, histograms, etc. In the real world, data analysts freely mix elements of all three approaches, and other approaches as well. The above distinctions were made to emphasize the major differences among the three approaches.

How is EDA used in machine learning? As we mentioned, the exploratory data analysis approach does not impose deterministic or probabilistic models on the data. On the contrary, the EDA approach allows the data to suggest admissible models that best fit the data. Exploratory data analysis focuses on the data: its structure, its outliers, and the models suggested by the data. Although there are other methods, exploratory data analysis is typically performed using the following methods. Univariate analysis is the simplest form of analyzing data. Uni means one; in other words, your data has only one variable. It doesn't deal with causes or relationships, unlike regression, and its major purpose is to describe. It takes the data, summarizes it, and finds patterns in it. In this example you see two types of univariate data, categorical and continuous.
With the categorical feature type you can perform numerical EDA using pandas' crosstab function, and you can perform visual EDA using Seaborn's count plot function. With the continuous feature type you can perform numerical EDA using pandas' describe function, and you can visualize box plots, distribution plots, and kernel density estimation (KDE) plots in Python using Matplotlib or Seaborn. There are many EDA tools at your disposal, but that is beyond the scope of this lesson. In this univariate data example, there's just one feature, ocean proximity, with five categories. You can use Seaborn's count plot function to count the number of observations in each category. Our visualization is a simple bar chart. Bivariate analysis means the analysis of bivariate data. It is one of the simplest forms of statistical analysis and is used to find out if there is a relationship between two sets of values. It usually involves the variables X and Y. We can analyze bivariate data and multivariate data in Python using Matplotlib or Seaborn, and there are other tools as well. One of the most powerful features of Seaborn is the ability to easily build conditional plots. This lets us see what the data looks like when segmented by one or more variables. The easiest way to do this is through the factor plot method, which is used to draw a categorical plot onto a facet grid. Seaborn's joint plot function draws a plot of two variables with bivariate and univariate graphs. Seaborn's factor plot map method can map a factor plot onto a KDE distribution or box plot chart. A common plot of bivariate data is the simple line plot. In this example we use Seaborn's regplot function to visualize a linear relationship between two sets of features. In this case, trip distance (our x variable) and fare amount (our target) appear to have a linear relationship. Note that although the majority of the data tend to group together in a linear fashion, there are also outliers present.
The purpose of EDA is to find insights which will serve for data cleaning, preparation, or transformation, and which will ultimately be used in a machine learning algorithm. We use data analysis and data visualization at every step of the machine learning process: data exploration, data cleaning, model building, and presenting results. These steps can all belong to one notebook. Let's have a look at some examples. A histogram is a graphical display of data using bars of different heights. In a histogram, each bar groups numbers into ranges; taller bars show that more data falls in that range. A histogram displays the shape and spread of continuous sample data. In this example we use [INAUDIBLE] plot function to plot a histogram of the feature median house value. Another commonly used plot type is the simple scatter plot. Instead of points being joined by line segments, as in a line plot, here the points are represented individually with a dot, circle, or other shape. In this example, we use Matplotlib's pyplot plot function to plot a scatter plot. A scatter plot is a graph in which the values of two variables are plotted against two axes, the pattern of the resulting points revealing any correlation that may be present. Here we can see that by plotting housing location, latitude on the X axis and longitude on the Y axis, the revealed correlation pattern is the state of California. In this example, we use Seaborn's heat map function to show correlations. A heat map is a graphical representation of data that uses a system of color coding to represent different values. For example, you can see the correlation between all the features in your data set. The lighter the shade, the stronger the correlation. This is a quick and easy way to see which features may influence your target. If you think about it, a heat map plots multiple variables and can be thought of as an example of multivariate graphical analysis, another area of exploratory data analysis.
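The histogram, scatter plot, and correlation heat map described above can be sketched as follows. This is a minimal sketch on synthetic housing-style data; the column names and value ranges are assumptions for illustration, not the lesson's actual data set.

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend for script use
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic housing-style data (column names are assumptions).
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "longitude": rng.uniform(-124, -114, 500),
    "latitude": rng.uniform(32, 42, 500),
    "median_income": rng.lognormal(1.0, 0.5, 500),
})
df["median_house_value"] = (
    50_000 * df["median_income"] + rng.normal(0, 20_000, 500))

# Histogram: shape and spread of a continuous feature.
plt.hist(df["median_house_value"], bins=30)
plt.savefig("hist.png"); plt.clf()

# Scatter plot: one marker per observation. With real coordinates,
# plotting location this way would trace the outline of California.
plt.scatter(df["longitude"], df["latitude"], s=5)
plt.savefig("scatter.png"); plt.clf()

# Heat map of the correlation matrix; stronger correlations stand out
# and hint at which features may influence the target.
sns.heatmap(df.corr(), annot=True, cmap="viridis")
plt.savefig("heatmap.png")
```

In the heat map here, median_income correlates strongly with median_house_value by construction, which is the kind of feature-to-target signal this plot is meant to surface.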
So to summarize, data analysis, which is the second step in the ML pipeline, is a crucial milestone and must be used to prepare the data before model training. The purpose of exploratory data analysis includes gaining maximum insight into the data set and its underlying structure, creating a list of outliers or other anomalies, and, most importantly, identifying the most influential features. There are many more ways to explore, analyze, and plot data. Make it a goal to expand your knowledge of them. Have fun.