Which of the following is the correct definition of exploratory data analysis?

Approach Exploratory Data Analysis (EDA) is an approach/philosophy for data analysis that employs a variety of techniques (mostly graphical) to

Nội dung chính Show

Exploratory Data Analysis with Chartio
Univariate Analysis
Multivariate analysis
Scatter Plot
Which of the following is the correct definition of exploratory data analysis quizlet?
What is meant by exploratory data analysis?
Which of the following options best describes exploratory data analysis?
What is your definition of exploratory data visualization?

maximize insight into a data set;
uncover underlying structure;
extract important variables;
detect outliers and anomalies;
test underlying assumptions;
develop parsimonious models; and
determine optimal factor settings.

Focus The EDA approach is precisely that--an approach--not a set of techniques, but an attitude/philosophy about how a data analysis should be carried out. Philosophy EDA is not identical to statistical graphics although the two terms are used almost interchangeably. Statistical graphics is a collection of techniques--all graphically based and all focusing on one data characterization aspect. EDA encompasses a larger venue; EDA is an approach to data analysis that postpones the usual assumptions about what kind of model the data follow with the more direct approach of allowing the data itself to reveal its underlying structure and model. EDA is not a mere collection of techniques; EDA is a philosophy as to how we dissect a data set; what we look for; how we look; and how we interpret. It is true that EDA heavily uses the collection of techniques that we call "statistical graphics", but it is not identical to statistical graphics per se. History The seminal work in EDA is Exploratory Data Analysis, Tukey, (1977). Over the years it has benefitted from other noteworthy publications such as Data Analysis and Regression, Mosteller and Tukey (1977), Interactive Data Analysis, Hoaglin (1977), The ABC's of EDA, Velleman and Hoaglin (1981) and has gained a large following as "the" way to analyze a data set. Techniques Most EDA techniques are graphical in nature with a few quantitative techniques. The reason for the heavy reliance on graphics is that by its very nature the main role of EDA is to open-mindedly explore, and graphics gives the analysts unparalleled power to do so, enticing the data to reveal its structural secrets, and being always ready to gain some new, often unsuspected, insight into the data. In combination with the natural pattern-recognition capabilities that we all possess, graphics provides, of course, unparalleled power to carry this out.

The particular graphical techniques employed in EDA are often quite simple, consisting of various techniques of:

The process of using numerical summaries and visualizations to explore your data and to identify potential relationships between variables is called exploratory data analysis, or EDA.

Exploratory data analysis is an investigative process in which you use summary statistics and graphical tools to get to know your data and understand what you can learn from them.

With EDA, you can find anomalies in your data, such as outliers or unusual observations, uncover patterns, understand potential relationships among variables, and generate interesting questions or hypotheses that you can test later using more formal statistical methods.

Exploratory data analysis is like detective work: you're searching for clues and insights that can lead to the identification of potential root causes of the problem you are trying to solve. You explore one variable at a time, then two variables at a time, and then many variables at a time.

Although EDA encompasses tables of summary statistics such as the mean and standard deviation, most people focus on graphs. You use a variety of graphs and exploratory tools, and you go where your data take you. If one graph or analysis isn't informative, you look at the data from another perspective.

Because EDA involves exploring, it is iterative. You are likely to learn different aspects about your data from different graphs. Typical goals are understanding:

Exploratory data analysis is the process of performing investigations on data to understand the data better.

In this, initial investigations are done to determine patterns, spot abnormalities, test hypotheses, and also to check if the assumptions are right.

In data mining, Exploratory Data Analysis (EDA) is an approach to analyzing datasets to summarize their main characteristics, often with visual methods. EDA is used for seeing what the data can tell us before the modeling task. It is not easy to look at a column of numbers or a whole spreadsheet and determine important characteristics of the data. It may be tedious, boring, and/or overwhelming to derive insights by looking at plain numbers. Exploratory data analysis techniques have been devised as an aid in this situation.

Exploratory data analysis is generally cross-classified in two ways. First, each method is either non-graphical or graphical. And second, each method is either univariate or multivariate (usually just bivariate).

Exploratory Data Analysis with Chartio

We will perform exploratory data analysis on the iris dataset to familiarize ourselves with the EDA process. Let’s look at a few sample data points:

The dataset contains four features – sepal length, sepal width, petal length, and petal width for each the different species (versicolor, virginica, setosa) of the iris flower. In the dataset, there are 50 instances (rows of data) of each species, a total of 150 data points.

Univariate Analysis

Univariate analysis is the simplest form of data analysis, where the data being analyzed consists of only one variable. Since it’s a single variable, it doesn’t deal with causes or relationships. The main purpose of univariate analysis is to describe the data and find patterns that exist within it. Let us look at a few visualizations used for performing univariate analysis.

Box Plots

A box and whisker plot – also called a box plot – displays the five-number summary of a set of data. The five-number summary is the minimum, first quartile, median, third quartile, and maximum.

The box plots created in Chartio provide us with the summary of the four numerical features in the dataset. We can observe that the distribution of petal length and width is more spread out, as exhibited by the bigger size of the boxes. Whereas, the sepal length and width is concentrated around it’s median. Moreover, in the sepal width box plot, we can observe a few outliers, as shown by the dots above and below the whisker.

Histogram

A histogram is a plot that lets you discover, and show, the underlying frequency distribution (shape) of a set of continuous data. This allows the inspection of the data for its underlying distribution (e.g. normal distribution), outliers, skewness, etc.

The above plots show the histogram of sepal and petal widths made in Chartio. From the charts it can be observed that the sepal width follows a Gaussian distribution. However, petal width is more skewed towards the right, and the majority of the flower samples have a petal width less than 0.4 cm.

Multivariate analysis

Multivariate data analysis refers to any statistical technique used to analyze data that arises from more than one variable. This models more realistic applications, where each situation, product, or decision involves more than a single variable. Let us look at a few visualizations used for performing multivariate analysis.

Scatter Plot

A scatter plot is a two-dimensional data visualization that uses dots to represent the values obtained for two different variables – one plotted along the x-axis and the other plotted along the y-axis.

Above are examples of two scatter plots made using Chartio. We can observe that there is a linear relationship between petal length and width. However, with increase in sepal length, the sepal width does not increase proportionally – hence they do not have a linear relationship.

In a scatter plot, if the points are color-coded, an additional variable can be displayed. For example, let us create the petal length vs width chart below by color coding each point based on the flower species.

We can observe that ‘setosa’ species has the lowest petal length and width, ‘virginica’ has the highest, and ‘versicolor’ lies between them. By plotting more dimensions, deeper insights can be drawn from the data.

Bar Chart

A bar chart represents categorical data, with rectangular bars having lengths proportional to the values that they represent. For example, we can use the iris dataset to observe the average petal and sepal lengths/widths of all the different species.

Observing the bar charts, we can conclude that ‘virginica’ has the highest petal length, petal width and sepal length, followed by ‘versicolor’ and ‘setosa’. However, sepal width deviates from this trend where ‘setosa’ is highest followed by ‘virginica’ and ‘versicolor’.

The exploratory data analysis we performed provides us with a good understanding of what the data contains. Once this stage is complete, we can perform more complex modeling tasks such as clustering and classification.

Apart from the charts shown in our EDA example, we can use various other charts depending on the characteristics of our data:

Line charts to show changes over time
Pie charts to show the relationship between a part to a whole
Map charts to visualize location data

Conclusion

EDA is a crucial step to take before diving into machine learning or statistical modeling because it provides the context needed to develop an appropriate model for the problem at hand and to correctly interpret its results. EDA is valuable to the data scientist to make certain that the results they produce are valid, correctly interpreted, and applicable to the desired business contexts.