Cleaning data in python datacamp answers
From DataCamp. Show
1. Common data ProblemsCommon data types
Data type constrainsManipulating and analyzing data with incorrect data types could lead to compromised analysis as you go along the data science workflow. When working with new data, we could use the To describe data and check data types:The bicycle ride sharing data in San Francisco, The excise will
Summing strings and concatenating numbersAnother common data type problem is importing what should be numerical values as strings, as mathematical operations such as summing and multiplication lead to string concatenation, not numerical outputs. This exercise will convert the string column
Data range constrainsSometimes there might show up values that is out of the data range. For example, a future time included in the time point; or six stars in a five-star-system. Ways to deal with it:
Tire size constraintsBicycle tire sizes could be either 26″, 27″ or 29″ and are here correctly stored as a categorical value. In an effort to cut maintenance costs, the ride sharing provider decided to set the maximum tire size to be 27″. In this exercise, the
Back to the futureA new update to the data pipeline feeding into the A bug was discovered which was relaying rides taken today as taken next year. To fix this, you will find all instances of the The
DuplicationsHow big is your subset? You have the following
Choose the correct usage of The update however coincided with radically shorter average ride duration times and irregular user birth dates set in the future. Most importantly, the number of rides taken has increased by 20% overnight, leading you to think there might be both complete and incomplete duplicates in the In this exercise, you will confirm this suspicion by finding those duplicates. A sample of
Treating duplicatesIn the last exercise, you were able to verify that
the new update feeding into In this exercise, you will be treating those duplicated rows by first dropping complete duplicates, and then merging the incomplete duplicate rows into one while keeping the average
2. Text and categorical data problemsDifferent types of constraints:
Membership constraints: when recording content that should not exist. F. eks. when recording blood type, misspell the type from A+ to Z+. Other examples:
Finding consistencyIn this exercise and throughout this chapter, we will be working with the The DataFrame contains flight metadata such as the airline, the destination, waiting times as well as answers to key questions
regarding cleanliness, safety, and satisfaction. Another DataFrame named In this exercise, we will use both of these DataFrames to find survey answers with inconsistent values, and drop them, effectively performing an outer and inner join on both these DataFrames as seen in the video exercise. The
The output looks like this:
Take a look at the output. Out of the cleanliness, safety and satisfaction columns, which one has an inconsistent category and what is it? Next, find the column with different values
using
And this gives the following output when exploring the data:
Categories of errorsTo address common problems affecting categorical variables in the data includes white spaces and inconsistencies in the categories, and the problem of creating new categories and mapping existing ones to new ones. First, we can take a look at the values for a column using:
This will give an overview of numbers of values/categories for the variable. Than we can address the problems by: White spaces and inconsistencies:
Collapsing all of the state Creating or remapping categories:
Collapsing data into
categories: Create categories out of data -
The
Map categories to fewer ones: reducing categories in categorical column. For example:
This
returns: Inconsistent categoriesThe DataFrame contains flight metadata such as the airline, the destination, waiting times as well as answers to key questions regarding cleanliness, safety, and satisfaction on the San Francisco Airport. We will examine two categorical columns from this DataFrame,
The problems with the columns:
Remapping categoriesTo better understand survey respondents from airlines, you want to find out if there is a relationship between certain responses and the day of the week and wait time at the gate. The
The Instructions:
Cleaning text data
Removing titles and taking namesWhile collecting survey respondent metadata in the Your ultimate objective is to create two new columns named The
Keeping it descriptiveTo further understand travelers’ experiences in the San Francisco Airport, the quality assurance department sent out a qualitative questionnaire to all travelers who gave the airport the worst score on all possible categories. The objective behind this questionnaire is to identify common patterns in what travelers are saying about the airport. Their response is stored in the The
3. Advanced data problemsUniformity
Creating temperature data From F to C: C=(F−32)×5/9
Datetime formatting
Treating date data
Ambiguous datesYou have a DataFrame containing a subscription_date column that was collected from various sources with different Date formats such as YYYY-mm-dd and YYYY-dd-mm. What is the best way to unify the formats for ambiguous values such as 2019-04-07?
Uniform currenciesIn this exercise and throughout this chapter, you
will be working with a retail You are tasked with understanding the average account size and how investments vary by the size of account, however in order to produce this analysis accurately, you first need to unify the currency amount
into dollars. The
Uniform datesAfter having unified the currencies of your different account amounts, you want to add a temporal dimension to your analysis and see how customers have been investing their money given the size of their account over each year. The However, since this data was consolidated from multiple sources, you need to
make sure that all dates are of the same format. You will do so by converting this column into a
Take a look at the output. You tried converting the values to datetime using the default to_datetime() function without changing any argument, however received the following error: Why do you think that is?
Cross field validationThe use of multiple fields in a dataset to sanity check data integrity. Here, we specify Test from iPad Here, we specify And here we check whether the |