From DataCamp.
1. Common data Problems
Common data types
- Numeric data types
- Text
- Dates
Data type constrains
Manipulating and analyzing data with incorrect data types could lead to compromised analysis as you go along the data science workflow.
When working with new data, we could use the .dtypes
attribute or the .info[]
method. Often times, you’ll run into columns that should be converted to different data types before starting any analysis.
To describe data and check data types:
The bicycle ride sharing data in San Francisco, ride_sharing
, contains information on the start and end stations, the trip duration, and some user information for a bike sharing service.
The excise will
- Print the information of
ride_sharing
. Use.describe[]
to print the summary statistics of theuser_type
column fromride_sharing
.
# Print the information of ride_sharing
print[ride_sharing.info[]]
# Print summary statistics of user_type column
print[ride_sharing['user_type'].describe[]]
Summing strings and concatenating numbers
Another common data type problem is importing what should be numerical values as strings, as mathematical operations such as summing and multiplication lead to string concatenation, not numerical outputs.
This exercise will convert the string column duration
to the type int
. First, strip "minutes"
from
the column in order to make sure pandas
reads it as numerical. The pandas
package has been imported as pd
.
- Use the
.strip[]
method to stripduration
of"minutes"
and store it in theduration_trim
column. - Convert
duration_trim
toint
and store it in theduration_time column
. - Write an
assert
statement that checks ifduration_time
’s data type is now anint
. - Print the average ride duration.
# Strip duration of minutes
ride_sharing['duration_trim'] = ride_sharing['duration'].str.strip['minutes']
# Convert duration to integer
ride_sharing['duration_time'] = ride_sharing['duration_trim'].astype['int']
# Write an assert statement making sure of conversion
assert ride_sharing['duration_time'].dtype == 'int'
# Print formed columns and calculate average ride duration
print[ride_sharing[['duration','duration_trim','duration_time']]]
print[ride_sharing['duration_time'].mean[]]
Data range constrains
Sometimes there might show up values that is out of the data range. For example, a future time included in the time point; or six stars in a five-star-system.
Ways to deal with it:
- Drop values using filtering:
movies = movies[movies['avg_rating'] 5].index, inplace = True]
- Assert results:
assert movies['avg_rating'].max[]