Guide to Hierarchical Clustering with Python and Scikit-Learn
Introduction
After reading the guide, you will understand:
- When to apply hierarchical clustering and why it suits this problem
- How to explore the dataset and decide which features to cluster on
- How to encode categorical variables and engineer new features
- How to reduce dimensionality with PCA and decide whether it is needed
- How to build and read a dendrogram, and how linkage methods and distance metrics affect it
- How to implement agglomerative hierarchical clustering with Scikit-Learn
Note: You can download the notebook containing all of the code in this guide here.

Motivation
Imagine a scenario in which you are part of a data science team that interfaces with the marketing department. Marketing has been gathering customer shopping data for a while, and they want to understand, based on the collected data, whether there are similarities between customers. Those similarities divide customers into groups, and having customer groups helps in targeting campaigns, promotions, and conversions, and in building better customer relationships.
One way of answering those questions is by using a clustering algorithm, such as K-Means, DBSCAN, Hierarchical Clustering, etc. In general terms, clustering algorithms find similarities between data points and group them. In this case, our marketing data is fairly small: we have information on only 200 customers. Considering the marketing team, it is also important that we can clearly explain to them how the decisions about the number of clusters were made, and therefore how the algorithm actually works. Since our data is small and explainability is a major factor, we can leverage Hierarchical Clustering to solve this problem. This process is also known as Hierarchical Clustering Analysis (HCA).
Another thing to take into consideration in this scenario is that HCA is an unsupervised algorithm. When grouping data, we won't have a way to verify that we are correctly identifying that a user belongs to a specific group (we don't know the groups). There are no labels for us to compare our results to. Whether we identified the groups correctly will later be confirmed by the marketing department on a day-to-day basis (as measured by metrics such as ROI, conversion rates, etc.). Now that we have understood the problem we are trying to solve and how to solve it, we can start to take a look at our data!

Brief Exploratory Data Analysis
Note: You can download the dataset used in this guide here. After downloading the dataset, notice that it is a CSV (comma-separated values) file containing the shopping records.
Marketing said it had collected 200
customer records. We can check if the downloaded data is complete with 200 rows by loading it into a Pandas DataFrame and inspecting its shape attribute.
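A minimal sketch of that, assuming the file was saved as shopping_data.csv in the working directory (adjust the name to match your download):

```python
import pandas as pd

# Load the downloaded CSV into a DataFrame
customer_data = pd.read_csv("shopping_data.csv")

# shape returns a (rows, columns) tuple
customer_data.shape
```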
This results in:

(200, 5)
Great! Our data is complete with 200 rows (client records) and we also have 5 columns (features). To see what characteristics the marketing department has collected from customers, we can see column names with the columns attribute:
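For instance, reusing the customer_data DataFrame loaded above:

```python
# Names of the features collected by marketing
customer_data.columns
```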
The script above returns:
Here, we see that, along with basic customer attributes, marketing has generated a Spending Score for each customer. Let's take a quick look at the distribution of this score to inspect the spending habits of the users in our dataset. That's where the Pandas hist() method comes in:
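A quick sketch of that plot, assuming the score column is named Spending Score (1-100), as in the common version of this dataset:

```python
import matplotlib.pyplot as plt

# Histogram of the spending score distribution
customer_data['Spending Score (1-100)'].hist(bins=10)
plt.xlabel('Spending Score (1-100)')
plt.ylabel('Number of customers')
plt.show()
```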
By looking at the histogram, we see that more than 35 customers have scores in the most common range. To check the limits of the distribution, we can look at its minimum and maximum values. Those values can be easily found as part of the descriptive statistics, so we can use the describe() method:
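For example (transposing only to make the table easier to read):

```python
# Descriptive statistics (count, mean, std, min, quartiles, max)
# for all numerical columns
customer_data.describe().transpose()
```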
This will give us a table from which we can also read the distributions of the other values in our dataset:
Our hypothesis is confirmed by the minimum and maximum values in the table. To understand better how our data varies, let's also plot the distribution of the Annual Income column:
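A possible sketch, assuming the income column is named Annual Income (k$):

```python
import matplotlib.pyplot as plt

# Histogram of the annual income distribution
customer_data['Annual Income (k$)'].hist(bins=10)
plt.xlabel('Annual Income (k$)')
plt.ylabel('Number of customers')
plt.show()
```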
Which will give us:

Notice in the histogram that most of our data, more than 35 customers, is concentrated around the middle of the income range, with fewer customers towards the extremes.
So far, we have seen the shape of our data, some of its distributions, and descriptive statistics. With Pandas, we can also list our data types and see if all of our 200 rows are filled or have some null values, with the info() method:
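For example:

```python
# Column data types and non-null counts
customer_data.info()
```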
This results in:
Here, we can see that there are no null values in the data. Let's see how many unique values we have in the Genre column, with unique():
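Assuming the column is named Genre, as in the common version of this dataset:

```python
# Distinct values in the Genre column
customer_data['Genre'].unique()
```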
This results in:
It seems that our data has only two unique values in that column. This confirms our assumption: there are only two genres in the data.
So far, we know that we have only two genres. If we plan to use this feature in our model, it will have to be transformed into numbers. Let's also check the proportion between the two genres, to see if the data is balanced:
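One way to check the proportion is with value_counts() and its normalize parameter:

```python
# Share of each genre in the dataset
customer_data['Genre'].value_counts(normalize=True)
```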
This outputs:
We have 56% of women in the dataset and 44% of men. The difference between them is only 12%, and our data is not 50/50, but it is balanced enough not to cause any trouble. If the results were 70/30 or 60/40, then it might have been necessary either to collect more data or to employ some kind of data augmentation technique to make the ratio more balanced. Until now, all features but Age have been briefly explored.

Advice: In this guide, we present only a brief exploratory data analysis, but you can - and should - go further. You can see if there are income and scoring differences based on genre and age. This not only enriches the analysis but also leads to better model results. To go deeper into Exploratory Data Analysis, check out the EDA chapter in the "Hands-On House Price Prediction - Machine Learning in Python" Guided Project.

After conjecturing on what could be done with both the categorical - and categorical-to-be - columns, let's put those ideas into practice.

Encoding Variables and Feature Engineering
Let's start by dividing the Age column into ranges, so that customers of similar ages end up in the same group. To group or bin the values, we can use the Pandas cut() function:
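A sketch of that binning with cut() - the exact interval edges here are an assumption, chosen so that the bins roughly follow decades of age:

```python
# Bin ages into 6 intervals (roughly one per decade)
intervals = [15, 20, 30, 40, 50, 60, 70]
customer_data['Age Groups'] = pd.cut(x=customer_data['Age'], bins=intervals)

# Inspect the new categorical column
customer_data['Age Groups']
```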
This results in:
Notice that when looking at the column values, there is also a line that specifies we have 6 categories and displays all the binned data intervals. This way, we have categorized our previously numerical data and created a new Age Groups feature. And how many customers do we have in each category? We can quickly find out by grouping the column and counting the values with groupby() and count():
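For example:

```python
# Number of customers in each age group
customer_data.groupby('Age Groups')['Age Groups'].count()
```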
This results in:
It is easy to spot that most customers are between 30 and 40 years of age, followed by customers between 20 and 30, and then customers between 40 and 50. This is also good information for the marketing department. At the moment, we have two categorical variables, Genre and Age Groups, which need to be transformed into numbers before they can be fed to a model. There are different ways of making that transformation; here, we will use one-hot encoding:
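One way to do the one-hot encoding is with Pandas' get_dummies(), which replaces each categorical column with one binary column per category. Dropping the customer ID first (the column name is an assumption) keeps only informative features:

```python
# One-hot encode Genre and Age Groups; numerical columns are kept as they are
customer_data_oh = pd.get_dummies(customer_data.drop('CustomerID', axis=1))
customer_data_oh.head()
```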
This will give us a preview of the resulting table:

With the output, it is easy to see that the Genre column was split into two binary columns, one per genre. Also, the Age Groups column was split into six binary columns, one for each age interval. The advantage of one-hot encoding is the simplicity in representing the column values - it is straightforward to understand what is happening - while the disadvantage is that we have now created 8 additional columns on top of the columns we already had.

Warning: If you have a dataset in which the number of one-hot encoded columns exceeds the number of rows, it is best to employ another encoding method to avoid data dimensionality issues. One-hot encoding also adds 0s to our data, making it more sparse, which can be a problem for some algorithms that are sensitive to data sparsity.

For our clustering needs, one-hot encoding seems to work. But we can plot the data to see if there really are distinct groups for us to cluster.

Basic Plotting and Dimensionality Reduction
Our dataset now has 11 columns, and there are some ways in which we can visualize that data. The first one is by plotting it in 10 dimensions (good luck with that). The second is by plotting pairs of our initial features, and the third is by reducing the data dimensions and then plotting the result.

Plotting Each Pair of Data
Since plotting 10 dimensions is a bit impossible, we'll opt to go with the second approach - we'll plot our initial features. We can choose two of them for our clustering analysis. One way we can see all of our data pairs combined is with a Seaborn pairplot():
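A minimal sketch of that pairplot, again assuming the CustomerID column name:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Scatterplots (and histograms) for every pair of numerical features
sns.pairplot(customer_data.drop('CustomerID', axis=1))
plt.show()
```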
Which displays:

At a glance, we can spot the scatterplots that seem to show groups of data. One that seems interesting is the scatterplot combining Annual Income and Spending Score. Note that this pair appears twice in the pairplot, mirrored above and below the diagonal, and both scatterplots show the same structure. Let's take a closer look at that pair by plotting it on its own:
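A sketch of that single scatterplot, keeping the column-name assumptions from before:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Annual income on the x-axis, spending score on the y-axis
sns.scatterplot(x='Annual Income (k$)',
                y='Spending Score (1-100)',
                data=customer_data)
plt.show()
```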
By looking closer, we can definitely distinguish 5 different groups of data. It seems our customers can be clustered based on how much they make in a year and how much they spend. This is another relevant point in our analysis: it is important that we are only taking two features into consideration to group our clients. Any other information we have about them is not entering the equation. This gives the analysis meaning - if we know how much a client earns and spends, we can easily find the similarities we need.

That's great! So far, we already have two variables to build our model with. Besides what this represents, it also makes the model simpler, more parsimonious, and more explainable.

Note: Data Science usually favors approaches that are as simple as possible. Not only because they are easier to explain to the business, but also because they are more direct - with 2 features and an explainable model, it is clear what the model is doing and how it is working.

Plotting Data After Using PCA
It seems our second approach is probably the best, but let's also take a look at our third approach. It can be useful when we can't plot the data because it has too many dimensions, or when there are no data concentrations or clear separation into groups. When those situations occur, it's recommended to try reducing data dimensions with a method called Principal Component Analysis (PCA).

Note: Most people use PCA for dimensionality reduction before visualization. There are other methods that help in data visualization prior to clustering, such as Density-Based Spatial Clustering of Applications with Noise (DBSCAN) and Self-Organizing Maps (SOM) clustering. Both are clustering algorithms, but can also be used for data visualization. Since clustering analysis has no gold standard, it is important to compare different visualizations and different algorithms.

PCA will reduce the dimensions of our data while trying to preserve as much of its information as possible. Let's first get an idea about how PCA works, and then we can choose how many data dimensions we will reduce our data to.

For each pair of features, PCA sees if the greater values of one variable correspond with the greater values of the other variable, and it does the same for the lesser values. So, it essentially computes how much the feature values vary towards one another - we call that their covariance. Those results are then organized into a matrix, obtaining a covariance matrix.

After getting the covariance matrix, PCA tries to find a linear combination of features that best explains it - it fits linear models until it identifies the one that explains the maximum amount of variance.

Note: PCA is a linear transformation, and linearity is sensitive to the scale of data. Therefore, PCA works best when all data values are on the same scale. This can be done by subtracting the column mean from its values and dividing the result by its standard deviation. That is called data standardization. Prior to using PCA, make sure the data is scaled! If you're not sure how, read our "Feature Scaling Data with Scikit-Learn for Machine Learning in Python"!

With the best line (linear combination) found, PCA gets the directions of its axes, called eigenvectors, and its linear coefficients, the eigenvalues.
The combination of the eigenvectors and eigenvalues - or axes directions and coefficients - are the Principal Components of PCA. And that is when we can choose our number of dimensions based on the explained variance of each feature, by understanding which principal components we want to keep or discard based on how much variance they explain.

After obtaining the principal components, PCA uses the eigenvectors to form a vector of features that reorient the data from the original axes to the ones represented by the principal components - that's how the data dimensions are reduced.

Note: One important detail to take into consideration here is that, due to its linear nature, PCA will concentrate most of the explained variance in the first principal components. So, when looking at the explained variance, usually our first two components will suffice. But that might be misleading in some cases - so keep comparing different plots and algorithms when clustering to see if they hold similar results.

Before applying PCA, we need to choose between keeping the original Age column or the one-hot encoded Age Groups columns, since both represent the same information. Here, we drop the Age column and keep the Age Groups dummies.
Now our data has 10 columns, which means we can obtain one principal component per column and choose how many of them to use by measuring how much each new dimension adds to the explained variance. Let's do that with Scikit-Learn's PCA:
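A sketch of that step, assuming we reuse the one-hot encoded DataFrame from before and drop the Age column:

```python
from sklearn.decomposition import PCA

# Data used for PCA: 10 columns after dropping Age
pca_data = customer_data_oh.drop('Age', axis=1)

# One principal component per column
pca = PCA(n_components=10)
pca.fit_transform(pca_data)

# Cumulative share of variance explained as components are added
pca.explained_variance_ratio_.cumsum()
```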
Our cumulative explained variances are:
We can see that the first dimension explains 50% of the data, and when combined with the second dimension, they explain 99%. This means that the first 2 dimensions already explain 99% of our data, so we can apply a PCA with 2 components, obtain our principal components, and plot them:
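For example:

```python
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Keep only the first two principal components
pca = PCA(n_components=2)
pcs = pca.fit_transform(pca_data)

plt.scatter(pcs[:, 0], pcs[:, 1])
plt.xlabel('First principal component')
plt.ylabel('Second principal component')
plt.show()
```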
The data plot after PCA is very similar to the plot that uses only two columns of the data without PCA. Notice that the points that are forming groups are closer, and a little more concentrated, after the PCA than before.

Visualizing Hierarchical Structure with Dendrograms
So far, we have explored the data, one-hot encoded categorical columns, decided which columns were fit for clustering, and reduced data dimensionality. The plots indicate we have 5 clusters in our data, but there's also another way to visualize the relationships between our points and help determine the number of clusters - by creating a dendrogram (commonly misspelled as dendogram). Dendro means tree in Greek.

The dendrogram is the result of linking the points in a dataset. It is a visual representation of the hierarchical clustering process. And how does the hierarchical clustering process work? Well... it depends - probably an answer you've already heard a lot in Data Science.

Understanding Hierarchical Clustering
When the Hierarchical Clustering Algorithm (HCA) starts to link the points and find clusters, it can first split the points into 2 large groups, and then split each of those groups into 2 smaller groups, having 4 groups in total - that is the divisive, top-down approach. Alternatively, it can do the opposite - it can look at all the data points, find the 2 points that are closest to each other, link them, and then find other points that are the closest ones to those linked points and keep building groups from the bottom up. That is the agglomerative approach, which is the one we will develop.

Steps to Perform Agglomerative Hierarchical Clustering
To make the agglomerative approach even clearer, these are the steps of the Agglomerative Hierarchical Clustering (AHC) algorithm:
1. At the start, treat each data point as one cluster, so the number of clusters equals the number of data points.
2. Join the two closest data points, forming one cluster and reducing the total number of clusters by one.
3. Join the two closest clusters, again reducing the total number of clusters by one.
4. Repeat the previous step until only one cluster, containing all the data points, remains.
Note: For simplification, we are saying "two closest" data points in steps 2 and 3. But there are more ways of linking points as we will see in a bit.
Notice that HCAs can be either divisive and top-down, or agglomerative and bottom-up. The top-down DHC approach works best when you have fewer, but larger clusters, hence it's more computationally expensive. On the other hand, the bottom-up AHC approach is suited for when you have many smaller clusters. It is computationally simpler, more used, and more available.

Note: Either top-down or bottom-up, the dendrogram representation of the clustering process will always start with a division in two and end up with each individual point discriminated, since its underlying structure is a binary tree.

Let's plot our customer data dendrogram to visualize the hierarchical relationships of the data. This time, we will use the scipy library to create the dendrogram for our dataset:
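A sketch of that dendrogram with scipy.cluster.hierarchy, using the two columns selected earlier (column names assumed as before):

```python
import scipy.cluster.hierarchy as shc
import matplotlib.pyplot as plt

# The two features chosen for clustering
selected_data = customer_data[['Annual Income (k$)', 'Spending Score (1-100)']]

plt.figure(figsize=(10, 7))
plt.title("Customers Dendrogram")

# Link points with Ward's method and Euclidean distances, then draw the tree
clusters = shc.linkage(selected_data, method='ward', metric='euclidean')
shc.dendrogram(Z=clusters)
plt.show()
```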
The output of the script looks like this:

In the script above, we've generated the clusters and subclusters with our points, defined how our points would link (by applying the ward method), and how to measure the distance between them (by using the euclidean metric).

With the plot of the dendrogram, the described processes of DHC and AHC can be visualized. To visualize the top-down approach, start from the top of the dendrogram and go down; to visualize the bottom-up approach, do the opposite, starting at the bottom and moving upwards.

Linkage Methods
There are many other linkage methods; by understanding more about how they work, you will be able to choose the appropriate one for your needs. Besides that, each of them will yield different results when applied. There is no fixed rule in clustering analysis - if possible, study the nature of the problem to see which fits it best, test different methods, and inspect the results. Some of the linkage methods are:
- Single linkage (minimum): the distance between two clusters is the shortest distance between any two of their points.
- Complete linkage (maximum): the distance between two clusters is the longest distance between any two of their points.
- Average linkage: the distance between two clusters is the average of all pairwise distances between their points.
- Ward linkage: merges the pair of clusters that results in the smallest increase in the total within-cluster variance (sum of squared errors).
Distance Metrics
Besides the linkage, we can also specify some of the most used distance metrics, such as Euclidean, Manhattan (city block), and cosine distance. The most common of them, the Euclidean distance between two points p and q in n dimensions, is defined as:
$$
d(p, q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2}
$$
We have chosen Ward and Euclidean for the dendrogram because they are the most commonly used method and metric. They usually give good results since Ward links points based on minimizing the errors, and Euclidean works well in lower dimensions. In this example, we are working with two features (columns) of the marketing data, and 200 observations or rows. Since the number of observations is larger than the number of features (200 > 2), we are working in a low-dimensional space.
If we were to include more attributes, so that we had more than 200 features, the Euclidean distance might not work very well, since it would have difficulty measuring all the small distances in a very large space that only gets larger. In other words, the Euclidean distance approach has difficulties working with data sparsity. This issue is called the curse of dimensionality. The distance values would get so small, as if they became "diluted" in the larger space, distorted until they became 0.

Note: If you ever encounter a dataset with many more features than observations (f >> p), you will probably want to use other distance metrics, such as the Mahalanobis distance. Alternatively, you can also reduce the dataset dimensions by using Principal Component Analysis (PCA). This problem is frequent especially when clustering biological sequencing data.

We've already discussed metrics, linkages, and how each one of them can impact our results. Let's now continue the dendrogram analysis and see how it can give us an indication of the number of clusters in our dataset.

Finding an interesting number of clusters in a dendrogram is the same as finding the largest vertical distance that isn't crossed by any horizontal line - in other words, the region with the longest vertical lines. That is where there is the most separation between the clusters. We can draw a horizontal line that passes through that longest distance:
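A sketch of the same dendrogram with a cut line added; the height of 125 is an assumption for illustration - read the actual value off your own dendrogram:

```python
plt.figure(figsize=(10, 7))
plt.title("Customers Dendrogram with Cut Line")
shc.dendrogram(Z=clusters)

# Horizontal line crossing the longest vertical distances
plt.axhline(y=125, color='red', linestyle='--')
plt.show()
```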
After locating the horizontal line, we count how many times our vertical lines were crossed by it - in this example, 5 times. So 5 seems a good indication of the number of clusters that have the most distance between them.

Note: The dendrogram should be considered only as a reference when used to choose the number of clusters. It can easily get that number way off, and it is completely influenced by the type of linkage and distance metric. When conducting an in-depth cluster analysis, it is advised to look at dendrograms with different linkages and metrics, and to look at the results generated with the first three lines in which the clusters have the most distance between them.

Implementing an Agglomerative Hierarchical Clustering
Using Original Data
So far, we've calculated the suggested number of clusters for our dataset, which corroborates our initial analysis and our PCA analysis. Now we can create our agglomerative hierarchical clustering model using Scikit-Learn's AgglomerativeClustering class:
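A minimal sketch of that model, fitted on the same two columns used for the dendrogram:

```python
from sklearn.cluster import AgglomerativeClustering

# 5 clusters, Ward linkage (which uses Euclidean distances by default)
clustering_model = AgglomerativeClustering(n_clusters=5, linkage='ward')
clustering_model.fit(selected_data)

# One label (0 to 4) per customer
clustering_model.labels_
```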
This results in:
We have investigated a lot to get to this point. And what do these labels mean? Here, we have each point of our data labeled as a group from 0 to 4:
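A sketch of how the labeled points can be plotted:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Color each customer by the cluster label found above
sns.scatterplot(x='Annual Income (k$)',
                y='Spending Score (1-100)',
                data=selected_data,
                hue=clustering_model.labels_,
                palette='tab10')
plt.show()
```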
This is our final clustered data. You can see the color-coded data points in the form of five clusters. The data points in the bottom right belong to customers with high annual incomes but low spending scores. Similarly, the customers at the top right have high incomes and high spending scores. The customers in the middle have average incomes and average spending scores. The customers in the bottom left have low incomes and low spending scores, and finally, the customers in the upper left have low incomes but high spending scores.

Using the Result from PCA
If we were in a different scenario, in which we had to reduce the dimensionality of the data, we could also easily plot the clustered PCA results. That can be done by creating another agglomerative clustering model and obtaining a data label for each principal component:
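A sketch of that, reusing the two principal components (pcs) obtained earlier:

```python
from sklearn.cluster import AgglomerativeClustering
import matplotlib.pyplot as plt

# Cluster the PCA-reduced data into the same number of groups
clustering_model_pca = AgglomerativeClustering(n_clusters=5, linkage='ward')
clustering_model_pca.fit(pcs)

plt.scatter(pcs[:, 0], pcs[:, 1], c=clustering_model_pca.labels_, cmap='tab10')
plt.xlabel('First principal component')
plt.ylabel('Second principal component')
plt.show()
```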
Observe that both results are very similar. The main difference is that the first result, with the original data, is much easier to explain. It is clear to see that customers can be divided into five groups by their annual income and spending score. In the PCA approach, on the other hand, we are taking all of our features into consideration; even though we can look at the variance explained by each of them, this is a harder concept to grasp, especially when reporting to a marketing department.
If you have a very large and complex dataset in which you must perform a dimensionality reduction prior to clustering, try to analyze the linear relationships between each of the features and their residuals to back up the use of PCA and enhance the explainability of the process. By making a linear model per pair of features, you will be able to understand how the features interact.

If the data volume is so large that it becomes impossible to plot pairs of features, select a sample of your data, as balanced and as close to the normal distribution as possible, and perform the analysis on the sample first - understand it, fine-tune it - and apply it later to the whole dataset. You can always choose different clustering visualization techniques according to the nature of your data (linear, non-linear) and combine or test all of them if necessary.

Going Further - Hand-Held End-to-End Project
Your inquisitive nature makes you want to go further? We recommend checking out our Guided Project: "Hands-On House Price Prediction - Machine Learning in Python".
Using Keras, the deep learning API built on top of TensorFlow, we'll experiment with architectures, build an ensemble of stacked models and train a meta-learner neural network (level-1 model) to figure out the pricing of a house. Deep learning is amazing - but before resorting to it, it's advised to also attempt solving the problem with simpler techniques, such as shallow learning algorithms. Our baseline performance will be based on a Random Forest Regression algorithm. Additionally, we'll explore creating ensembles of models through Scikit-Learn via techniques such as bagging and voting.

This is an end-to-end project, and like all Machine Learning projects, we'll start out with Exploratory Data Analysis, followed by Data Preprocessing and finally Building Shallow and Deep Learning Models to fit the data we've explored and cleaned previously.

Conclusion
Clustering techniques can be very handy when it comes to unlabeled data. Since most real-world data is unlabeled and annotating it has high costs, clustering techniques can be used to label unlabeled data. In this guide, we worked through a real data science problem, since clustering techniques are largely used in marketing analysis (and also in biological analysis). We have also explained many of the investigation steps needed to get to a good hierarchical clustering model, shown how to read dendrograms, and questioned whether PCA is a necessary step. Our main objective was to cover some of the pitfalls and the different scenarios in which hierarchical clustering can be useful. Happy clustering!