Guide to rank plots in Python

In the end, here is what I did with the help of friends:

#Importing all the necessary libraries
from collections import Counter
import matplotlib.pyplot as plt
import numpy as np
import string

#Opening/reading/editing file

#Python 3: input() replaces raw_input()
filename=input('Filename (e.g. yourfile.txt): ')
cond=input('What do you want to count? \n A) Words.\n B) Characters and Punctuation. \n Choice: ')
#'r' allows us to read the file; the with block closes it when we are done
with open(filename,'r') as f:
    #This reads the entire text and assigns it as one gigantic string
    text=f.read()
#We make the entire text lowercase to account for any words that have a capital letter due to sentence structure
text=text.lower()
if cond in ['A','a','A)','a)']:
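    #'--' (the dash) and "''" are two-character tokens, so a character-by-character filter would never match them; remove them with replace() first.
    text=text.replace('--',' ').replace("''",'')
    punctuation=['!', '#', '"', '%', '$', '&', ')', '(', '+', '*', ',', '/', '.', ';', ':', '=', '<', '?', '>', '@', '[', ']', '\\', '_', '^', '`', '{', '}', '|', '~']
    text="".join(l for l in text if l not in punctuation)
    #Hyphenated words are secure, since the text uses '--' as the dash and single hyphens are kept.
    #Splitting the text into separate words, thus creating a list of strings.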
    text=text.split()
    #We then use the Counter function to calculate the frequency of each word appearing in the text.
    count=Counter(text)
    #This is not enough, since count is still a dictionary-like Counter keyed by string. We use .most_common() to get a list in which each element holds a word and its frequency, sorted from most to least frequent.
    count=count.most_common()
    #Create arrays for the frequencies (y) and the ranks (x), fill y with our frequency values and plot it. Along with the experimental data, we take the averaged proportionality constant (K) and plot the curve y=K/x.
    y=np.arange(len(count))
    x=np.arange(1,len(count)+1)
    yn=["" for m in range(len(count))]
    #It is important that the ranks x run from 1 to len(count), since the value 'Rank' always starts from 1.
    for i in range(len(count)):
        y[i]=count[i][1]
        yn[i]=count[i][0]
    K,Ks=round(np.average(x*y),2),round(np.std(x*y),2)
    plt.plot(x,y,color='red',linewidth=3)
    plt.plot(x,K/x,color='green',linewidth=2)
    plt.xlabel('Rank')
    plt.ticklabel_format(style='sci', axis='x', scilimits=(0,0))
    plt.ticklabel_format(style='sci', axis='y', scilimits=(0,0))
    plt.plot(0,0,'o',alpha=0)
    plt.ylabel('Frequency')
    plt.grid(True)
    plt.title("Testing Zipf's Law: the relationship between the frequency and rank of a word in a text")
    plt.legend(['Experimental data', r'y=K/x, K=%s, $\delta_{K}$ = %s'%(K,Ks), 'Most used word=%s, least used=%s'%(count[0],count[-1])], loc='best',numpoints=1)
    plt.show()
elif cond in ['B','b','B)','b)']:
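    #Drop all whitespace characters; the two-argument translate(None, ...) form only works in Python 2.
    text="".join(l for l in text if l not in string.whitespace)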
    count=Counter(text)
    count=count.most_common()
    y=np.arange(len(count))
    x=np.arange(1,len(count)+1)
    yn=["" for m in range(len(count))]
    for i in range(len(count)):
        y[i]=count[i][1]
        yn[i]=count[i][0]
    K,Ks=round(np.average(x*y),2),round(np.std(x*y),2)
    plt.plot(x,y,color='red',linewidth=3)
    plt.plot(x,K/x,color='green',linewidth=2)
    plt.xlabel('Rank')
    plt.ticklabel_format(style='sci', axis='x', scilimits=(0,0))
    plt.ticklabel_format(style='sci', axis='y', scilimits=(0,0))
    plt.plot(0,0,'o',alpha=0)
    plt.ylabel('Frequency')
    plt.grid(True)
    plt.title("Testing Zipf's Law: the relationship between the frequency and rank of a character/punctuation mark in a text")
    plt.legend(['Experimental data', r'y=K/x, K=%s, $\delta_{K}$ = %s'%(K,Ks), 'Most used character=%s, least used=%s'%(count[0],count[-1])], loc='best',numpoints=1)
    plt.show()
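
Both branches above repeat the same rank and frequency steps, so they could be factored into one helper. What follows is only a minimal sketch, not the original script: the names rank_frequency and zipf_constant and the sample sentence are made up for illustration, and it sticks to the standard library so it can be checked without a text file or a plot.

```python
from collections import Counter
from statistics import mean, pstdev


def rank_frequency(tokens):
    """Return (ranks, freqs, labels) sorted from most to least frequent.

    `tokens` is any iterable of hashable items (words or characters).
    Hypothetical helper mirroring the Counter/most_common steps above.
    """
    pairs = Counter(tokens).most_common()
    labels = [item for item, _ in pairs]
    freqs = [n for _, n in pairs]
    ranks = list(range(1, len(pairs) + 1))  # rank starts at 1, not 0
    return ranks, freqs, labels


def zipf_constant(ranks, freqs):
    """Mean and population std of rank*frequency, i.e. the K in y=K/x."""
    products = [r * f for r, f in zip(ranks, freqs)]
    return round(mean(products), 2), round(pstdev(products), 2)


# Made-up sample text, already lowercased and free of punctuation.
sample = "the cat and the dog and the bird".split()
ranks, freqs, labels = rank_frequency(sample)
K, Ks = zipf_constant(ranks, freqs)
print(list(zip(ranks, labels, freqs)))
print(K, Ks)
```

The same two functions work for option B if the tokens are the individual characters of the text instead of its words.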

Counting is an essential task in most analysis projects. Taking counts and visualizing them as frequency plots (histograms) lets the analyst easily recognize patterns and relationships within the data. The good news is that this can be accomplished in Python with just one line of code!

import pandas as pd
%matplotlib inline

df = pd.read_csv('iris-data.csv') #toy dataset
df.head()
   sepal_length_cm  sepal_width_cm  petal_length_cm  petal_width_cm        class
0              5.1             3.5              1.4             0.2  Iris-setosa
1              4.9             3.0              1.4             0.2  Iris-setosa
2              4.7             3.2              1.3             0.2  Iris-setosa
3              4.6             3.1              1.5             0.2  Iris-setosa
4              5.0             3.6              1.4             0.2  Iris-setosa

df['class'].head()
0    Iris-setosa
1    Iris-setosa
2    Iris-setosa
3    Iris-setosa
4    Iris-setosa
Name: class, dtype: object

Frequency Plot for Categorical Data

df['class'].value_counts() #generate counts
Iris-virginica     50
Iris-setosa        49
Iris-versicolor    45
versicolor          5
Iris-setossa        1
Name: class, dtype: int64

Notice that the value_counts() function automatically returns the classes in descending order of count. Let's bring it to life with a frequency plot.

df['class'].value_counts().plot()


I think a bar graph would be more useful, visually.

df['class'].value_counts().plot(kind='bar') #vertical bar plot

df['class'].value_counts().plot(kind='barh') #horizontal bar plot

df['class'].value_counts().plot(kind='barh').invert_yaxis() #horizontal bar plot, largest count on top

There you have it, a ranked bar plot for categorical data in just one line of Python!
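
To try the ranked bar plot without the iris file, here is a small self-contained sketch. The label values are made up, and the Agg backend is an assumption so that it runs without a display; in a notebook you would skip that line. It passes kind by keyword, since, as far as I know, newer pandas releases no longer accept it positionally.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption: no display available)
import pandas as pd

# Made-up labels standing in for a categorical column (not the iris data).
labels = pd.Series(["b", "a", "b", "c", "b", "a"], name="class")

counts = labels.value_counts()   # sorted by count, largest first
ax = counts.plot(kind="barh")    # horizontal bars, one per category
ax.invert_yaxis()                # put the most frequent category at the top
ax.set_xlabel("Count")

print(counts.to_dict())
```

Because value_counts() already sorts by count, inverting the y axis is all it takes to read the bars as a ranking from top to bottom.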

Histograms for Numerical Data

You now know how to graph categorical data. Luckily, graphing numerical data is even easier with the hist() function.

df['sepal_length_cm'].hist() #histogram with the default number of bins

df['sepal_length_cm'].hist(bins = 30) #add granularity

df['sepal_length_cm'].hist(bins = 30, range=[4, 8]) #add granularity & range

df['sepal_length_cm'].hist(bins = 30, range=[4, 8], facecolor='gray') #add granularity & range & color

There you have it, a stylized histogram for numerical data in one compact line of Python.
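
To see what bins and range actually do to the counts, it can help to compute the histogram without drawing it. The sketch below uses numpy.histogram on made-up values; as far as I understand, hist() bins the data the same way, but treat that as an assumption rather than a statement about pandas internals.

```python
import numpy as np

# Made-up measurements standing in for a numeric column such as sepal length.
values = np.array([4.3, 4.9, 5.0, 5.1, 5.1, 5.8, 6.4, 7.9])

# 4 equal-width bins over the fixed range [4, 8] give edges at 4, 5, 6, 7, 8.
# Values outside the range would simply be left out of the counts.
counts, edges = np.histogram(values, bins=4, range=(4, 8))

print(edges)
print(counts)
```

Raising bins splits the same range into narrower intervals, which is the added granularity shown above.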