Python rank plot tutorial

In the end, here is what I did with the help of friends:

# Import the necessary libraries
from collections import Counter
import matplotlib.pyplot as plt
import numpy as np
import string

# Open and read the file

filename = input('Filename (e.g. yourfile.txt): ')
cond = input('What do you want to count? \n A) Words.\n B) Characters and Punctuation. \n Choice: ')
# 'r' opens the file for reading; read() returns the entire text as one large string
with open(filename, 'r') as file:
    text = file.read()
# Lowercase everything so that words capitalized only because of sentence structure are counted together
text = text.lower()
if cond in ['A', 'a', 'A)', 'a)']:
    # Single characters to strip. '<', '?' and '>' are assumed: the list as posted had an empty string in their place.
    punctuation = set('!#"%$&)(+*,/.;:=<?>@[]\\_^`{}|~')
    # Hyphenated words are safe, since the text uses '--' as the dash. The two-character marks are replaced first,
    # because the character-by-character filter below can never match them.
    text = text.replace('--', ' ').replace("''", ' ')
    text = "".join(ch for ch in text if ch not in punctuation)
    # Split the text into separate words, giving a list of strings.
    text = text.split()
    # Counter tallies how often each word appears in the text.
    count = Counter(text)
    # A Counter maps each word to its count. most_common() turns it into a list of
    # (word, frequency) pairs sorted from most to least frequent.
    count = count.most_common()
    # Build the arrays to plot. Along with the experimental data, we plot the curve y=K/x,
    # where K is the averaged proportionality constant.
    # 'Rank' always starts from 1, so x runs from 1 to len(count).
    x = np.arange(1, len(count) + 1)
    y = np.array([freq for word, freq in count])
    yn = [word for word, freq in count]
    K, Ks = round(float(np.average(x * y)), 2), round(float(np.std(x * y)), 2)
    plt.plot(x, y, color='red', linewidth=3)
    plt.plot(x, K / x, color='green', linewidth=2)
    plt.xlabel('Rank')
    plt.ticklabel_format(style='sci', axis='x', scilimits=(0, 0))
    plt.ticklabel_format(style='sci', axis='y', scilimits=(0, 0))
    # Invisible point so that the legend gets a third entry
    plt.plot(0, 0, 'o', alpha=0)
    plt.ylabel('Frequency')
    plt.grid(True)
    plt.title("Testing Zipf's Law: the relationship between the frequency and rank of a word in a text")
    plt.legend(['Experimental data', r'y=K/x, K=%s, $\delta_{K}$ = %s' % (K, Ks), 'Most used word=%s, least used=%s' % (count[0], count[-1])], loc='best', numpoints=1)
    plt.show()
elif cond in ['B', 'b', 'B)', 'b)']:
    # Remove all whitespace so that only characters and punctuation are counted
    text = text.translate(str.maketrans('', '', string.whitespace))
    count = Counter(text)
    count = count.most_common()
    x = np.arange(1, len(count) + 1)
    y = np.array([freq for char, freq in count])
    yn = [char for char, freq in count]
    K, Ks = round(float(np.average(x * y)), 2), round(float(np.std(x * y)), 2)
    plt.plot(x, y, color='red', linewidth=3)
    plt.plot(x, K / x, color='green', linewidth=2)
    plt.xlabel('Rank')
    plt.ticklabel_format(style='sci', axis='x', scilimits=(0, 0))
    plt.ticklabel_format(style='sci', axis='y', scilimits=(0, 0))
    # Invisible point so that the legend gets a third entry
    plt.plot(0, 0, 'o', alpha=0)
    plt.ylabel('Frequency')
    plt.grid(True)
    plt.title("Testing Zipf's Law: the relationship between the frequency and rank of a character or punctuation mark in a text")
    plt.legend(['Experimental data', r'y=K/x, K=%s, $\delta_{K}$ = %s' % (K, Ks), 'Most used character=%s, least used=%s' % (count[0], count[-1])], loc='best', numpoints=1)
    plt.show()
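
Both branches above repeat the same rank/frequency bookkeeping. As a rough sketch (not part of the original answer), that step could be pulled into one helper. The function name rank_frequency and the sample sentence are made up for illustration, and statistics.mean/pstdev stand in for np.average/np.std so the snippet needs only the standard library.

```python
from collections import Counter
from statistics import mean, pstdev


def rank_frequency(tokens):
    """Hypothetical helper: rank tokens by frequency and estimate K in y = K/x."""
    ranked = Counter(tokens).most_common()       # [(token, count), ...], largest count first
    labels = [token for token, _ in ranked]
    freqs = [count for _, count in ranked]
    ranks = list(range(1, len(ranked) + 1))      # rank always starts at 1
    products = [r * f for r, f in zip(ranks, freqs)]
    k = round(mean(products), 2)                 # averaged proportionality constant
    k_spread = round(pstdev(products), 2)        # population standard deviation, like np.std
    return ranks, freqs, labels, k, k_spread


# Made-up sample text, only to show the shape of the result
words = "the cat and the dog and the bird".split()
ranks, freqs, labels, k, k_spread = rank_frequency(words)
print(labels[0], freqs[0], k, k_spread)
```

The returned ranks and freqs can be handed to plt.plot exactly as x and y are in the script, so the word branch and the character branch would differ only in how they tokenize the text.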

Counting is an essential task in most analysis projects. Taking counts and visualizing them with frequency plots (histograms) lets the analyst easily recognize patterns and relationships within the data. The good news is that this can be done in Python with just one line of code!

import pandas as pd
%matplotlib inline

df = pd.read_csv('iris-data.csv') #toy dataset
df.head()
   sepal_length_cm  sepal_width_cm  petal_length_cm  petal_width_cm        class
0              5.1             3.5              1.4             0.2  Iris-setosa
1              4.9             3.0              1.4             0.2  Iris-setosa
2              4.7             3.2              1.3             0.2  Iris-setosa
3              4.6             3.1              1.5             0.2  Iris-setosa
4              5.0             3.6              1.4             0.2  Iris-setosa

df['class'].head() #first five class labels
0    Iris-setosa
1    Iris-setosa
2    Iris-setosa
3    Iris-setosa
4    Iris-setosa
Name: class, dtype: object

Frequency Plot for Categorical Data

df['class'].value_counts() #generate counts
Iris-virginica     50
Iris-setosa        49
Iris-versicolor    45
versicolor          5
Iris-setossa        1
Name: class, dtype: int64

Notice that value_counts() automatically returns the classes in descending order of count. Let's bring it to life with a frequency plot.

df['class'].value_counts().plot()
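
The descending order noted above is only the default. Here is a minimal sketch of the ordering options, run on a few made-up labels rather than the iris file, so the numbers are purely illustrative.

```python
import pandas as pd

# Hypothetical labels for illustration; not the iris data used in this article
labels = pd.Series(["setosa", "virginica", "setosa", "versicolor", "setosa", "virginica"])

counts = labels.value_counts()                        # counts, largest first (the default)
smallest_first = labels.value_counts(ascending=True)  # same counts, smallest first
shares = labels.value_counts(normalize=True)          # relative frequencies instead of counts

print(counts)
print(shares)
```

Because the result is already ranked, plotting it directly gives a rank-ordered chart without a separate sorting step.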

I think a bar graph would be more useful, visually.

df['class'].value_counts().plot(kind='bar') #vertical bar plot

df['class'].value_counts().plot(kind='barh') #horizontal bar plot

df['class'].value_counts().plot(kind='barh').invert_yaxis() #horizontal bar plot, largest count on top

There you have it: a ranked bar plot for categorical data in just one line of Python!
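
If the one-liner feels too dense, the same ranked horizontal bar plot can be built in steps. This is only a sketch: the counts are made up, and the Agg backend is chosen here as an assumption so the code also runs outside a notebook, without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption: no display available)
import pandas as pd

# Hypothetical counts, already sorted from largest to smallest
counts = pd.Series({"setosa": 3, "virginica": 2, "versicolor": 1})

ax = counts.plot(kind="barh")   # one horizontal bar per category
ax.invert_yaxis()               # put the first (largest) category at the top
ax.set_xlabel("Count")
ax.set_title("Ranked class counts")
```

Keeping the Axes object in ax makes it easy to add labels or to save the figure afterwards with ax.figure.savefig.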

Histograms for Numerical Data

Now you know how to graph categorical data. Luckily, graphing numerical data is even easier with the hist() function.

df['sepal_length_cm'].hist() #histogram with the default 10 bins

df['sepal_length_cm'].hist(bins=30) #add granularity

df['sepal_length_cm'].hist(bins=30, range=(4, 8)) #add granularity & range

df['sepal_length_cm'].hist(bins=30, range=(4, 8), facecolor='gray') #add granularity & range & color

There you have it: a stylized histogram for numerical data in one compact line of Python.
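
To see what bins and range do to the numbers behind the picture, here is a minimal sketch using numpy.histogram on a handful of made-up values (not the iris measurements). To my understanding the plotting call bins data in the same way, but treat that as an assumption rather than a guarantee.

```python
import numpy as np

# Hypothetical measurements in cm, for illustration only
values = np.array([4.3, 4.9, 5.0, 5.1, 5.8, 6.4, 7.1, 7.9])

# bins sets how many equal-width bins to use; range fixes the outer edges
counts, edges = np.histogram(values, bins=4, range=(4, 8))

for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.1f} to {right:.1f}: {n}")
```

More bins over the same range give narrower bars, and a value outside the chosen range is simply left out of the counts.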
