In the end, here is what I did, with the help of friends:
# Import all the necessary libraries
from collections import Counter
import matplotlib.pyplot as plt
import numpy as np
import string

# Open and read the file
filename = input('Filename (e.g. yourfile.txt): ')
cond = input('What do you want to count? \n A) Words.\n B) Characters and Punctuation. \n Choice: ')
# 'r' allows us to read the file
with open(filename, 'r') as f:
    # read() returns the entire text as one gigantic string
    text = f.read()
# Lowercase the entire text to account for words that are capitalized only because of sentence structure
text = text.lower()

if cond in ['A', 'a', 'A)', 'a)']:
    # '--' and "''" are two characters long, so the character filter below cannot catch them; handle them first
    text = text.replace('--', ' ').replace("''", '')
    punctuation = ['!', '#', '"', '%', '$', '&', ')', '(', '+', '*', ',', '/', '.', ';', ':', '=', '@', '[', ']', '\\', '_', '^', '`', '{', '}', '|', '~']
    text = "".join(l for l in text if l not in punctuation)
    # Hyphenated words are safe, since the text uses '--' as the dash and a single '-' is kept
    # Split the text into separate words, creating a big list of strings
    text = text.split()
    # Use Counter to calculate the frequency of each word appearing in the text
    count = Counter(text)
    # This is not enough, since count is a mapping keyed by specific strings.
    # most_common() returns a list of (word, frequency) pairs, sorted by decreasing frequency
    count = count.most_common()
    # Build the arrays to plot. Along with the experimental data, we take the
    # averaged proportionality constant (K) and plot the curve y = K/x
    # Ranks run from 1 to len(count), since the value 'Rank' always starts from 1
    x = np.arange(1, len(count) + 1)
    y = np.array([c[1] for c in count])
    yn = [c[0] for c in count]
    K, Ks = round(np.average(x * y), 2), round(np.std(x * y), 2)
    plt.plot(x, y, color='red', linewidth=3)
    plt.plot(x, K / x, color='green', linewidth=2)
    plt.xlabel('Rank')
    plt.ticklabel_format(style='sci', axis='x', scilimits=(0, 0))
    plt.ticklabel_format(style='sci', axis='y', scilimits=(0, 0))
    # Invisible point, so that the third legend entry has a handle to attach to
    plt.plot(0, 0, 'o', alpha=0)
    plt.ylabel('Frequency')
    plt.grid(True)
    plt.title("Testing Zipf's Law: the relationship between the frequency and rank of a word in a text")
    plt.legend(['Experimental data', r'y=K/x, K=%s, $\delta_{K}$ = %s' % (K, Ks), 'Most used word=%s, least used=%s' % (count[0], count[-1])], loc='best', numpoints=1)
    plt.show()
elif cond in ['B', 'b', 'B)', 'b)']:
    # Strip all whitespace so that only characters and punctuation are counted
    text = text.translate(str.maketrans('', '', string.whitespace))
    count = Counter(text)
    count = count.most_common()
    x = np.arange(1, len(count) + 1)
    y = np.array([c[1] for c in count])
    yn = [c[0] for c in count]
    K, Ks = round(np.average(x * y), 2), round(np.std(x * y), 2)
    plt.plot(x, y, color='red', linewidth=3)
    plt.plot(x, K / x, color='green', linewidth=2)
    plt.xlabel('Rank')
    plt.ticklabel_format(style='sci', axis='x', scilimits=(0, 0))
    plt.ticklabel_format(style='sci', axis='y', scilimits=(0, 0))
    plt.plot(0, 0, 'o', alpha=0)
    plt.ylabel('Frequency')
    plt.grid(True)
    plt.title("Testing Zipf's Law: the relationship between the frequency and rank of a character/punctuation mark in a text")
    plt.legend(['Experimental data', r'y=K/x, K=%s, $\delta_{K}$ = %s' % (K, Ks), 'Most used character=%s, least used=%s' % (count[0], count[-1])], loc='best', numpoints=1)
    plt.show()
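The core calculation in the script, stripped of file input and plotting, can be sketched roughly as follows. The function name `zipf_constant` and the sample sentence are illustrative assumptions, not part of the original script, and the sketch targets Python 3.

```python
from collections import Counter

import numpy as np


def zipf_constant(tokens):
    """Return ranks, frequencies, and the mean and spread of rank * frequency.

    Hypothetical helper: if Zipf's law holds, rank * frequency is roughly
    constant, so its average can serve as K in the curve y = K / x.
    """
    ranked = Counter(tokens).most_common()   # [(token, freq), ...], most frequent first
    freqs = np.array([freq for _, freq in ranked], dtype=float)
    ranks = np.arange(1, len(freqs) + 1)     # rank always starts from 1
    products = ranks * freqs
    return ranks, freqs, products.mean(), products.std()


# Made-up sample text, used only for illustration.
sample = "the cat and the dog and the bird".split()
ranks, freqs, K, K_std = zipf_constant(sample)
print(ranks, freqs, round(K, 2), round(K_std, 2))
```

A small spread relative to K suggests the counts follow y = K/x reasonably well; a large spread suggests they do not.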
Counting is an essential task in most analysis projects. Taking counts and visualizing them with frequency plots (histograms) lets the analyst easily recognize patterns and relationships within the data. The good news is that this can be accomplished in Python with just one line of code!
import pandas as pd
%matplotlib inline
df = pd.read_csv('iris-data.csv') #toy dataset
df.head()
5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |
0    Iris-setosa
1    Iris-setosa
2    Iris-setosa
3    Iris-setosa
4    Iris-setosa
Name: class, dtype: object
Frequency Plot for Categorical Data
df['class'].value_counts() #generate counts
Iris-virginica     50
Iris-setosa        49
Iris-versicolor    45
versicolor          5
Iris-setossa        1
Name: class, dtype: int64
Notice that the value_counts() function automatically returns the classes in descending order of count. Let's bring it to life with a frequency plot.
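Before plotting, the ordering can be checked on a hand-made Series. This is a minimal sketch; the labels are invented for illustration and are not the iris data.

```python
import pandas as pd

# Hypothetical labels, chosen only to show the ordering.
labels = pd.Series(["b", "a", "b", "c", "b", "a"])

counts = labels.value_counts()                 # sorted by count, largest first
shares = labels.value_counts(normalize=True)   # same order, as proportions

print(counts)
print(shares)
```

Passing `ascending=True` to `value_counts()` reverses the order if the smallest class should come first.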
df['class'].value_counts().plot()
I think a bar graph would be more useful, visually.
df['class'].value_counts().plot(kind='bar')
df['class'].value_counts().plot(kind='barh') #horizontal bar plot
df['class'].value_counts().plot(kind='barh').invert_yaxis() #horizontal bar plot, largest count on top
There you have it, a ranked bar plot for categorical data in just one line of code using Python!
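Outside a notebook, where `%matplotlib inline` is not available, the same ranked bar plot can be sketched as a standalone script. The Series contents below are assumptions standing in for `df['class']`, and the `Agg` backend is chosen only so that the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")   # non-interactive backend, so no display is required
import pandas as pd

# Hypothetical class labels standing in for df['class'].
classes = pd.Series(["setosa"] * 5 + ["virginica"] * 3 + ["versicolor"] * 2)

counts = classes.value_counts()
ax = counts.plot(kind="barh")   # horizontal bars, one per class
ax.invert_yaxis()               # put the largest count at the top
ax.set_xlabel("Count")

# Bar lengths, in the order the bars were drawn.
print([bar.get_width() for bar in ax.patches])
```

From here, `ax.figure.savefig(...)` writes the figure to a file of your choosing.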
Histograms for Numerical Data
You now know how to graph categorical data. Luckily, graphing numerical data is even easier using the hist() function.
df['sepal_length_cm'].hist() #histogram with default bins
df['sepal_length_cm'].hist(bins = 30) #add granularity
df['sepal_length_cm'].hist(bins = 30, range=(4, 8)) #add granularity & range
df['sepal_length_cm'].hist(bins = 30, range=(4, 8), facecolor='gray') #add granularity & range & color
There you have it, a stylized histogram for numerical data using Python in one compact line of code.
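To see what `bins` and `range` control, the counts behind a histogram can be computed directly with `np.histogram`. This is a minimal sketch; the measurements are invented and do not come from the iris data.

```python
import numpy as np

# Hypothetical measurements in centimetres.
values = np.array([4.3, 4.9, 5.0, 5.1, 5.8, 6.2, 6.4, 7.7])

# range=(4, 8) fixes the outer edges; bins=4 splits that span into
# four equal-width intervals.
counts, edges = np.histogram(values, bins=4, range=(4, 8))

print(edges)    # bin boundaries
print(counts)   # number of values falling in each bin
```

As far as I know, `hist()` passes `bins` and `range` through to matplotlib, which bins values the same way, so raising `bins` narrows each interval rather than widening the plotted span.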