Character Analysis
Introduction
This write-up is split into sections: the character histogram, the heatmap of character transitions, and a note on the heatmap's robust parameter. All Python code is included in Appendix A, and the raw output data is in Appendix B.
Plot of Histogram
Histogram Observations
Here are a few observations about the histogram above:
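- $space\ character$: The space character is by far the most frequent, with 1370 occurrences against 958 for $e$, the most frequent letter (see Appendix B). This makes sense since a space follows almost every word.
- $\{j, z, x, k\}$: Characters such as $j$, $z$, $x$, and $k$ are low frequency, each appearing fewer than 25 times, since they are rare in common words.
- $vowels$: The vowels $e$, $i$, $a$, and $o$ are among the most frequent letters, as expected given how English is structured. The exception is $u$ (206), which falls below consonants such as $t$ (653), $n$ (492), and $s$ (455).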
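The counts behind these observations can be reproduced in a few lines. The following is a minimal sketch rather than the code used for the figures (that code is in Appendix A). It assumes the text has already been reduced to lowercase letters and spaces, and `char_frequencies` and `sample` are hypothetical names introduced here for illustration.

```python
from collections import Counter


def char_frequencies(text):
    """Return (character, count, share) tuples, most frequent first."""
    counts = Counter(text)
    total = sum(counts.values())
    return [(ch, n, n / total) for ch, n in counts.most_common()]


# Hypothetical stand-in for the analysed text file
sample = "a cat sat on the mat"
table = char_frequencies(sample)

for ch, n, share in table:
    # repr() makes the space character visible in the printout
    print(repr(ch), n, f"{share:.1%}")
```

Reporting the share next to the raw count makes it easier to compare texts of different lengths.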
Heatmap of Character Transitions
The heatmap below shows the frequency of each transition $c_i \rightarrow c_{i+1}$, where $c_i$ is the $i^{th}$ character in the supplied text file.
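To make the definition of a transition concrete, here is a small sketch that counts every adjacent pair $(c_i, c_{i+1})$ in a string. It is an illustration under stated assumptions, not the code from Appendix A: `transition_counts` and `sample` are hypothetical names, and pairs are stored as tuples rather than two-character strings.

```python
from collections import Counter


def transition_counts(text):
    """Count each adjacent pair (c_i, c_{i+1}) in text."""
    # zip(text, text[1:]) pairs every character with its successor,
    # so a text of n characters yields n - 1 transitions.
    return Counter(zip(text, text[1:]))


# Hypothetical sample text
sample = "the then"
counts = transition_counts(sample)

for (first, second), n in counts.most_common():
    print(repr(first), "->", repr(second), n)
```

Each cell of the heatmap corresponds to one key of such a table: the first character selects the row and the second selects the column.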
Heatmap Observations
Here are a few observations about the heatmap above:
- $Common\ Occurrences$: Some common letter-to-letter transitions (counts from Appendix B): $t \rightarrow h$ (212), $i \rightarrow n$ (128), $n \rightarrow t$ (88), $r \rightarrow e$ (113), $t \rightarrow i$ (92)
- $Spaces$: As expected, the row and column of the $space$ are quite active, since almost every word is preceded and followed by a $space$
- $Bare$: It's interesting but not unexpected that the bottom-right corner is quite bare: transitions between letters late in the alphabet have very low frequencies
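The observations above amount to ranking the transition counts. The sketch below does this for a small subset of the values listed in Appendix B. The helper `top_transitions` and its `skip_space` flag are hypothetical, added here only to separate letter-to-letter transitions from those involving a space.

```python
# Subset of the transition counts from Appendix B; keys are (first, second)
transitions = {
    ("e", " "): 339, (" ", "t"): 252, ("t", "h"): 212, ("h", "e"): 210,
    ("s", " "): 199, ("i", "n"): 128, ("e", "r"): 114, ("r", "e"): 113,
}


def top_transitions(counts, n, skip_space=False):
    """Return the n most frequent (pair, count) items, highest first."""
    items = [
        (pair, c) for pair, c in counts.items()
        if not (skip_space and " " in pair)
    ]
    return sorted(items, key=lambda item: item[1], reverse=True)[:n]


print(top_transitions(transitions, 2))
print(top_transitions(transitions, 3, skip_space=True))
```

In this subset the two largest counts both involve the space, which matches the active space row and column in the heatmap.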
Robustness Parameter
The heatmap above uses seaborn's robust=True parameter. With this setting, and with no explicit vmin or vmax, the colormap range is computed from robust quantiles of the data rather than from the extreme values, so a few very large counts (such as the transitions involving the $space$) do not wash out the rest of the map. This is an improvement over the heatmap drawn with the full range of the original frequencies. See below for the difference between the $RAW$ heatmap and the $ROBUST$ heatmap. More visual information can be obtained by using the $robust$ parameter since the 'interesting' events are much more pronounced.
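The effect of a robust color range can be sketched with the standard library alone. This is an approximation of the idea, not seaborn's own code: the 2nd and 98th percentile cut-offs are an assumption about its implementation, and `color_limits` and `values` are hypothetical names.

```python
from statistics import quantiles


def color_limits(values, robust=False):
    """Return (vmin, vmax) for a color scale over values."""
    if not robust:
        # Raw range: a single extreme value stretches the whole scale
        return min(values), max(values)
    # 99 cut points at percentiles 1..99, linearly interpolated
    cuts = quantiles(values, n=100, method="inclusive")
    return cuts[1], cuts[97]  # assumed 2nd and 98th percentiles


# Hypothetical counts: 100 modest values plus one dominant outlier
values = list(range(100)) + [1370]

print(color_limits(values))               # raw range
print(color_limits(values, robust=True))  # range with the outlier clipped
```

With the raw range, nearly all cells would share the lightest few shades. With the robust range, the bulk of the data spans the full colormap and the outlier simply saturates at the darkest color.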
Appendix A — Code
Packages
The program imports the following packages.
import pandas as pd
import fileinput as fi
import matplotlib.pyplot as plt
import seaborn as sns
import string
User Defined Functions
I have defined several functions used by the $\verb|main()|$ function:
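def read(file):
    """Reads given file and parses characters

    Args:
        file: the text file to be parsed
    Returns:
        list of the characters in the file, in order
    """
    # fileinput yields the file line by line; flatten the lines into characters
    with fi.input(file) as lines:
        return [ch for line in lines for ch in line]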
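def count(array):
    """Counts characters and creates freq table

    Args:
        array: character array of text
    Returns:
        freq: dictionary that represents freq table
    """
    # Single pass over the data; calling array.count(c) for every element
    # would rescan the whole array each time
    freq = {}
    for c in array:
        freq[c] = freq.get(c, 0) + 1
    return freq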
def partition2(array):
    """Works similar to Mathematica's partition function
    but slightly differently. This function will create
    a string that combines each pair of adjacent characters
    so the pairs can be tallied by the count function.

    Args:
        array: character array of text
    Returns:
        list of two-character strings, one per transition
    """
    # Pair every character with its successor
    return [str(a) + str(b) for a, b in zip(array, array[1:])]
def dict_print(d):
    """Print function specifically for dictionary

    Args:
        d: dictionary keyed by two-character strings
    Returns:
        None: only prints out the contents of the dictionary
    """
    # A plain loop, since the printing is done only for its side effect
    for key, val in d.items():
        print(key[0] + ' --- ' + key[1] + ' :\t' + str(val))
def to_dataframe(d):
    """Converts the dictionary of transitions to a dataframe
    that can be turned into a heatmap

    Args:
        d: dictionary mapping two-character strings to frequencies
    Returns:
        df: dataframe (rows = first character, columns = second character)
    """
    # :: Space first, then a to z (the sorted order the pivot produced)
    alpha = [' '] + list(string.ascii_lowercase)
    # :: Initialize matrix of zeros
    # :: (DataFrame.append was removed in pandas 2.0, so build the frame directly)
    df = pd.DataFrame(0, index=alpha, columns=alpha)
    df.index.name = 'First'
    df.columns.name = 'Second'
    # :: Add relevant frequencies to the matrix
    for k in d:
        # Pairs containing characters outside the matrix (e.g. newlines) are skipped
        if k[0] in alpha and k[1] in alpha:
            df.loc[k[0], k[1]] = d[k]
    return df.astype(int)
def show_heatmap(df, filename):
    """Create and plot heatmap of data

    Args:
        df: dataframe of frequencies
        filename: path where the heatmap image is saved
    Returns:
        None: Instead will plot a heatmap of the data
    """
    # :: Create heatmap and customize
    sns.set()
    ax = sns.heatmap(df, cmap="binary", robust=True, xticklabels=True, yticklabels=True)
    ax.xaxis.set_label_position('top')
    ax.xaxis.set_ticks_position('top')
    ax.spines['top'].set_visible(False)
    ax.tick_params(top=False, left=False)
    ax.xaxis.label.set_color('darkgray')
    ax.yaxis.label.set_color('darkgray')
    ax.tick_params(axis='x', colors='darkgray')
    ax.tick_params(axis='y', colors='darkgray')
    plt.xlabel('Second Letter', fontsize=18)
    plt.ylabel('First Letter', fontsize=18)
    # :: Save before showing, since the figure may be closed once the window is dismissed
    figure = ax.get_figure()
    figure.savefig(filename, dpi=400)
    plt.show()
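For readers who want to inspect the transition matrix without pandas, the same structure can be sketched as a nested list. This is an illustrative alternative under stated assumptions, not part of the program above: `ALPHABET`, `transition_matrix`, and the example counts dictionary are hypothetical, the space-first ordering is assumed, and keys are two-character strings as produced by `partition2`.

```python
import string

# Assumed ordering: space first, then a to z (27 symbols)
ALPHABET = " " + string.ascii_lowercase


def transition_matrix(pair_counts, alphabet=ALPHABET):
    """Build a square matrix where matrix[i][j] counts alphabet[i] -> alphabet[j]."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    matrix = [[0] * len(alphabet) for _ in alphabet]
    for pair, n in pair_counts.items():
        first, second = pair[0], pair[1]
        # Ignore pairs with characters outside the alphabet (e.g. punctuation)
        if first in index and second in index:
            matrix[index[first]][index[second]] = n
    return matrix


# Two counts taken from Appendix B plus one hypothetical out-of-alphabet pair
m = transition_matrix({"th": 212, "e ": 339, "t!": 5})
print(m[ALPHABET.index("t")][ALPHABET.index("h")])
```

Rows are indexed by the first character and columns by the second, matching the axis labels used in the heatmap.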
Main Program
The main program below ties together the functions above.
def main():
    # :: Reads in text file (once; the characters are reused below)
    # :: Counts the frequencies
    # :: Data stored in dictionary
    # :: Plots histogram of results
    chars = read('text.txt')
    freq_dict = count(chars)
    plt.bar(list(freq_dict.keys()), list(freq_dict.values()), color='gray')
    plt.title('Character Histogram')
    plt.xlabel('Characters')
    plt.ylabel('Frequency')
    plt.show()
    # :: Partitions in 2-tuples for transitions
    # :: Data stored in dictionary
    # :: Frequencies are printed to console/terminal
    transitions = count(partition2(chars))
    dict_print(transitions)
    df = to_dataframe(transitions)
    print(df)
    filename = '/Users/ericpena/iCloud/Binghamton_Courses/500_Computational_Tools/HW2/heatmap.png'
    show_heatmap(df, filename)


if __name__ == '__main__':
    main()
Appendix B — Output Data
Histogram Frequencies
{'d': 234, 'i': 574, 'f': 233, 'e': 958, 'r': 428, 'n': 492, 'c': 255, ' ': 1370, 'w': 111, 'h': 344, 'm': 184, 's': 455, 't': 653, 'o': 475, 'u': 206, 'a': 561, 'p': 146, 'l': 336, 'y': 77, 'x': 24, 'b': 111, 'k': 15, 'g': 103, 'v': 60, 'q': 20, 'j': 9, 'z': 11}
Heatmap Frequencies
d --- i : 30
i --- f : 11
f --- f : 15
f --- e : 10
e --- r : 114
r --- e : 113
e --- n : 105
n --- c : 22
c --- e : 49
e --- : 339
--- w : 79
w --- h : 23
h --- e : 210
--- m : 53
m --- c : 1
c --- : 8
--- i : 72
i --- s : 61
s --- : 199
--- t : 252
t --- h : 212
m --- o : 13
o --- i : 5
s --- t : 41
t --- u : 11
u --- r : 46
--- c : 90
c --- o : 54
o --- n : 104
n --- t : 88
t --- e : 94
t --- : 107
m --- a : 18
a --- : 40
a --- s : 55
s --- s : 18
--- o : 93
o --- f : 70
f --- : 73
--- s : 69
s --- a : 14
a --- m : 23
m --- p : 25
p --- l : 14
l --- e : 47
--- a : 168
a --- f : 6
f --- t : 4
r --- : 64
--- h : 40
h --- u : 12
u --- m : 9
m --- i : 37
i --- d : 28
i --- t : 46
t --- y : 11
y --- : 60
--- e : 40
e --- x : 11
x --- p : 3
p --- o : 41
o --- s : 28
s --- u : 28
a --- n : 92
n --- d : 65
d --- : 140
m --- d : 1
--- d : 39
d --- r : 2
r --- y : 12
--- r : 37
e --- s : 71
u --- l : 21
l --- t : 4
t --- s : 17
s --- c : 3
c --- u : 3
u --- s : 31
s --- i : 53
i --- o : 60
n --- : 131
c --- h : 34
e --- m : 26
i --- c : 47
c --- a : 45
a --- l : 81
l --- : 50
o --- m : 24
t --- i : 92
--- f : 103
f --- i : 59
i --- b : 38
b --- e : 59
r --- s : 37
w --- e : 33
e --- l : 40
l --- l : 49
--- k : 1
k --- n : 1
n --- o : 21
o --- w : 12
w --- n : 3
h --- a : 34
a --- t : 55
--- l : 37
l --- i : 42
i --- g : 23
g --- n : 4
o --- c : 10
l --- u : 30
l --- o : 18
i --- n : 128
n --- v : 2
v --- e : 34
g --- a : 8
e --- d : 68
o --- u : 27
u --- n : 17
n --- e : 37
--- q : 2
q --- u : 20
u --- a : 8
i --- e : 24
d --- o : 5
o --- e : 2
--- n : 19
o --- t : 30
a --- d : 10
d --- d : 1
--- u : 12
u --- p : 9
p --- : 6
t --- o : 47
o --- : 48
i --- m : 21
l --- y : 27
--- b : 46
e --- c : 25
a --- u : 9
s --- e : 44
n --- l : 4
a --- j : 1
j --- o : 1
o --- r : 66
a --- r : 64
e --- p : 6
r --- t : 15
d --- e : 24
e --- t : 32
r --- m : 17
--- p : 66
p --- e : 20
c --- t : 35
p --- r : 27
r --- o : 42
e --- i : 11
n --- s : 24
x --- t : 4
t --- r : 17
r --- a : 32
a --- c : 43
t --- a : 24
a --- b : 18
b --- l : 12
r --- g : 6
n --- i : 28
t --- t : 15
u --- c : 7
h --- : 31
w --- a : 19
a --- x : 12
x --- e : 2
f --- a : 20
l --- c : 1
o --- h : 1
h --- o : 18
o --- l : 26
l --- s : 11
c --- i : 8
d --- s : 16
i --- l : 28
l --- a : 44
r --- l : 5
e --- q : 7
u --- e : 20
a --- p : 11
p --- p : 4
o --- x : 1
x --- : 9
w --- t : 3
h --- i : 26
--- g : 19
g --- o : 2
o --- o : 2
o --- d : 2
a --- g : 7
g --- r : 16
e --- e : 18
m --- e : 46
--- v : 22
v --- a : 20
b --- y : 10
e --- z : 2
z --- : 2
n --- z : 1
z --- a : 1
f --- l : 8
a --- v : 12
g --- h : 10
b --- a : 11
r --- n : 11
n --- h : 6
s --- k : 6
k --- : 7
e --- a : 39
r --- c : 2
g --- e : 13
u --- g : 5
g --- u : 10
u --- i : 13
e --- y : 5
n --- g : 33
g --- : 29
f --- r : 9
m --- : 19
r --- i : 36
e --- o : 6
o --- g : 3
p --- h : 4
e --- g : 6
g --- i : 7
o --- p : 10
r --- f : 13
s --- h : 13
w --- s : 3
h --- t : 7
a --- i : 13
w --- i : 15
s --- w : 3
x --- i : 4
m --- u : 7
d --- u : 7
i --- q : 9
p --- i : 7
i --- i : 1
i --- : 1
e --- f : 7
p --- a : 9
c --- k : 3
k --- e : 5
e --- v : 10
f --- u : 9
b --- s : 1
s --- o : 13
r --- p : 4
p --- t : 6
m --- n : 8
f --- o : 26
n --- f : 6
d --- a : 3
i --- a : 23
h --- l : 1
i --- k : 3
n --- y : 1
n --- a : 19
r --- v : 1
l --- w : 1
a --- y : 3
y --- s : 2
v --- i : 5
r --- r : 8
s --- p : 5
i --- z : 3
z --- e : 4
o --- b : 1
b --- t : 1
i --- p : 1
y --- i : 2
i --- v : 10
c --- r : 9
c --- c : 2
g --- y : 1
--- z : 3
z --- i : 4
s --- m : 7
c --- l : 4
p --- u : 4
t --- w : 6
m --- s : 5
b --- o : 5
l --- d : 11
b --- i : 4
p --- s : 4
b --- u : 7
u --- t : 12
h --- y : 2
y --- d : 1
i --- r : 8
c --- y : 1
g --- g : 1
a --- z : 1
n --- k : 1
y --- z : 1
l --- m : 1
--- y : 3
y --- p : 2
x --- c : 2
r --- u : 4
u --- f : 1
d --- l : 1
o --- a : 1
s --- y : 1
y --- m : 1
o --- v : 1
d --- v : 2
u --- : 1
--- j : 5
j --- u : 2
y --- t : 3
a --- q : 1
y --- r : 1
g --- l : 1
w --- o : 6
r --- d : 5
u --- d : 2
u --- b : 3
y --- e : 4
u --- o : 1
m --- m : 1
e --- w : 6
w --- : 5
s --- b : 1
g --- f : 1
m --- b : 3
a --- w : 1
a --- k : 1
b --- : 1
n --- u : 2
k --- s : 2
n --- j : 1
j --- a : 1
s --- r : 1
a --- e : 1
j --- e : 5
a --- h : 1
r --- b : 1
o --- j : 1
e --- u : 2
v --- o : 1
s --- l : 4
h --- m : 1
h --- r : 2
d --- w : 3
w --- r : 1
e --- j : 1
s --- q : 1