First, build a dictionary (this is the technical term for a list of all the distinct words in a collection or corpus of text).
vocab = {}
i = 0
# loop through each list, find distinct words and map them to a
# unique number starting at zero
for word in A:
    if word not in vocab:
        vocab[word] = i
        i += 1
for word in B:
    if word not in vocab:
        vocab[word] = i
        i += 1
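As a minimal sketch, the same mapping can be wrapped in a small helper that accepts any number of token lists. The function name build_vocab and the sample lists A and B below are illustrative assumptions, not part of the original answer.

```python
# Hypothetical sample inputs: each input is a list of already-split words.
A = ["the", "cat", "sat", "on", "the", "mat"]
B = ["the", "dog", "sat"]


def build_vocab(*token_lists):
    """Map each distinct word to a unique index, in order of first appearance."""
    vocab = {}
    for tokens in token_lists:
        for word in tokens:
            if word not in vocab:
                # len(vocab) is always the next unused index, starting at zero
                vocab[word] = len(vocab)
    return vocab


vocab = build_vocab(A, B)
print(vocab)
# {'the': 0, 'cat': 1, 'sat': 2, 'on': 3, 'mat': 4, 'dog': 5}
```

Using len(vocab) as the next index removes the need for a separate counter variable.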
The dictionary vocab now maps each word to a unique number starting from zero. We will use these numbers as indices into an array (or vector).
In the next step, we will create something called a term frequency vector for each input list. We will use a library called numpy here. It is a very popular way to do this kind of scientific computing. If you are interested in cosine similarity (or other machine learning techniques), it is worth your time to learn.
import numpy as np
# create a numpy array (vector) for each input, filled with zeros
a = np.zeros(len(vocab))
b = np.zeros(len(vocab))
# loop through each input and create a corresponding vector for it
# this vector counts occurrences of each word in the dictionary
for word in A:
    index = vocab[word]  # get index from dictionary
    a[index] += 1  # increment count for that index
for word in B:
    index = vocab[word]
    b[index] += 1
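To make the counting step concrete, here is a small self-contained sketch that builds the two term frequency vectors for a pair of made-up lists. The helper name term_frequency_vector, the sample lists, and the hard-coded vocab are assumptions for illustration; collections.Counter from the standard library does the per-word counting.

```python
from collections import Counter

import numpy as np

# Hypothetical inputs and the vocabulary that the previous step would produce for them.
A = ["the", "cat", "sat", "on", "the", "mat"]
B = ["the", "dog", "sat"]
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4, "dog": 5}


def term_frequency_vector(tokens, vocab):
    """Return a vector whose i-th entry counts how often word i occurs in tokens."""
    vec = np.zeros(len(vocab))
    for word, count in Counter(tokens).items():
        vec[vocab[word]] = count
    return vec


a = term_frequency_vector(A, vocab)
b = term_frequency_vector(B, vocab)
print(a.tolist())  # [2.0, 1.0, 1.0, 1.0, 1.0, 0.0]
print(b.tolist())  # [1.0, 0.0, 1.0, 0.0, 0.0, 1.0]
```

Both vectors have the same length (one slot per vocabulary word), which is what allows them to be compared with a dot product.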
The final step is the actual calculation of the cosine similarity.
# use numpy's dot product to calculate the cosine similarity
sim = np.dot(a, b) / np.sqrt(np.dot(a, a) * np.dot(b, b))
The variable sim now contains your answer. You can pull each of these sub-expressions out and verify that they match your original formula.
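One way to do that check is to name each sub-expression separately. The sketch below is an illustration under assumed inputs: the two vectors are made-up term frequency vectors, and the guard against zero vectors is an addition, since the formula is undefined when either norm is zero.

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine of the angle between a and b, computed term by term."""
    dot = np.dot(a, b)              # numerator: A . B
    norm_a = np.sqrt(np.dot(a, a))  # ||A||
    norm_b = np.sqrt(np.dot(b, b))  # ||B||
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Hypothetical term frequency vectors over a six-word vocabulary.
a = np.array([2.0, 1.0, 1.0, 1.0, 1.0, 0.0])
b = np.array([1.0, 0.0, 1.0, 0.0, 0.0, 1.0])

sim = cosine_similarity(a, b)
# Here A . B = 3, ||A||^2 = 8 and ||B||^2 = 3, so sim = 3 / sqrt(24).
print(sim)
```

The one-line form np.dot(a, b) / np.sqrt(np.dot(a, a) * np.dot(b, b)) gives the same value, because sqrt(x) * sqrt(y) equals sqrt(x * y) for non-negative x and y.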
With a little refactoring, this technique scales quite well (to a large number of input lists with a relatively large number of distinct words). For really large corpora (such as Wikipedia), you should check out natural language processing libraries built for this kind of thing. Here are a few good ones.
- NLTK
- Gensim
- spaCy
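Before reaching for one of those libraries, the refactoring mentioned above might look roughly like the following sketch, which stacks the term frequency vectors of many token lists into one matrix and computes every pairwise similarity at once. The function name pairwise_cosine and the sample documents are assumptions, and the code assumes no document is empty (an empty list would produce a zero norm).

```python
import numpy as np


def pairwise_cosine(docs):
    """Return an (n x n) matrix of cosine similarities between n token lists."""
    # Build one shared vocabulary across all documents.
    vocab = {}
    for tokens in docs:
        for word in tokens:
            vocab.setdefault(word, len(vocab))

    # One row of term counts per document.
    counts = np.zeros((len(docs), len(vocab)))
    for row, tokens in enumerate(docs):
        for word in tokens:
            counts[row, vocab[word]] += 1

    # Scale each row to unit length; assumes every document has at least one word.
    norms = np.linalg.norm(counts, axis=1, keepdims=True)
    unit = counts / norms

    # The dot product of two unit vectors is the cosine of the angle between them.
    return unit @ unit.T


# Hypothetical documents, already split into words.
docs = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat"],
    ["a", "bird", "flew"],
]
sims = pairwise_cosine(docs)
print(sims.round(3))
```

Entry sims[i, j] holds the similarity between document i and document j, so the diagonal is 1 and the matrix is symmetric. A dense matrix like this grows with the vocabulary size, which is one reason dedicated libraries become attractive for very large corpora.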
Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space that measures the cosine of the angle between them.
Similarity = (A.B) / (||A||.||B||) where A and B are vectors.
Cosine similarity and the NLTK toolkit module are used in this program. To run this program, NLTK must be installed on your system. To install the NLTK module, follow the steps below:
1. Open terminal (Linux).
2. sudo pip3 install nltk
3. python3
4. import nltk
5. nltk.download('all')
Functions used:
nltk.tokenize: It is used for tokenization. Tokenization is the process by which a large quantity of text is divided into smaller parts called tokens. word_tokenize(X) splits the given sentence X into words and returns a list.
nltk.corpus: In this program, it is used to get a list of stopwords. A stop word is a commonly used word (such as "the", "a", "an", "in").
Below is the Python implementation:
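As a minimal sketch of the tokenize, filter, and compare steps described above, the following uses plain str.split and a small hand-written stop-word set in place of NLTK's word_tokenize and stopwords corpus. Both substitutions are assumptions made so the example runs without any downloads, and the sample sentences are hypothetical, so its result is not expected to match any output reported for the NLTK program.

```python
import math

# Hypothetical stand-in for NLTK's English stop-word list (assumption, far from complete).
STOP_WORDS = {"the", "a", "an", "in"}


def tokenize(sentence):
    """Simplified tokenizer: lowercase, then split on whitespace."""
    return sentence.lower().split()


def cosine_similarity_binary(x, y):
    """Cosine similarity of two sentences using binary (present/absent) word vectors."""
    x_set = {w for w in tokenize(x) if w not in STOP_WORDS}
    y_set = {w for w in tokenize(y) if w not in STOP_WORDS}
    if not x_set or not y_set:
        return 0.0  # nothing left to compare after stop-word removal
    # For binary vectors, A . B is the number of shared words
    # and ||A|| is the square root of the number of distinct words.
    shared = len(x_set & y_set)
    return shared / math.sqrt(len(x_set) * len(y_set))


result = cosine_similarity_binary("The cat sat in the hat", "A cat in a hat")
print("similarity:", result)
```

With these inputs the filtered sets are {cat, sat, hat} and {cat, hat}, so the value is 2 / sqrt(6). Unlike the term frequency vectors used earlier, binary vectors ignore how often a word repeats.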
Output:
similarity: 0.2886751345948129