1. Introduction of TF-IDF algorithm
TF-IDF (term frequency-inverse document frequency) is a commonly used weighting technology for information retrieval and text mining.
TF-IDF is a statistical method used to assess the importance of a word to a document set or one of the documents in a corpus. The importance of words increases with the number of times they appear in documents, but decreases inversely with the frequency they appear in corpus.
The main idea of TF-IDF is that if a word appears frequently in an article and rarely in other articles, it is considered that the word or phrase has good classification ability and is suitable for classification.
(1) TF is Term Frequency.
Word frequency (TF) denotes the frequency of entries (keywords) appearing in text.
This number is usually normalized (usually by dividing the word frequency by the total number of words in the article) to prevent it from leaning toward long files.
Formula: namely:
Among them, Ni and j are the number of occurrences of the word in the document dj, and denominator is the sum of occurrences of all words in the document dj.
(2) IDF is Inverse Document Frequency
Reverse File Frequency (IDF): The IDF of a particular word can be obtained by dividing the total number of files by the number of files containing that word.
If fewer documents contain entry t and larger IDF, the entry has a good ability to distinguish categories.
Formula:
Among them, | D | is the total number of files in the corpus. | {j:ti < dj} | denotes the number of files containing the word ti (i.e., ni,j_0). If the word is not in the corpus, it will cause the denominator to be zero, so in general, 1+|{j:ti < dj} is used.|
That is:
(3) TF-IDF is actually: TF*IDF
High word frequencies in a particular file and low file frequencies in the entire file set can generate high-weight TF-IDF. Therefore, TF-IDF tends to filter out common words and retain important ones.
Formula:
Goang implements TF-IDF algorithm
1 package main
2
3 import (
4 "fmt"
5 "math"
6 "sort"
7 "time"
8 )
9
10 type wordTfIdf struct {
11 nworld string
12 value float64
13 }
14
15 func main() {
16 start := currentTimeMillis()
17 FeatureSelect(Load())
18
19
20 cost := currentTimeMillis() - start
21 fmt.Printf("time consuming %d ms ",cost)
22
23 }
24
25 type wordTfIdfs []wordTfIdf
26 type Interface interface {
27 Len() int
28 Less(i, j int) bool
29 Swap(i, j int)
30 }
31
32 func (us wordTfIdfs) Len() int {
33 return len(us)
34 }
35 func (us wordTfIdfs) Less(i, j int) bool {
36 return us[i].value > us[j].value
37 }
38 func (us wordTfIdfs) Swap(i, j int) {
39 us[i], us[j] = us[j], us[i]
40 }
41
42 func currentTimeMillis() int64 {
43 return time.Now().UnixNano() / 1000000
44 }
45 func FeatureSelect(list_words [][]string) {
46 docFrequency := make(map[string]float64, 0)
47 sumWorlds := 0;
48 for _, wordList := range list_words {
49 for _, v := range wordList {
50 docFrequency[v] += 1
51 sumWorlds++;
52 }
53 }
54 wordTf := make(map[string]float64)
55 for k, _ := range docFrequency {
56 wordTf[k] = docFrequency[k] / float64(sumWorlds)
57 }
58 docNum := float64(len(list_words))
59 wordIdf := make(map[string]float64)
60 wordDoc := make(map[string]float64, 0)
61 for k, _ := range docFrequency {
62 for _, v := range list_words {
63 for _, vs := range v {
64 if (k == vs) {
65 wordDoc[k] += 1
66 break
67 }
68 }
69 }
70 }
71 for k, _ := range docFrequency {
72 wordIdf[k] = math.Log(docNum / (wordDoc[k] + 1))
73 }
74 var wordifS wordTfIdfs
75 for k, _ := range docFrequency {
76 var wti wordTfIdf
77 wti.nworld = k
78 wti.value = wordTf[k] * wordIdf[k]
79 wordifS = append(wordifS, wti)
80 }
81 sort.Sort(wordifS)
82 fmt.Println(wordifS)
83 }
84
85 func Load() [][]string {
86 slice := [][]string{
87 {"my", "dog", "has", "flea", "problems", "help", "please"},
88 {"maybe", "not", "take", "him", "to", "dog", "park", "stupid"},
89 {"my", "dalmation", "is", "so", "cute", "I", "love", "him"},
90 {"stop", "posting", "stupid", "worthless", "garbage"},
91 {"mr", "licks", "ate", "my", "steak", "how", "to", "stop", "him"},
92 {"quit", "buying", "worthless", "dog", "food", "stupid"},
93 }
94 return slice
95 }