Getting Started with AI: Machine Learning: sklearn Feature Extraction

Keywords: Python, pip

1. sklearn feature extraction

1.1 Install sklearn

pip install scikit-learn -i http://pypi.douban.com/simple/ --trusted-host pypi.douban.com

If the installation finishes without errors, run the import command to check that the library is available:

import sklearn

Note: NumPy, pandas, and other dependencies must be installed before scikit-learn can be used.
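If you want to confirm which version was installed, you can also print it (a minimal check; the exact version string depends on your environment):

import sklearn
print(sklearn.__version__)  # prints the installed scikit-learn version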

1.2 Feature Extraction

Example:

# Feature extraction example

# Import the vectorizer class
from sklearn.feature_extraction.text import CountVectorizer

# Instantiate CountVectorizer
vector = CountVectorizer()

# Call fit_transform to learn the vocabulary and transform the data
res = vector.fit_transform(["life is short,i like python","life is too long,i dislike python"])

# Print results
print(vector.get_feature_names())

print(res.toarray())

Run result:
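With CountVectorizer's default tokenizer, which keeps only tokens of two or more characters (so the single letter "i" is dropped), the snippet should print something close to:

['dislike', 'is', 'life', 'like', 'long', 'python', 'short', 'too']
[[0 1 1 1 0 1 1 0]
 [1 1 1 0 1 1 0 1]]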

From this example we can see that feature extraction converts data such as text into numerical feature vectors.

1.3 Dictionary Feature Extraction

Role: vectorize dictionary (dict) data.

Class: sklearn.feature_extraction.DictVectorizer

DictVectorizer syntax:

DictVectorizer(sparse=True,...)

  • DictVectorizer.fit_transform(X)
    • X: a dictionary or an iterable of dictionaries
    • Return value: a sparse matrix
  • DictVectorizer.inverse_transform(X)
    • X: array or sparse matrix
    • Return value: the data in its format before conversion
  • DictVectorizer.get_feature_names()
    • Return value: list of feature names
  • DictVectorizer.transform(X)
    • Transforms new data according to the feature mapping learned earlier

from sklearn.feature_extraction import DictVectorizer

def dictvec():
    """
    //Dictionary Data Extraction
    :return: None
    """
    # instantiation
    dict = DictVectorizer()

    # call fit_transform
    data = dict.fit_transform([{'city': 'Beijing','temperature': 100}, {'city': 'Shanghai','temperature':60}, {'city': 'Shenzhen','temperature': 30}])

    print(dict.get_feature_names())

    print(dict.inverse_transform(data))

    print(data)

    return None

if __name__ == "__main__":
    dictvec()

Run result:
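Because sparse=True is the default, the matrix is printed in (row, column) coordinate form. With this input, the output should look roughly like:

['city=Beijing', 'city=Shanghai', 'city=Shenzhen', 'temperature']
[{'city=Beijing': 1.0, 'temperature': 100.0}, {'city=Shanghai': 1.0, 'temperature': 60.0}, {'city=Shenzhen': 1.0, 'temperature': 30.0}]
  (0, 0)    1.0
  (0, 3)    100.0
  (1, 1)    1.0
  (1, 3)    60.0
  (2, 2)    1.0
  (2, 3)    30.0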

Setting the sparse attribute to False makes the output more intuitive: fit_transform then returns a dense array instead of a sparse matrix.

from sklearn.feature_extraction import DictVectorizer

def dictvec():
    """
    //Dictionary Data Extraction
    :return: None
    """
    # instantiation
    dict = DictVectorizer(sparse=False)

    # call fit_transform
    data = dict.fit_transform([{'city': 'Beijing','temperature': 100}, {'city': 'Shanghai','temperature':60}, {'city': 'Shenzhen','temperature': 30}])

    print(dict.get_feature_names())

    print(dict.inverse_transform(data))

    print(data)

    return None

if __name__ == "__main__":
    dictvec()

Run result:
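With sparse=False, the city feature is one-hot encoded into a dense array, so the output should look roughly like:

['city=Beijing', 'city=Shanghai', 'city=Shenzhen', 'temperature']
[{'city=Beijing': 1.0, 'temperature': 100.0}, {'city=Shanghai': 1.0, 'temperature': 60.0}, {'city=Shenzhen': 1.0, 'temperature': 30.0}]
[[  1.   0.   0. 100.]
 [  0.   1.   0.  60.]
 [  0.   0.   1.  30.]]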

1.4 Text Feature Extraction

Role: vectorize text data.

Class: sklearn.feature_extraction.text.CountVectorizer

CountVectorizer syntax:

CountVectorizer(max_df=1.0,min_df=1,...)

  • Returns a term-frequency matrix
  • CountVectorizer.fit_transform(X,y)
    • X: text or an iterable of text strings
    • Return value: a sparse matrix
  • CountVectorizer.inverse_transform(X)
    • X: array or sparse matrix
    • Return value: the data in its format before conversion
  • CountVectorizer.get_feature_names()
    • Return value: list of words (the vocabulary)

from sklearn.feature_extraction.text import CountVectorizer

def countvec():
    """
    //Eigenvaluating text
    :return: None
    """
    cv = CountVectorizer()

    data = cv.fit_transform(["Life is short, I like it python", "Life is long, don't use it python"])

    print(cv.get_feature_names())

    print(data.toarray())

    return None

if __name__ == "__main__":
    countvec()

Run result:
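Note that the default tokenizer keeps only tokens of two or more characters, so "I" is dropped and "don't" is split into "don" and "t" (the lone "t" is then discarded). The output should therefore be roughly:

['don', 'is', 'it', 'life', 'like', 'long', 'python', 'short', 'use']
[[0 1 1 1 1 0 1 1 0]
 [1 1 1 1 0 1 1 0 1]]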

When the text is Chinese, words are not separated by spaces, so CountVectorizer cannot split them on its own. We therefore segment the text first with a tool called jieba.

pip install jieba -i http://pypi.douban.com/simple/ --trusted-host pypi.douban.com

from sklearn.feature_extraction.text import CountVectorizer
import jieba

def cutword():

    con1 = jieba.cut("Today is cruel, tomorrow is crueller, the day after tomorrow is beautiful, but most of them die tomorrow evening, so everyone should not give up today.")

    con2 = jieba.cut("We see light coming from distant galaxies millions of years ago, so when we see the universe, we are looking at its past.")

    con3 = jieba.cut("If you know something in only one way, you don't really know it.The secret to knowing what things really mean depends on how they relate to what we know.")

    # Convert to List
    content1 = list(con1)
    content2 = list(con2)
    content3 = list(con3)

    # Convert list to string
    c1 = ' '.join(content1)
    c2 = ' '.join(content2)
    c3 = ' '.join(content3)

    return c1, c2, c3

def hanzivec():
    """
    //Chinese eigenvaluation
    :return: None
    """
    c1, c2, c3 = cutword()

    print(c1, c2, c3)

    cv = CountVectorizer()

    data = cv.fit_transform([c1, c2, c3])

    print(cv.get_feature_names())

    print(data.toarray())

    return None

if __name__ == "__main__":
    hanzivec()

Run result:

Building prefix dict from the default dictionary ...
Dumping model to file cache C:\Users\ACER\AppData\Local\Temp\jieba.cache
Loading model cost 0.839 seconds.
Prefix dict has been built successfully.
Today is cruel, tomorrow is crueller, the day after tomorrow is beautiful, but most of them die tomorrow evening, so everyone should not give up today.We see light coming from distant galaxies millions of years ago, so when we see the universe, we are looking at its past.If you know something in only one way, you don't really know it.The secret to knowing what things really mean depends on how they relate to what we know.
[the 36 segmented Chinese words returned by cv.get_feature_names()]
[[0 0 1 0 0 0 2 0 0 0 0 0 1 0 1 0 0 0 0 1 1 0 2 0 1 0 2 1 0 0 0 1 1 0 0 0]
 [0 0 0 1 0 0 0 1 1 1 0 0 0 0 0 0 0 1 3 0 0 0 0 1 0 0 0 0 2 0 0 0 0 0 1 1]
 [1 1 0 0 4 3 0 0 0 0 1 1 0 1 0 1 1 0 1 0 0 1 0 0 0 1 0 0 0 2 1 0 0 1 0 0]]

1.5 TF-IDF

The main idea of TF-IDF: if a word or phrase appears frequently in one article but rarely in other articles, it is considered a good discriminator between categories and is suitable for classification.

TF-IDF role: to evaluate how important a word is to a document within a collection or corpus.
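As a rough illustration of how such a weight is computed, here is a minimal sketch assuming scikit-learn's defaults (smooth_idf=True and L2 row normalization); the two toy documents below are made up for this sketch and are not from the examples in this article:

import numpy as np

# Toy corpus, already tokenized
docs = [["life", "is", "short"], ["life", "is", "long", "long"]]
vocab = sorted({w for d in docs for w in d})
n_docs = len(docs)

# Term frequency: raw count of each vocabulary word per document
tf = np.array([[d.count(w) for w in vocab] for d in docs], dtype=float)

# Document frequency and smoothed inverse document frequency
df = (tf > 0).sum(axis=0)
idf = np.log((1 + n_docs) / (1 + df)) + 1

# TF-IDF weights, then L2-normalize each document row
tfidf = tf * idf
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)

print(vocab)
print(tfidf.round(3))

A word such as "long", which appears in only one document, ends up with a higher weight there than "life" and "is", which appear in every document.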

Class: sklearn.feature_extraction.text.TfidfVectorizer

TfidfVectorizer syntax:

TfidfVectorizer(stop_words=None,...)

  • Returns the weight matrix of words
  • TfidfVectorizer.fit_transform(X,y)
    • X: text or an iterable of text strings
    • Return value: a sparse matrix
  • TfidfVectorizer.inverse_transform(X)
    • X: array or sparse matrix
    • Return value: the data in its format before conversion
  • TfidfVectorizer.get_feature_names()
    • Return value: list of words (the vocabulary)

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import jieba

def cutword():

    con1 = jieba.cut("Today is cruel, tomorrow is crueller, the day after tomorrow is beautiful, but most of them die tomorrow evening, so everyone should not give up today.")

    con2 = jieba.cut("We see light coming from distant galaxies millions of years ago, so when we see the universe, we are looking at its past.")

    con3 = jieba.cut("If you know something in only one way, you don't really know it.The secret to knowing what things really mean depends on how they relate to what we know.")

    # Convert to List
    content1 = list(con1)
    content2 = list(con2)
    content3 = list(con3)

    # Convert List to String
    c1 = ' '.join(content1)
    c2 = ' '.join(content2)
    c3 = ' '.join(content3)

    return c1, c2, c3

def tfidfvec():
    """
    //Chinese eigenvaluation
    :return: None
    """
    c1, c2, c3 = cutword()

    print(c1, c2, c3)

    tf = TfidfVectorizer()

    data = tf.fit_transform([c1, c2, c3])

    print(tf.get_feature_names())

    print(data.toarray())

    return None

if __name__ == "__main__":
    tfidfvec()

Run result:

Building prefix dict from the default dictionary ...
Loading model from cache C:\Users\ACER\AppData\Local\Temp\jieba.cache
Loading model cost 0.633 seconds.
Prefix dict has been built successfully.
Today is cruel, tomorrow is crueller, the day after tomorrow is beautiful, but most of them die tomorrow evening, so everyone should not give up today.We see light coming from distant galaxies millions of years ago, so when we see the universe, we are looking at its past.If you know something in only one way, you don't really know it.The secret to knowing what things really mean depends on how they relate to what we know.
[the 36 segmented Chinese words returned by tf.get_feature_names()]
[[0.         0.         0.21821789 0.         0.         0.
  0.43643578 0.         0.         0.         0.         0.
  0.21821789 0.         0.21821789 0.         0.         0.
  0.         0.21821789 0.21821789 0.         0.43643578 0.
  0.21821789 0.         0.43643578 0.21821789 0.         0.
  0.         0.21821789 0.21821789 0.         0.         0.        ]
 [0.         0.         0.         0.2410822  0.         0.
  0.         0.2410822  0.2410822  0.2410822  0.         0.
  0.         0.         0.         0.         0.         0.2410822
  0.55004769 0.         0.         0.         0.         0.2410822
  0.         0.         0.         0.         0.48216441 0.
  0.         0.         0.         0.         0.2410822  0.2410822 ]
 [0.15698297 0.15698297 0.         0.         0.62793188 0.47094891
  0.         0.         0.         0.         0.15698297 0.15698297
  0.         0.15698297 0.         0.15698297 0.15698297 0.
  0.1193896  0.         0.         0.15698297 0.         0.
  0.         0.15698297 0.         0.         0.         0.31396594
  0.15698297 0.         0.         0.15698297 0.         0.        ]]

Compared with the raw counts from CountVectorizer, the TF-IDF weights show which words carry more importance in each text.
