Feature extraction
Detailed explanations are in the linked reference (the "portal").
I'm just here to sort out the APIs I've learned about.
Loading features from dicts
This is convenient for extracting features when the data comes as dicts. For example, if our data contains three different cities, they can be one-hot encoded.
This uses the DictVectorizer class.
>>> measurements = [
...     {'city': 'Dubai', 'temperature': 33.},
...     {'city': 'London', 'temperature': 12.},
...     {'city': 'San Fransisco', 'temperature': 18.},
... ]
>>> from sklearn.feature_extraction import DictVectorizer
>>> vec = DictVectorizer()
>>> vec.fit_transform(measurements).toarray()
array([[  1.,   0.,   0.,  33.],
       [  0.,   1.,   0.,  12.],
       [  0.,   0.,   1.,  18.]])
>>> vec.get_feature_names()
['city=Dubai', 'city=London', 'city=San Fransisco', 'temperature']
Here's another example, pos_window. I haven't worked with part-of-speech tagging myself. At first I thought this representation was a poor fit here because it produces so many zeros, but after looking more closely (the matrix is stored sparsely, see below) I think it's fine. I'd still appreciate it if someone could confirm.
The following is copied from the original English documentation.
For example, suppose that we have a first algorithm that extracts Part of Speech (PoS) tags that we want to use as complementary tags for training a sequence classifier (e.g. a chunker). The following dict could be such a window of features extracted around the word 'sat' in the sentence 'The cat sat on the mat.':
>>> pos_window = [
...     {
...         'word-2': 'the',
...         'pos-2': 'DT',
...         'word-1': 'cat',
...         'pos-1': 'NN',
...         'word+1': 'on',
...         'pos+1': 'PP',
...     },
...     # in a real application one would extract many such dictionaries
... ]
This description can be vectorized into a sparse two-dimensional matrix suitable for feeding into a classifier (maybe after being piped into a text.TfidfTransformer for normalization):
>>> vec = DictVectorizer()
>>> pos_vectorized = vec.fit_transform(pos_window)
>>> pos_vectorized
<1x6 sparse matrix of type '<... 'numpy.float64'>'
    with 6 stored elements in Compressed Sparse ... format>
>>> pos_vectorized.toarray()
array([[ 1.,  1.,  1.,  1.,  1.,  1.]])
>>> vec.get_feature_names()
['pos+1=PP', 'pos-1=NN', 'pos-2=DT', 'word+1=on', 'word-1=cat', 'word-2=the']
As you can imagine, if one extracts such a context around each individual word of a corpus of documents the resulting matrix will be very wide (many one-hot-features) with most of them being valued to zero most of the time. So as to make the resulting data structure able to fit in memory the DictVectorizer class uses a scipy.sparse matrix by default instead of a numpy.ndarray.
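By the way, if you do want a dense array instead, DictVectorizer can be constructed with sparse=False. A minimal sketch, reusing the measurements list from above (the variable names here are mine):

>>> from sklearn.feature_extraction import DictVectorizer
>>> vec_dense = DictVectorizer(sparse=False)   # return a numpy.ndarray instead of a scipy.sparse matrix
>>> X_dense = vec_dense.fit_transform(measurements)
>>> X_dense.shape
(3, 4)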
Let's leave this part aside and go on.
Feature hashing
The FeatureHasher class provides high-speed, low-memory vectorization using a technique called feature hashing (the "hashing trick"). Since I haven't touched this area yet, I won't go into detail here.
It is based on MurmurHash3, which is well known and which I've come across before. Due to limitations of scipy.sparse, the number of features is capped at
$$2^{31}-1$$
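A minimal FeatureHasher sketch on dict inputs (the variable names and the toy dicts are mine; note that the columns are anonymous hash buckets, so there is no get_feature_names here):

>>> from sklearn.feature_extraction import FeatureHasher
>>> hasher = FeatureHasher(n_features=10)      # number of hash buckets, capped at 2**31 - 1
>>> D = [{'dog': 1, 'cat': 2, 'elephant': 4}, {'dog': 2, 'run': 5}]
>>> f = hasher.transform(D)                    # stateless: no fit / vocabulary needed
>>> f.shape
(2, 10)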
Text feature extraction
Common Vectorizer usage
Vectorization is the transformation of a collection of text documents into numerical feature vectors. This particular strategy, also known as "bag of words" or "bag of n-grams", completely ignores the position of words within the document.
The first is CountVectorizer.
>>> from sklearn.feature_extraction.text import CountVectorizer
There are many parameters.
>>> vectorizer = CountVectorizer(min_df=1)
>>> vectorizer
CountVectorizer(analyzer=...'word', binary=False, decode_error=...'strict',
        dtype=<... 'numpy.int64'>, encoding=...'utf-8', input=...'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=...'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)
Here's a small usage example.
>>> corpus = [
...     'This is the first document.',
...     'This is the second second document.',
...     'And the third one.',
...     'Is this the first document?',
... ]
>>> X = vectorizer.fit_transform(corpus)
>>> X
<4x9 sparse matrix of type '<... 'numpy.int64'>'
    with 19 stored elements in Compressed Sparse ... format>
Result
>>> vectorizer.get_feature_names() == (
...     ['and', 'document', 'first', 'is', 'one',
...      'second', 'the', 'third', 'this'])
True
>>> X.toarray()
array([[0, 1, 1, 1, 0, 0, 1, 0, 1],
       [0, 1, 0, 1, 0, 2, 1, 0, 1],
       [1, 0, 0, 0, 1, 0, 1, 1, 0],
       [0, 1, 1, 1, 0, 0, 1, 0, 1]]...)
As you can see, this just counts how many times each feature (token) appears in each document. It is essentially a one-hot / raw-count representation, which on its own is usually not very useful in practice.
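Since plain counts discard word order entirely, a common tweak (the "bag of n-grams" idea mentioned earlier) is to also count bigrams via ngram_range. A quick sketch on the same corpus (variable names are mine):

>>> bigram_vectorizer = CountVectorizer(ngram_range=(1, 2), min_df=1)
>>> X_2 = bigram_vectorizer.fit_transform(corpus)      # unigram + bigram counts
>>> 'is this' in bigram_vectorizer.get_feature_names()
True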
Tf–idf term weighting
This works better. I won't explain tf-idf itself in detail; the principle is simple.
Here's an example. The counts matrix below already contains precomputed term counts for just three terms across six documents.
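For reference, with the settings used in the example below (smooth_idf=False and the default L2 normalization), scikit-learn computes roughly

$$\text{tf-idf}(t, d) = \text{tf}(t, d) \times \text{idf}(t), \qquad \text{idf}(t) = \ln\frac{n}{\text{df}(t)} + 1$$

where n is the total number of documents and df(t) is the number of documents containing term t; each row is then normalized to unit Euclidean length.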
>>> from sklearn.feature_extraction.text import TfidfTransformer
>>> transformer = TfidfTransformer(smooth_idf=False)
>>> counts = [[3, 0, 1],
...           [2, 0, 0],
...           [3, 0, 0],
...           [4, 0, 0],
...           [3, 2, 0],
...           [3, 0, 2]]
>>> tfidf = transformer.fit_transform(counts)
>>> tfidf
<6x3 sparse matrix of type '<... 'numpy.float64'>'
    with 9 stored elements in Compressed Sparse ... format>
>>> tfidf.toarray()
array([[ 0.81940995,  0.        ,  0.57320793],
       [ 1.        ,  0.        ,  0.        ],
       [ 1.        ,  0.        ,  0.        ],
       [ 1.        ,  0.        ,  0.        ],
       [ 0.47330339,  0.88089948,  0.        ],
       [ 0.58149261,  0.        ,  0.81355169]])
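As a small sanity check (assuming the transformer defined above), the fitted per-term idf weights are exposed as idf_; for these counts they come out to roughly [1.0, 2.79, 2.10], matching the formula above since the first term occurs in every document:

>>> transformer.idf_
array([ 1.        ,  2.79...,  2.09...])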
In an actual project, it is typically used like this.
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> vectorizer = TfidfVectorizer(min_df=1)
>>> vectorizer.fit_transform(corpus)
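As far as I can tell, TfidfVectorizer is simply CountVectorizer followed by TfidfTransformer wrapped in a single object. A rough sketch of that equivalence using a pipeline (the names here are mine):

>>> from sklearn.pipeline import make_pipeline
>>> from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
>>> pipe = make_pipeline(CountVectorizer(min_df=1), TfidfTransformer())
>>> X_pipe = pipe.fit_transform(corpus)        # same shape as the TfidfVectorizer output
>>> X_pipe.shape
(4, 9)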
Vectorizing a large text corpus with the hashing trick
The hashing trick is used to scale to large datasets. I haven't used it yet, but it looks impressive.
The above vectorization scheme is simple but the fact that it holds an in-memory mapping from the string tokens to the integer feature indices (the vocabulary_ attribute) causes several problems when dealing with large datasets:
- the larger the corpus, the larger the vocabulary will grow and hence the memory use too,
- fitting requires the allocation of intermediate data structures of size proportional to that of the original dataset,
- building the word-mapping requires a full pass over the dataset hence it is not possible to fit text classifiers in a strictly online manner,
- pickling and un-pickling vectorizers with a large vocabulary_ can be very slow (typically much slower than pickling / un-pickling flat data structures such as a NumPy array of the same size),
- it is not easily possible to split the vectorization work into concurrent sub tasks as the vocabulary_ attribute would have to be a shared state with a fine grained synchronization barrier: the mapping from token string to feature index is dependent on ordering of the first occurrence of each token hence would have to be shared, potentially harming the concurrent workers' performance to the point of making them slower than the sequential variant.
>>> from sklearn.feature_extraction.text import HashingVectorizer
>>> hv = HashingVectorizer(n_features=10)
>>> hv.transform(corpus)
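Because the hasher is stateless (there is no vocabulary_ to fit), it pairs naturally with incremental learners for strictly online training. A rough sketch with toy mini-batches (the batch data, labels, and variable names are made up for illustration):

>>> from sklearn.feature_extraction.text import HashingVectorizer
>>> from sklearn.linear_model import SGDClassifier
>>> hv = HashingVectorizer(n_features=2 ** 10)
>>> clf = SGDClassifier()
>>> batches = [                                    # toy (texts, labels) mini-batches
...     (['good film', 'great movie'], [1, 1]),
...     (['terrible plot', 'bad acting'], [0, 0]),
... ]
>>> for texts, labels in batches:
...     _ = clf.partial_fit(hv.transform(texts), labels, classes=[0, 1])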