Yellowbrick: a powerful Python visualization tool

Keywords: Python pip JSON

Author: Yishui Hancheng


In machine learning and data mining, data exploration is an essential part of the job for anyone who works with data processing and analysis, and visualization is a powerful tool for data analysts to carry out that exploration. This article introduces Yellowbrick, a visualization tool I use frequently in my daily work. It is a higher-level visualization library built on top of scikit-learn and matplotlib that makes much of the work of exploring, slicing, and presenting data considerably easier.

The best way to learn a module is to study the API it provides. Here are some good references:

1) Official documentation (English)

https://www.scikit-yb.org/en/latest/

2) Official documentation (Chinese)

http://www.scikit-yb.org/zh/latest/

Yellowbrick is a suite of visual diagnostic tools called "Visualizers" that extend the scikit-learn API to steer the model selection process. In a word, Yellowbrick combines scikit-learn with matplotlib, in the best tradition of the scikit-learn documentation, to visualize your models!

To understand Yellowbrick, you first have to understand Visualizers: estimator objects that learn from data, whose main task is to produce views that give deeper insight into the model selection process. From the scikit-learn point of view, a Visualizer resembles a transformer when it visualizes the data space, and works like a meta-estimator when it wraps a model estimator, much as the "ModelCV" estimators (such as RidgeCV and LassoCV) do. Yellowbrick's main goal is to provide a sensible API similar to scikit-learn's.
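To make that parallel concrete, here is a minimal sketch of the typical Visualizer workflow, shown with ResidualsPlot (which reappears later in this article) on synthetic data; note that poof() was renamed show() in Yellowbrick 1.0 and later:

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from yellowbrick.regressor import ResidualsPlot

# Synthetic data purely for illustration
X, y = make_regression(n_samples=500, n_features=5, noise=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

visualizer = ResidualsPlot(LinearRegression())  # wrap an estimator, meta-estimator style
visualizer.fit(X_train, y_train)                # fit the wrapped model
visualizer.score(X_test, y_test)                # add the test data to the plot
visualizer.poof()                               # finalize and render the figure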

The most popular Visualizers in Yellowbrick cover feature analysis (e.g. Rank2D and JointPlotVisualizer), regression diagnostics (ResidualsPlot, AlphaSelection, PredictionError), and text analysis (FreqDistVisualizer, TSNEVisualizer); several of these are demonstrated below.

Such a powerful visualization tool can be installed very simply, with the following command:

pip install yellowbrick

If you need to upgrade to the latest version, use the following command:

pip install -U yellowbrick

Once it is installed we can start using it. The module ships with several commonly used example datasets, including the bikeshare and hobbies data used in the rest of this article.
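At the time of writing, these bundled datasets can be fetched with Yellowbrick's download helper (if your version has reorganized this, check the documentation):

python -m yellowbrick.download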

Inside each dataset folder there are three files. For bikeshare they are:

bikeshare.csv: the dataset itself;

meta.json: metadata describing the fields;

README.md: a short introduction to the dataset.
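To give a flavor, meta.json records the feature and target names for the dataset. The sketch below is illustrative, reconstructed from the columns used in the code that follows rather than copied from the actual file:

{
    "feature_names": ["season", "month", "hour", "holiday", "weekday",
                      "workingday", "weather", "temp", "feelslike",
                      "humidity", "windspeed"],
    "target_name": "riders"
}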

Based on the bike-share dataset, some simple data analysis can be done as follows:

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, Ridge, RidgeCV
from sklearn.model_selection import train_test_split
from yellowbrick.features import JointPlotVisualizer, Rank2D
from yellowbrick.regressor import AlphaSelection, PredictionError, ResidualsPlot

# Feature columns shared by all of the bike-share examples below
FEATURES = ["season", "month", "hour", "holiday", "weekday", "workingday",
            "weather", "temp", "feelslike", "humidity", "windspeed"]


def testFunc5(savepath='Results/bikeshare_Rank2D.png'):
    '''
    Pairwise feature correlation on the bike-share dataset
    '''
    data = pd.read_csv('bikeshare/bikeshare.csv')
    X = data[FEATURES]
    y = data["riders"]
    visualizer = Rank2D(algorithm="pearson")
    visualizer.fit_transform(X)
    visualizer.poof(outpath=savepath)


def testFunc6(savepath='Results/bikeshare_temperate_feelslike_relation.png'):
    '''
    A closer look at the correlation between temp and feelslike
    '''
    data = pd.read_csv('bikeshare/bikeshare.csv')
    X = data[FEATURES]
    y = data["riders"]
    visualizer = JointPlotVisualizer(feature='temp', target='feelslike')
    visualizer.fit(X['temp'], X['feelslike'])
    visualizer.poof(outpath=savepath)


def testFunc7(savepath='Results/bikeshare_LinearRegression_ResidualsPlot.png'):
    '''
    Fit a linear regression model on the bike-share data and plot its residuals
    '''
    data = pd.read_csv('bikeshare/bikeshare.csv')
    X = data[FEATURES]
    y = data["riders"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
    visualizer = ResidualsPlot(LinearRegression())
    visualizer.fit(X_train, y_train)
    visualizer.score(X_test, y_test)
    visualizer.poof(outpath=savepath)


def testFunc8(savepath='Results/bikeshare_RidgeCV_AlphaSelection.png'):
    '''
    Alpha selection with RidgeCV on the bike-share data
    '''
    data = pd.read_csv('bikeshare/bikeshare.csv')
    X = data[FEATURES]
    y = data["riders"]
    alphas = np.logspace(-10, 1, 200)
    visualizer = AlphaSelection(RidgeCV(alphas=alphas))
    visualizer.fit(X, y)
    visualizer.poof(outpath=savepath)


def testFunc9(savepath='Results/bikeshare_Ridge_PredictionError.png'):
    '''
    Prediction error plot for a Ridge model on the bike-share data
    '''
    data = pd.read_csv('bikeshare/bikeshare.csv')
    X = data[FEATURES]
    y = data["riders"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
    visualizer = PredictionError(Ridge(alpha=3.181))
    visualizer.fit(X_train, y_train)
    visualizer.score(X_test, y_test)
    visualizer.poof(outpath=savepath)
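To reproduce the figures described below, the functions can be driven from a simple entry point (this driver is my addition, not part of the original listing); note that the Results directory used by the default savepath values must exist:

import os

if __name__ == '__main__':
    os.makedirs('Results', exist_ok=True)  # default savepath values write here
    testFunc5()
    testFunc6()
    testFunc7()
    testFunc8()
    testFunc9()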

 

The generated figures:

bikeshare_Rank2D.png: pairwise Pearson correlation of the features

bikeshare_LinearRegression_ResidualsPlot.png: residuals of the linear regression model

bikeshare_RidgeCV_AlphaSelection.png: alpha selection for RidgeCV

bikeshare_Ridge_PredictionError.png: prediction error of the Ridge model

In addition to analyzing and displaying numeric data directly, Yellowbrick can also process and analyze text data. Below is a brief walkthrough based on the hobbies corpus; the code is as follows:

from collections import defaultdict

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from yellowbrick.text import FreqDistVisualizer, TSNEVisualizer

# load_corpus reads the hobbies corpus from disk; a sketch of it is given after this listing.


def hobbiesFreqDistVisualizer():
    '''
    Text visualization.
    Token frequency distribution: plot how often each token occurs in the corpus.
    t-SNE corpus visualization: plot similar documents close together to reveal clusters.
    '''
    corpus = load_corpus("data/hobbies")
    vectorizer = CountVectorizer()
    docs = vectorizer.fit_transform(corpus.data)
    features = vectorizer.get_feature_names()
    visualizer = FreqDistVisualizer(features=features)
    visualizer.fit(docs)
    visualizer.poof(outpath='text_hobbies_FreqDistVisualizer.png')
    # The same plot with English stop words removed
    vectorizer = CountVectorizer(stop_words='english')
    docs = vectorizer.fit_transform(corpus.data)
    features = vectorizer.get_feature_names()
    visualizer = FreqDistVisualizer(features=features)
    visualizer.fit(docs)
    visualizer.poof(outpath='text_hobbies_FreqDistVisualizer_stop_words.png')


def hobbiesFreqDistVisualizer2():
    '''
    Explore the token frequency distributions of the cooking and gaming categories.
    (The official example raises an error here: the corpus has no label attribute;
    corpus.target should be used instead.)
    '''
    corpus = load_corpus("data/hobbies")
    # Group the documents by hobby label (used for both plots below)
    hobbies = defaultdict(list)
    for text, label in zip(corpus.data, corpus.target):
        hobbies[label].append(text)
    # Frequency distribution for the cooking category
    vectorizer = CountVectorizer(stop_words='english')
    docs = vectorizer.fit_transform(text for text in hobbies['cooking'])
    features = vectorizer.get_feature_names()
    visualizer = FreqDistVisualizer(features=features)
    visualizer.fit(docs)
    visualizer.poof(outpath='text_hobbies_cooking_FreqDistVisualizer.png')
    # Frequency distribution for the gaming category
    vectorizer = CountVectorizer(stop_words='english')
    docs = vectorizer.fit_transform(text for text in hobbies['gaming'])
    features = vectorizer.get_feature_names()
    visualizer = FreqDistVisualizer(features=features)
    visualizer.fit(docs)
    visualizer.poof(outpath='text_hobbies_gaming_FreqDistVisualizer.png')


def hobbiesTSNEVisualizer():
    '''
    t-SNE corpus visualization.
    t-distributed stochastic neighbor embedding (t-SNE) is implemented in
    scikit-learn as the sklearn.manifold.TSNE transformer. It decomposes
    high-dimensional document vectors into two or three dimensions, so the
    documents can be displayed with a scatter plot.
    '''
    corpus = load_corpus("data/hobbies")
    tfidf = TfidfVectorizer()
    docs = tfidf.fit_transform(corpus.data)
    labels = corpus.target
    tsne = TSNEVisualizer()
    tsne.fit(docs, labels)
    tsne.poof(outpath='text_hobbies_TSNEVisualizer.png')
    # Plot again without coloring the points by class
    tsne = TSNEVisualizer(labels=["documents"])
    tsne.fit(docs)
    tsne.poof(outpath='text_hobbies_TSNEVisualizer_nocolor.png')


def hobbiesClusterTSNEVisualizer():
    '''
    Visualize k-means clusters of the corpus with t-SNE
    '''
    corpus = load_corpus("data/hobbies")
    tfidf = TfidfVectorizer()
    docs = tfidf.fit_transform(corpus.data)
    clusters = KMeans(n_clusters=5)
    clusters.fit(docs)
    tsne = TSNEVisualizer()
    tsne.fit(docs, ["c{}".format(c) for c in clusters.labels_])
    tsne.poof(outpath='text_hobbies_cluster_TSNEVisualizer.png')
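These functions rely on a load_corpus helper that the Yellowbrick text tutorial leaves to the reader. Below is a minimal sketch of one, assuming the corpus is laid out on disk as one subdirectory per hobby category containing one plain-text file per document; the Bunch fields simply mirror the attributes used above, so treat it as a reconstruction rather than an official API:

import os

from sklearn.utils import Bunch


def load_corpus(path):
    # Each subdirectory of `path` is one category; each file in it is a document.
    categories = [d for d in os.listdir(path)
                  if os.path.isdir(os.path.join(path, d))]
    data, target = [], []
    for category in categories:
        folder = os.path.join(path, category)
        for name in os.listdir(folder):
            with open(os.path.join(folder, name), encoding='utf-8') as f:
                data.append(f.read())
            target.append(category)
    # .data and .target are the attributes the functions above depend on
    return Bunch(categories=categories, data=data, target=target)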

 

The generated figures:

text_hobbies_FreqDistVisualizer.png: token frequencies for the full corpus

text_hobbies_FreqDistVisualizer_stop_words.png: token frequencies with English stop words removed

text_hobbies_cooking_FreqDistVisualizer.png: token frequencies for the cooking category
