Scikit-Learn Learning Diary (V) Model Selection: Selection of Estimated Model Parameters

Time: The evening of March 25, 2017

I. Model score and cross-validation

After choosing an algorithm to fit a data set, how do we judge how well it fits and whether it suits the data at all? Every estimator exposes a score method that evaluates the trained model on the data it is given; the higher the score, the better. Take the following SVM model as an example.

In [1]: from sklearn import datasets, svm
In [2]: digits = datasets.load_digits()
In [3]: X_digits = digits.data
In [4]: y_digits = digits.target
In [5]: svc = svm.SVC(C=1, kernel='linear')
In [6]: svc.fit(X_digits[:-100], y_digits[:-100]).score(X_digits[-100:], y_digits[-100:])
Out[6]: 0.97999999999999998

The model scores 0.98, so the fit is very good. Note that the model was trained on the first 1697 samples and scored on the last 100, which it never saw during training.
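For a classifier such as SVC, score reports the mean accuracy on the data it is given. A minimal sketch that checks this by comparing score with accuracy_score from sklearn.metrics (a function the session above does not use; it is brought in here only for illustration):

```python
from sklearn import datasets, svm
from sklearn.metrics import accuracy_score

digits = datasets.load_digits()
X_digits, y_digits = digits.data, digits.target

# Train on everything except the last 100 samples, as in the session above.
svc = svm.SVC(C=1, kernel='linear')
svc.fit(X_digits[:-100], y_digits[:-100])

# Score on the 100 held-out samples in two ways.
X_test, y_test = X_digits[-100:], y_digits[-100:]
score = svc.score(X_test, y_test)
manual = accuracy_score(y_test, svc.predict(X_test))
print(score, manual)
```

Both numbers should agree, which shows that the score here is simply the fraction of held-out digits that were classified correctly.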

To get a more reliable estimate of prediction accuracy, we can split the data set into several folds and use each fold in turn for testing while training on the others.

In [8]: import numpy as np
In [9]: X_folds = np.array_split(X_digits, 3)
In [10]: y_folds = np.array_split(y_digits, 3)
In [11]: scores = list()
In [12]: for k in range(3):
    ...:     X_train = list(X_folds)
    ...:     X_test  = X_train.pop(k)
    ...:     X_train = np.concatenate(X_train)
    ...:     y_train = list(y_folds)
    ...:     y_test  = y_train.pop(k)
    ...:     y_train = np.concatenate(y_train)
    ...:     scores.append(svc.fit(X_train, y_train).score(X_test, y_test))
    ...: print(scores)
    ...: 
[0.93489148580968284, 0.95659432387312182, 0.93989983305509184]

The data set is divided into three folds. In each round, one fold is held out as test data and the other two are merged into a single array of training data; the model is trained on that array and then scored on the held-out fold. np.concatenate() does the merging.
This is K-fold cross-validation.
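The pop-and-concatenate step is easier to see on a tiny array. A minimal sketch, assuming a toy data set of six samples with two features each (the toy array and the shapes list are made up for illustration and are not part of the original session):

```python
import numpy as np

X = np.arange(12).reshape(6, 2)   # six toy samples, two features each
folds = np.array_split(X, 3)      # three folds of two samples each

shapes = []
for k in range(3):
    train_parts = list(folds)             # copy the list so pop() leaves folds intact
    test = train_parts.pop(k)             # fold k is held out for testing
    train = np.concatenate(train_parts)   # the remaining folds are stacked row-wise
    shapes.append((test.shape, train.shape))
    print(k, test.shape, train.shape)
```

Every round holds out two samples and trains on the remaining four, so each sample is used for testing exactly once.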

II. Cross-validation

Scikit-learn provides a collection of classes that implement popular strategies for splitting a data set into training and test sets.

These classes expose a split method that accepts a data set and yields, for each round of cross-validation, the indices of the training set and of the test set.

In [5]: from sklearn.model_selection import KFold, cross_val_score
In [6]: X = ["a", "a", "b", "c", "c", "c"]
In [7]: k_fold = KFold(n_splits=3)
In [8]: for train_indices, test_indices in k_fold.split(X):
   ...:     print('Train: %s | test: %s' % (train_indices, test_indices))
   ...: 
Train: [2 3 4 5] | test: [0 1]
Train: [0 1 4 5] | test: [2 3]
Train: [0 1 2 3] | test: [4 5]
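Because split yields index arrays rather than the data itself, the indices still have to be applied to the data. A minimal sketch, assuming the list is first converted to a NumPy array so that it can be indexed with an array of positions (the splits list is only there to collect the results):

```python
import numpy as np
from sklearn.model_selection import KFold

# A plain Python list cannot be indexed with an index array, so convert it first.
X = np.array(["a", "a", "b", "c", "c", "c"])
k_fold = KFold(n_splits=3)

splits = []
for train_indices, test_indices in k_fold.split(X):
    X_train, X_test = X[train_indices], X[test_indices]
    splits.append((X_train.tolist(), X_test.tolist()))
    print('Train: %s | test: %s' % (X_train, X_test))
```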

The split method does the fold bookkeeping for us, so there is no need to write code that cuts the data into parts and stores them, which is more convenient. In the scikit-learn version used here, n_splits defaults to 3 (since version 0.22 the default is 5); the minimum is 2. The sample code is as follows:

In [10]: k_fold = KFold(n_splits=3)
In [11]: svc = svm.SVC(C=1, kernel='linear')
In [12]: [svc.fit(X_digits[train], y_digits[train]).score(X_digits[test], y_digits[test])
    ...:  for train, test in k_fold.split(X_digits)]
Out[12]: [0.93489148580968284, 0.95659432387312182, 0.93989983305509184]

The cross-validation scores can be computed directly with cross_val_score. Given an estimator, a data set and a splitter, cross_val_score repeatedly splits the data into a training set and a test set; in each iteration it trains the estimator on the training set and scores it on the test set, and it returns one score per iteration.
A reference example of cross_val_score is as follows:

In [15]: cross_val_score(svc, X_digits, y_digits, cv=k_fold, n_jobs=-1)
Out[15]: array([ 0.93489149,  0.95659432,  0.93989983])

n_jobs=-1 means that the computation is dispatched across all CPUs of the computer.
In addition, a different scoring method can be specified through the scoring argument. The code is as follows:
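To check that cross_val_score does the same work as the hand-written loop, the two results can be compared directly. A minimal sketch, assuming the same digits data and linear SVC as above; n_jobs is left at its default here, and the mean and standard deviation at the end are just one common way to summarize the folds:

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import KFold, cross_val_score

digits = datasets.load_digits()
X_digits, y_digits = digits.data, digits.target
svc = svm.SVC(C=1, kernel='linear')
k_fold = KFold(n_splits=3)

# Hand-written loop: fit on the training indices, score on the test indices.
manual = np.array([
    svc.fit(X_digits[train], y_digits[train]).score(X_digits[test], y_digits[test])
    for train, test in k_fold.split(X_digits)
])

# The same computation delegated to cross_val_score.
auto = cross_val_score(svc, X_digits, y_digits, cv=k_fold)

print(np.allclose(manual, auto))
print('mean %.3f, std %.3f' % (auto.mean(), auto.std()))
```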

In [16]: cross_val_score(svc, X_digits, y_digits, cv=k_fold,
    ...:                 scoring='precision_macro')
Out[16]: array([ 0.93969761,  0.95911415,  0.94041254])

You can see that the scores differ from the previous ones, because precision_macro measures macro-averaged precision rather than the default accuracy.
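When several metrics are of interest, they can be computed in a single pass. A minimal sketch using cross_validate, which is not used in the original session and, as far as I know, requires scikit-learn 0.19 or later (newer than the version the outputs above come from):

```python
from sklearn import datasets, svm
from sklearn.model_selection import KFold, cross_validate

digits = datasets.load_digits()
X_digits, y_digits = digits.data, digits.target
svc = svm.SVC(C=1, kernel='linear')
k_fold = KFold(n_splits=3)

# One cross-validation run, two metrics per fold.
results = cross_validate(svc, X_digits, y_digits, cv=k_fold,
                         scoring=['accuracy', 'precision_macro'])
accuracy = results['test_accuracy']
precision = results['test_precision_macro']
print(accuracy)
print(precision)
```

The returned dictionary holds one array per metric, keyed as test_ followed by the metric name, so the two scores can be compared fold by fold.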

For cross-validation, see the blog: http://blog.csdn.net/cherdw/article/details/54986863

III. Grid search and cross-validation

Grid search:

Scikit-learn provides an object that, given data, computes the score while fitting an estimator over a grid of parameter values and chooses the parameters that maximize the cross-validation score. The object takes an estimator at construction time and itself exposes the estimator API.

In [22]: from sklearn.model_selection import GridSearchCV, cross_val_score
In [23]: Cs = np.logspace(-6, -1, 10)
In [24]: clf = GridSearchCV(estimator=svc, param_grid=dict(C=Cs),
    ...:                    n_jobs=-1)
In [25]: clf.fit(X_digits[:1000], y_digits[:1000])
Out[25]: 
GridSearchCV(cv=None, error_score='raise',
       estimator=SVC(C=1, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False),
       fit_params={}, iid=True, n_jobs=-1,
       param_grid={'C': array([  1.00000e-06,   3.59381e-06,   1.29155e-05,   4.64159e-05,
         1.66810e-04,   5.99484e-04,   2.15443e-03,   7.74264e-03,
         2.78256e-02,   1.00000e-01])},
       pre_dispatch='2*n_jobs', refit=True, return_train_score=True,
       scoring=None, verbose=0)
In [26]: clf.best_score_
Out[26]: 0.92500000000000004
In [27]: clf.best_estimator_.C
Out[27]: 0.0077426368268112772
In [28]: clf.score(X_digits[1000:], y_digits[1000:])
Out[28]: 0.94353826850690092
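To see how the best value of C was chosen, the per-candidate scores can be read from the cv_results_ attribute. A minimal sketch, assuming the same grid as above; cv=3 is passed explicitly so that the run mirrors the 3-fold default of the version used in this article, and n_jobs is left at its default:

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

digits = datasets.load_digits()
X_digits, y_digits = digits.data, digits.target

Cs = np.logspace(-6, -1, 10)
clf = GridSearchCV(estimator=svm.SVC(kernel='linear'),
                   param_grid=dict(C=Cs), cv=3)
clf.fit(X_digits[:1000], y_digits[:1000])

# One mean cross-validation score per candidate value of C.
mean_scores = clf.cv_results_['mean_test_score']
for C, s in zip(Cs, mean_scores):
    print('C=%.2e  mean CV score=%.3f' % (C, s))

print(clf.best_params_, clf.best_score_)
```

best_score_ is the entry of mean_test_score at position best_index_, and best_estimator_ is an SVC refit on all of the training data with the winning C.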

By default, GridSearchCV uses 3-fold cross-validation in the scikit-learn version used here (5-fold since version 0.22). However, if it detects that a classifier was passed rather than a regressor, it uses stratified 3-fold cross-validation.
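The classifier-versus-regressor rule can be observed with check_cv from sklearn.model_selection, the helper that, to my understanding, scikit-learn uses to turn an integer cv into a splitter. A minimal sketch; the variable names are only for illustration:

```python
from sklearn import datasets
from sklearn.model_selection import check_cv

digits = datasets.load_digits()
y_digits = digits.target

# Same integer cv, resolved once for a classifier and once for a regressor.
cv_for_classifier = check_cv(3, y_digits, classifier=True)
cv_for_regressor = check_cv(3, y_digits, classifier=False)

print(type(cv_for_classifier).__name__)
print(type(cv_for_regressor).__name__)
```

Stratification keeps the class proportions roughly equal in every fold, which matters when some classes are rare.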
Nested cross validation:

In [29]: cross_val_score(clf, X_digits, y_digits)
Out[29]: array([ 0.93853821,  0.96327212,  0.94463087])
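Nested cross-validation runs two loops: an inner one that picks C and an outer one that scores the tuned model on data the search never saw. A minimal sketch that writes the outer loop out by hand, assuming a plain 3-fold KFold for the outer loop and a smaller grid of four C values to keep it quick (both are choices made here, not taken from the session above):

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.base import clone
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

digits = datasets.load_digits()
X, y = digits.data, digits.target

Cs = np.logspace(-4, -1, 4)
clf = GridSearchCV(svm.SVC(kernel='linear'), param_grid=dict(C=Cs), cv=3)

outer = KFold(n_splits=3)
outer_scores = []
for train, test in outer.split(X):
    # Inner loop: the grid search sees only the outer training part.
    search = clone(clf).fit(X[train], y[train])
    # Outer loop: score the refit best model on the held-out part.
    outer_scores.append(search.score(X[test], y[test]))

# The same procedure delegated to cross_val_score.
nested = cross_val_score(clf, X, y, cv=outer)
print(outer_scores)
print(nested)
```

Because C is chosen without ever looking at the outer test fold, the outer scores estimate how well the whole tuning procedure generalizes.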

Cross-validation to set a parameter can be done more efficiently on an algorithm-by-algorithm basis. For this reason, some estimators come in a cross-validated variant that sets its parameter automatically by cross-validation.

In [30]: from sklearn import linear_model, datasets
In [31]: lasso = linear_model.LassoCV()
In [32]: diabetes = datasets.load_diabetes()
In [33]: X_diabetes = diabetes.data
In [34]: y_diabetes = diabetes.target

In [35]: lasso.fit(X_diabetes, y_diabetes)
Out[35]: 
LassoCV(alphas=None, copy_X=True, cv=None, eps=0.001, fit_intercept=True,
    max_iter=1000, n_alphas=100, n_jobs=1, normalize=False, positive=False,
    precompute='auto', random_state=None, selection='cyclic', tol=0.0001,
    verbose=False)
In [36]: lasso.alpha_
Out[36]: 0.012291895087486173
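LassoCV can also be given an explicit list of candidate alphas and a number of folds, which makes it easier to see what it searched over. A minimal sketch, assuming a hand-picked grid of 20 alphas and cv=3 (both values are illustrative choices, not taken from the session above):

```python
import numpy as np
from sklearn import datasets, linear_model

X_diabetes, y_diabetes = datasets.load_diabetes(return_X_y=True)

# Candidate regularization strengths, chosen here only for illustration.
alphas = np.logspace(-3, 0, 20)
lasso = linear_model.LassoCV(alphas=alphas, cv=3)
lasso.fit(X_diabetes, y_diabetes)

print(lasso.alpha_)             # the alpha selected by cross-validation
print(lasso.mse_path_.shape)    # one mean squared error per (alpha, fold)
```

The selected alpha_ is always one of the candidates, and mse_path_ holds the per-fold test errors that the selection is based on.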

These are a beginner's self-study notes and reflect personal opinion only.

Posted by simonsays on Mon, 15 Jul 2019 12:17:59 -0700
