Learning the sklearn datasets module

Keywords: Python

The sklearn.datasets module provides functions for loading bundled datasets, downloading datasets online, and generating datasets locally. You can list them with the dir or help command. They come in three main forms: load_<dataset_name>, fetch_<dataset_name>, and make_<dataset_name>.

① datasets.load_<dataset_name>: small datasets bundled with the sklearn package

 

    In [2]: datasets.load_*?
    datasets.load_boston  # Boston house price data set (removed in scikit-learn 1.2)
    datasets.load_breast_cancer  # Breast cancer data set
    datasets.load_diabetes  # Diabetes data set
    datasets.load_digits  # Handwritten digit data set
    datasets.load_files
    datasets.load_iris  # Iris data set
    datasets.load_lfw_pairs
    datasets.load_lfw_people
    datasets.load_linnerud  # Fitness training data set
    datasets.load_mlcomp  # removed in later scikit-learn releases
    datasets.load_sample_image
    datasets.load_sample_images
    datasets.load_svmlight_file
    datasets.load_svmlight_files

 

The data files are stored under datasets\data in the sklearn installation directory.
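As a minimal sketch of how a load_* function is typically used (load_iris is chosen here only as an illustration), the call returns a Bunch object whose data and target attributes hold the feature matrix and the labels:

```python
from sklearn import datasets

# Load the bundled iris data set; the files ship with sklearn, so no download is needed.
iris = datasets.load_iris()

X, y = iris.data, iris.target
print(X.shape)             # (150, 4)
print(y.shape)             # (150,)
print(iris.feature_names)  # names of the four measured features
print(iris.target_names)   # names of the three species
```

Passing return_X_y=True to load_iris returns the (data, target) pair directly instead of the Bunch.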

② datasets.fetch_<dataset_name>: larger datasets, mainly used for testing and for solving practical problems; they are downloaded online

 

    In [3]: datasets.fetch_*?
    datasets.fetch_20newsgroups
    datasets.fetch_20newsgroups_vectorized
    datasets.fetch_california_housing
    datasets.fetch_covtype
    datasets.fetch_kddcup99
    datasets.fetch_lfw_pairs
    datasets.fetch_lfw_people
    datasets.fetch_mldata  # removed in later scikit-learn releases (replaced by fetch_openml)
    datasets.fetch_olivetti_faces
    datasets.fetch_rcv1
    datasets.fetch_species_distributions

 

Downloaded data is saved in the ~/scikit_learn_data folder by default. You can change the path by setting the SCIKIT_LEARN_DATA environment variable, and you can query the current download path with datasets.get_data_home().

 

    In [5]: datasets.get_data_home()
    Out[5]: 'G:\\datasets'
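Besides the environment variable, the location can also be chosen per call. The following is a small sketch, assuming a throwaway temporary folder (the name sk_cache is hypothetical and used only for illustration):

```python
import os
import tempfile

from sklearn import datasets

# Hypothetical cache location, used only for this illustration.
cache_dir = os.path.join(tempfile.mkdtemp(), "sk_cache")

# An explicit data_home takes precedence over the default folder and over
# the SCIKIT_LEARN_DATA environment variable; the folder is created if missing.
resolved = datasets.get_data_home(data_home=cache_dir)
print(resolved)
print(os.path.isdir(resolved))
```

Fetch functions such as fetch_california_housing accept the same data_home argument, so a download can be directed to a specific folder without changing the environment.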

 

③ datasets.make_<dataset_name>: generate synthetic datasets locally

 

    In [4]: datasets.make_*?
    datasets.make_biclusters
    datasets.make_blobs
    datasets.make_checkerboard
    datasets.make_circles
    datasets.make_classification
    datasets.make_friedman1
    datasets.make_friedman2
    datasets.make_friedman3
    datasets.make_gaussian_quantiles
    datasets.make_hastie_10_2
    datasets.make_low_rank_matrix
    datasets.make_moons
    datasets.make_multilabel_classification
    datasets.make_regression
    datasets.make_s_curve
    datasets.make_sparse_coded_signal
    datasets.make_sparse_spd_matrix
    datasets.make_sparse_uncorrelated
    datasets.make_spd_matrix
    datasets.make_swiss_roll
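For a quick feel of what these generators return, here is a short sketch using make_blobs from the list above; the sample count, the number of centers and the seed are arbitrary values chosen for illustration:

```python
from sklearn import datasets

# Three Gaussian clusters in two dimensions; random_state fixes the draw.
X, labels = datasets.make_blobs(
    n_samples=90, n_features=2, centers=3, random_state=0
)

print(X.shape)       # (90, 2)
print(labels.shape)  # (90,)
```

Each generator returns plain NumPy arrays, so the result can be passed straight to an estimator or a plotting call.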

Take the make_regression() function as an example. First, look at the function signature:

 

make_regression(n_samples=100, n_features=100, n_informative=10, n_targets=1, bias=0.0, effective_rank=None, tail_strength=0.5, noise=0.0, shuffle=True, coef=False, random_state=None)

Parameter Description:

n_samples: number of samples

n_features: number of features (independent variables)

n_informative: number of informative features, that is, the features actually used to build the linear model that generates the output

n_targets: number of targets (dependent variables)

bias: bias term (intercept) of the underlying linear model

coef: whether to also return the coefficients of the underlying linear model

 

    In [7]: data = datasets.make_regression(n_samples=5, n_features=3, n_informative=2, n_targets=2, bias=1.0, coef=True)
       ...: data
       ...:
    Out[7]:
    (array([[-0.64470031,  2.24028402, -2.26147027],
            [-0.09554589,  1.4653344 , -0.8882202 ],
            [-1.36214673,  0.08935031,  0.66733545],
            [-1.30553824,  1.62553382,  0.65693763],
            [-0.81528358,  0.81659886,  1.32412053]]),
     array([[ 177.32114822,  -42.34640341],
            [ 127.51997766,   -1.98105497],
            [ -37.82547178, -104.69214796],
            [ 100.19123506,  -95.62163254],
            [  45.35860387,  -59.94143654]]),
     array([[ 34.3135368 ,  77.79161196],
            [ 88.57943632,   3.03795085],
            [  0.        ,   0.        ]]))

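In the output above, the three arrays in the tuple are the input data X, the output data y, and the coefficients coef, in that order. The last row of coef is all zeros because only n_informative=2 of the 3 features contribute to y.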
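The relationship between the three arrays can be checked directly. The sketch below reuses the settings from the session and adds random_state=0, which is an assumption introduced here only to make the run reproducible (the values will differ from the output shown above):

```python
import numpy as np
from sklearn import datasets

# Same settings as the session above, plus a fixed seed (an assumption
# added here so that the result is reproducible).
X, y, coef = datasets.make_regression(
    n_samples=5, n_features=3, n_informative=2,
    n_targets=2, bias=1.0, coef=True, random_state=0,
)

print(X.shape, y.shape, coef.shape)  # (5, 3) (5, 2) (3, 2)

# With noise=0.0 the targets are an exact linear function of the inputs.
is_linear = np.allclose(y, X @ coef + 1.0)
print(is_linear)  # True

# Only n_informative rows of coef are non-zero.
n_nonzero_rows = int(np.sum(np.any(coef != 0, axis=1)))
print(n_nonzero_rows)  # 2
```

Setting noise to a positive value adds Gaussian noise to y, so the exact equality above would then hold only approximately.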

Posted by purplehaze on Sun, 05 Apr 2020 05:52:12 -0700