Download Data
import os
import tarfile             # used to open and extract the .tgz archive
import urllib.request

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml/master/"
HOUSING_PATH = "datasets/housing"
HOUSING_URL = DOWNLOAD_ROOT + HOUSING_PATH + "/housing.tgz"

# Download Data
def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    if not os.path.isdir(housing_path):
        os.makedirs(housing_path)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    # urlretrieve() downloads the remote file directly to a local path
    urllib.request.urlretrieve(housing_url, tgz_path)
    housing_tgz = tarfile.open(tgz_path)
    # extract the archive into housing_path (otherwise it extracts into the current directory)
    housing_tgz.extractall(path=housing_path)
    housing_tgz.close()

fetch_housing_data()
Loading data
import pandas as pd

def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)
housing = load_housing_data()
housing.head()
View data structure
info()
The info() method gives a quick description of the data, in particular the total number of rows, each attribute's type, and the number of non-null values.
housing.info()
# Analysis: the dataset contains 20640 instances. By machine learning standards this
# is quite small, but it is ideal for getting started.
# Note that total_bedrooms has only 20433 non-null values, which means 207 districts
# are missing this feature. We'll deal with that later.
value_counts()
All attributes are numerical except ocean_proximity. Its type is object, so it could hold any Python object, but since the data was loaded from a CSV file it must be text. When you looked at the first five rows, you may have noticed that the values in this column repeat, which suggests it is a categorical attribute. You can use the value_counts() method to see which categories exist and how many districts belong to each:
housing["ocean_proximity"].value_counts()
describe()
The describe() method shows a summary of the numerical attributes.
housing.describe()
Graphical description
Use matplotlib's hist() to plot a histogram of each numerical attribute, which is more intuitive.
import matplotlib.pyplot as plt

housing.hist(bins=50, figsize=(10, 10))
plt.show()   # not strictly necessary in a Jupyter notebook
- The housing median age and the median house value were capped, which shows up as a tall bar at the right end of their histograms. There are two ways to handle this:
- 1. Re-collect proper values for the districts whose labels were capped.
- 2. Remove those districts from the training set (a minimal sketch of this option follows this list).
- Several attributes have long tails: they stretch much farther to the right of the median than to the left. This can make it harder for some algorithms to detect patterns, so later we will try transforming these attributes into more bell-shaped distributions.
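Here is a minimal sketch of option 2 (removing the capped districts). It assumes the cap shows up as the attribute's maximum value; verify that on your copy of the data before relying on it.

# Option 2 sketch: drop the districts whose median_house_value sits at the cap.
# Assumption: the cap is exactly the attribute's maximum value.
cap = housing["median_house_value"].max()
capped = housing["median_house_value"] == cap
print(capped.sum(), "districts are at the cap of", cap)
housing_uncapped = housing[~capped]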
Create Test Set
At this stage the data should be split. If you look at the test set, you may inadvertently pick a particular machine learning model based on patterns you noticed in it. When you later use the test set to estimate the generalization error, the estimate will be too optimistic and the deployed system will perform worse than expected. This is called data snooping bias.
Three ways to split the data follow:
1. Purely random sampling. The method below produces a different test set each time you run the program. One solution is to save the test set on the first run and load it in subsequent runs. Another is to seed the random number generator (e.g. np.random.seed(42)) before calling np.random.permutation(), so that the shuffled indices are always the same (a one-line illustration follows the code below).
But both of these solutions break the next time you fetch an updated dataset, so this is still not perfect.
import numpy as np

def split_train_test(data, test_ratio):
    # np.random.permutation(n) returns the integers 0..n-1 in random order
    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]

train_set, test_set = split_train_test(housing, 0.2)
print(len(train_set), "train +", len(test_set), "test")
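As a quick illustration of the seeding idea mentioned above, here is a sketch that reuses split_train_test and fixes the shuffle across runs:

np.random.seed(42)   # the shuffled indices are now the same on every run
train_set, test_set = split_train_test(housing, 0.2)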
2. Split by each instance's hash value: compute a hash of each instance's identifier and put the instance in the test set if the hash is low enough (here, if its last byte is below 20% of 256). This keeps the test set consistent across runs, even after the dataset is refreshed.
import hashlib

def test_set_check(identifier, test_ratio, hash):
    # put the instance in the test set if the last byte of its hash is < 256 * test_ratio
    return hash(np.int64(identifier)).digest()[-1] < 256 * test_ratio

def split_train_test_by_id(data, test_ratio, id_column, hash=hashlib.md5):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio, hash))
    return data.loc[~in_test_set], data.loc[in_test_set]

housing_with_id = housing.reset_index()   # adds an `index` column
train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, "index")

# a more stable identifier built from features that never change
housing_with_id["id"] = housing["longitude"] * 1000 + housing["latitude"]
train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, "id")
print(len(train_set), "train +", len(test_set), "test")
3. Scikit-Learn's splitting function
Scikit-Learn provides several functions to split a dataset into multiple subsets in various ways. The simplest is `train_test_split`, which does pretty much the same thing as the `split_train_test` function above, with a couple of extra features. For example, it has a `random_state` parameter that lets you set the random generator seed as described earlier.
from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)
print(len(train_set), "train +", len(test_set), "test")
Scikit-Learn splitting function 2: stratified sampling
train_test_split performs purely random sampling, which is fine when the dataset is large enough. If it is not, you run the risk of introducing sampling bias, so stratified sampling should be used instead. You can do this with Scikit-Learn's StratifiedShuffleSplit class.
housing["income_cat"] = np.ceil(housing["median_income"] / 1.5) # ceil Rounding off values (to produce discrete classifications) divided by 1.5 To limit the number of income classifications housing["income_cat"].where(housing["income_cat"] < 5, 5.0, inplace=True) # Categorize all categories in 5 into Category 5 from sklearn.model_selection import StratifiedShuffleSplit split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42) for train_index, test_index in split.split(housing, housing["income_cat"]): strat_train_set = housing.loc[train_index] strat_test_set = housing.loc[test_index] # Remember to cull`income_cat`attribute for set in (strat_train_set, strat_test_set): set.drop(["income_cat"], axis=1, inplace=True)
Data exploration, visualization and discovery
Now that you've looked at the data, you need to understand it.
Only explore the training set. If the training set is very large, you may want to sample an exploration set to speed things up (not needed in this example).
Create a copy so you can experiment without harming the training set:
housing = strat_train_set.copy()
Geographic Data Visualization
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4, s=housing["population"]/100, label="population", c="median_house_value", cmap=plt.get_cmap("jet"), colorbar=True,) # The radius of each circle represents the population of the block (option)`s`),Color represents price (option)`c`) plt.legend()
Find Correlations
Correlation coefficient 1
When the dataset is not very large, it is easy to calculate the standard correlation coefficient (also known as Pearson correlation coefficient) between each pair of attributes using the corr() method.
corr_matrix = housing.corr()
corr_matrix["median_house_value"].sort_values(ascending=False)
Correlation coefficient 2
attributes = ["median_house_value", "median_income", "total_rooms", "housing_median_age"] pd.plotting.scatter_matrix(housing[attributes], figsize=(12, 8))
Attribute Combination Experiment
Some attributes are not very useful on their own but become useful when combined with others. For example, the total number of rooms in a district and the number of households are not very informative by themselves, but their ratio, the number of rooms per household, is much more useful. So in practice you try out attribute combinations and then compare the correlation coefficients again.
housing["rooms_per_household"] = housing["total_rooms"]/housing["households"] housing["bedrooms_per_room"] = housing["total_bedrooms"]/housing["total_rooms"] housing["population_per_household"]=housing["population"]/housing["households"] corr_matrix = housing.corr() corr_matrix["median_house_value"].sort_values(ascending=False)
Don't do these transformations manually; write functions for them instead (a small sketch follows this list). Reasons:
- Functions let you reproduce the transformations easily on any dataset
- You gradually build a library of transformation functions that you can reuse in future projects
- They make it easy to try out multiple transformations
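As a minimal sketch of such a reusable function, the ratio features computed above can be wrapped up so they can be re-applied to any DataFrame with the same columns (the function name below is our own, not from Scikit-Learn or the book):

def add_ratio_features(df):
    """Return a copy of df with the three ratio attributes added."""
    df = df.copy()
    df["rooms_per_household"] = df["total_rooms"] / df["households"]
    df["bedrooms_per_room"] = df["total_bedrooms"] / df["total_rooms"]
    df["population_per_household"] = df["population"] / df["households"]
    return df

housing_with_ratios = add_ratio_features(housing)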
The first step is to separate features from labels
housing = strat_train_set.drop("median_house_value", axis=1)
housing_labels = strat_train_set["median_house_value"]
Data cleaning
There are three options for handling the missing values (total_bedrooms in this example):
- Remove the corresponding districts: dropna()
- Remove the whole attribute: drop()
- Fill in the missing values (with 0, the mean, the median, etc.): fillna(). With this option, remember to save the value you used (mean, median, ...) so that you can later fill the test set with the same value.
housing.dropna(subset=["total_bedrooms"])    # option 1
housing.drop("total_bedrooms", axis=1)       # option 2
median = housing["total_bedrooms"].median()
housing["total_bedrooms"].fillna(median)     # option 3
Scikit-Learn provides a class to take care of missing values: Imputer
from sklearn.preprocessing import Imputer

imputer = Imputer(strategy="median")
# the median can only be computed on numerical attributes, so create a copy
# of the data without the text attribute `ocean_proximity`
housing_num = housing.drop("ocean_proximity", axis=1)
# fit the `imputer` instance to the training data with the `fit()` method
imputer.fit(housing_num)
X = imputer.transform(housing_num)   # returns a plain NumPy array with the missing values filled in
housing_tr = pd.DataFrame(X, columns=housing_num.columns)   # put the result back into a DataFrame
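As a quick sanity check (our own addition), the medians the imputer learned are stored in its statistics_ attribute and should match the medians computed directly on the data:

print(imputer.statistics_)            # medians learned by the imputer
print(housing_num.median().values)    # medians computed directly; the two should match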
Scikit-Learn Design
Scikit-Learn's API is remarkably well designed. Its main design principles are listed below; a small sketch illustrating the fit/transform conventions follows the list.
- Consistency. All objects share a consistent and simple interface:
  - Estimators. Any object that can estimate some parameters based on a dataset is called an estimator (for example, an imputer is an estimator). The estimation itself is performed by the fit() method, which takes only a dataset as a parameter (two for supervised learning algorithms; the second dataset contains the labels). Any other parameter needed to guide the estimation process is considered a hyperparameter (such as an imputer's strategy), and it is set as an instance variable (generally via a constructor parameter).
  - Transformers. Some estimators (such as an imputer) can also transform a dataset; these are called transformers. Once again the API is quite simple: the transformation is performed by the transform() method, which takes the dataset to transform as a parameter and returns the transformed dataset. The transformation generally relies on the learned parameters, as is the case for an imputer. All transformers also have a convenience method, fit_transform(), equivalent to calling fit() and then transform() (but fit_transform() is sometimes optimized to run faster).
  - Predictors. Finally, some estimators can make predictions for a given dataset; they are called predictors. For example, the LinearRegression model in the previous chapter is a predictor: it predicts life satisfaction from a country's GDP per capita. A predictor has a predict() method that takes a dataset of new instances and returns the corresponding predictions, and a score() method that measures the quality of the predictions on a test set (and the corresponding labels, in the case of supervised learning algorithms).
- Inspection. All of an estimator's hyperparameters are accessible directly via public instance variables (e.g. imputer.strategy), and all of its learned parameters are accessible via public instance variables with an underscore suffix (e.g. imputer.statistics_).
- Nonproliferation of classes. Datasets are represented as NumPy arrays or SciPy sparse matrices instead of homemade classes, and hyperparameters are just regular Python strings or numbers.
- Composition. Existing building blocks are reused as much as possible. For example, any sequence of transformers followed by a final estimator can be combined into a Pipeline, as you will see later.
- Sensible defaults. Scikit-Learn provides reasonable default values for most parameters, making it easy to quickly build a working baseline system.
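A small sketch (our own, not from the book) illustrating these conventions with the Imputer fitted earlier: the hyperparameter is a plain public attribute, and fit_transform() gives the same result as fit() followed by transform():

print(imputer.strategy)        # hyperparameter, exposed as a public instance variable

# fit_transform() is equivalent to fit() followed by transform()
X1 = Imputer(strategy="median").fit(housing_num).transform(housing_num)
X2 = Imputer(strategy="median").fit_transform(housing_num)
print(np.allclose(X1, X2))     # True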
Working with text and categorical attributes
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
housing_cat = housing["ocean_proximity"]
housing_cat_encoded = encoder.fit_transform(housing_cat)
housing_cat_encoded
OneHotEncoder
The idea is to create one binary attribute per category: an attribute equal to 1 when the category is `<1H OCEAN` (and 0 otherwise), another equal to 1 when the category is `INLAND` (and 0 otherwise), and so on. This is called one-hot encoding, because only one attribute is equal to 1 (hot) while the others are 0 (cold).
Note: fit_transform() expects a 2D array, but `housing_cat_encoded` is a 1D array, so it needs to be reshaped.
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder()
housing_cat_1hot = encoder.fit_transform(housing_cat_encoded.reshape(-1, 1))
# housing_cat_1hot is a SciPy sparse matrix that stores only the nonzero entries,
# which saves memory when there are many categories.
# Use toarray() to convert it to a dense NumPy array
housing_cat_1hot.toarray()
LabelBinarizer
Using the LabelBinarizer class, we can apply both transformations in one shot (from text categories to integer categories, then from integer categories to one-hot vectors).
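Here is a minimal sketch of that one-step approach, reusing the housing_cat series from above. Note that LabelBinarizer returns a dense NumPy array by default; you can pass sparse_output=True to the constructor to get a sparse matrix instead.

from sklearn.preprocessing import LabelBinarizer

encoder = LabelBinarizer()
housing_cat_1hot = encoder.fit_transform(housing_cat)   # text categories -> one-hot array
housing_cat_1hot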