Cs231n Assignment 1 SVM of in-depth learning series

Keywords: Programming Lambda SQL Hadoop

Write at the beginning: Finally I copied the svm job of the big guys again. It's up to me to copy it. I'll attach a link to the big guys who talked better to me in the response location. Thank them.
Recently, facing the pressure of finding an internship, this employment is still very high, what can I do with such dishes, or have to spare time to do a good job in machine learning, and then add a little more in-depth learning framework and SQL and Hadoop learning should be the same, so before September next year, we must train ourselves at least to be able to stand on the level of interns.Later, I'll reorganize the plan and update the content.

Content Arrangement

What I want to share today is a full version of assignment 1's assignment work on the implementation of linear SVM after cs231n. I'll share the formula I used in the middle. This time I'll open the ipynb suffix file of SVM as I did last time. Then I'll use the py file linear_classifier and linear_svm, respectively. This timeThe author will type it like an official document, keep the official English comments and make them where I think they are important. Then I will try to explain this job as clearly as possible by breaking sentences and explaining them.

Start Completing Job

1. Loading data
First, we need to load the data, so we can do it without the tube contents. These two steps are the same as the last KNN.

# Run some setup code for this notebook.

import random
import numpy as np
from cs231n.data_utils import load_CIFAR10
import matplotlib.pyplot as plt



# This is a bit of magic to make matplotlib figures appear inline in the
# notebook rather than in a new window.
%matplotlib inline
plt.rcParams['figure.figsize'] = (10.0, 8.0) # set default size of plots
plt.rcParams['image.interpolation'] = 'nearest'
plt.rcParams['image.cmap'] = 'gray'

# Some more magic so that the notebook will reload external python modules;
# see http://stackoverflow.com/questions/1907993/autoreload-of-modules-in-ipython
%load_ext autoreload
%autoreload 2
# Load the raw CIFAR-10 data.
cifar10_dir = 'cs231n/datasets/cifar-10-batches-py'
X_train, y_train, X_test, y_test = load_CIFAR10(cifar10_dir)

# As a sanity check, we print out the size of the training and test data.
print('Training data shape: ', X_train.shape)
print('Training labels shape: ', y_train.shape)
print('Test data shape: ', X_test.shape)
print('Test labels shape: ', y_test.shape)
Training data shape:  (50000, 32, 32, 3)
Training labels shape:  (50000,)
Test data shape:  (10000, 32, 32, 3)
Test labels shape:  (10000,)
classes = ['plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']
num_classes = len(classes)
samples_per_class = 7
for y, cls in enumerate(classes):
    idxs = np.flatnonzero(y_train == y)
    idxs = np.random.choice(idxs, samples_per_class, replace=False)
    for i, idx in enumerate(idxs):
        plt_idx = i * num_classes + y + 1
        plt.subplot(samples_per_class, num_classes, plt_idx)
        plt.imshow(X_train[idx].astype('uint8'))
        plt.axis('off')
        if i == 0:
            plt.title(cls)
plt.show()


2. Data Preprocessing
Here we need to divide the data into a training set, a validation set, a test set, and a dev trial set, which is a small sample to test whether the program is working properly.We can't intersect the val test set and the train training set when we select the test set here to achieve a random effect.

# Split the data into train, val, and test sets. In addition we will
# create a small development set as a subset of the training data;
# we can use this for development so our code runs faster.
num_training = 49000
num_validation = 1000
num_test = 1000
num_dev = 500

# Our validation set will be num_validation points from the original
# training set.
mask = range(num_training, num_training + num_validation)
X_val = X_train[mask]
y_val = y_train[mask]

# Our training set will be the first num_train points from the original
# training set.
mask = range(num_training)
X_train = X_train[mask]
y_train = y_train[mask]

# We will also make a development set, which is a small subset of
# the training set.
mask = np.random.choice(num_training, num_dev, replace=False)
X_dev = X_train[mask]
y_dev = y_train[mask]

# We use the first num_test points of the original test set as our
# test set.
mask = range(num_test)
X_test = X_test[mask]
y_test = y_test[mask]

print('Train data shape: ', X_train.shape)
print('Train labels shape: ', y_train.shape)
print('Validation data shape: ', X_val.shape)
print('Validation labels shape: ', y_val.shape)
print('Test data shape: ', X_test.shape)
print('Test labels shape: ', y_test.shape)
Train data shape:  (49000, 32, 32, 3)
Train labels shape:  (49000,)
Validation data shape:  (1000, 32, 32, 3)
Validation labels shape:  (1000,)
Test data shape:  (1000, 32, 32, 3)
Test labels shape:  (1000,)

Next, we need to vectorize the data because our linear svm uses the matrix multiplication of Wx, where each column of W represents all the pixel templates of a category and each row of X represents the vectorized values of all the pixel points of each sample.

# Preprocessing: reshape the image data into rows
X_train = np.reshape(X_train, (X_train.shape[0], -1))
X_val = np.reshape(X_val, (X_val.shape[0], -1))
X_test = np.reshape(X_test, (X_test.shape[0], -1))
X_dev = np.reshape(X_dev, (X_dev.shape[0], -1))

# As a sanity check, print out the shapes of the data
print('Training data shape: ', X_train.shape)
print('Validation data shape: ', X_val.shape)
print('Test data shape: ', X_test.shape)
print('dev data shape: ', X_dev.shape)
Training data shape:  (49000, 3072)
Validation data shape:  (1000, 3072)
Test data shape:  (1000, 3072)
dev data shape:  (500, 3072)

Next, let's show what the average of the pixel points of the training sample looks like.

# Preprocessing: subtract the mean image
# first: compute the image mean based on the training data
mean_image = np.mean(X_train, axis=0)
print(mean_image[:10]) # print a few of the elements
plt.figure(figsize=(4,4))
plt.imshow(mean_image.reshape((32,32,3)).astype('uint8')) # visualize the mean image
plt.show()
[130.64189 135.98174 132.47392 130.0557  135.34804 131.75401 130.96056
 136.14328 132.47636 131.48468]


Then all the training test samples are subtracted by such a pixel mean, the specific operation of this step is not very understood, but the statistical operations that often subtract the mean move the center point, which we can understand from this side, probably in order to make the center point of each dimension 0, and then reduce the degree of fitting, which is equivalent to reducing the degree of eigenvalue, reducing the calculationQuantity, refer to this link article here Differences between de-mean, whitening and centralization.

# second: subtract the mean image from train and test data
X_train -= mean_image
X_val -= mean_image
X_test -= mean_image
X_dev -= mean_image

Finally, add one more item for deviation.

# third: append the bias dimension of ones (i.e. bias trick) so that our SVM
# only has to worry about optimizing a single weight matrix W.
X_train = np.hstack([X_train, np.ones((X_train.shape[0], 1))])
X_val = np.hstack([X_val, np.ones((X_val.shape[0], 1))])
X_test = np.hstack([X_test, np.ones((X_test.shape[0], 1))])
X_dev = np.hstack([X_dev, np.ones((X_dev.shape[0], 1))])

print(X_train.shape, X_val.shape, X_test.shape, X_dev.shape)
(49000, 3073) (1000, 3073) (1000, 3073) (500, 3073)

3. Complete and run SVM
Here, if we run ipynb's SVM code in jupyter directly, we will output an exception because our SVM core code has not been written yet, so we need to open the linear_svm.py file to program the ToDo task. In this file, we have two tasks to complete, the first one is to use loops to program the gradient of w, the second is to use vector momentsThe matrix operation calculates the loss loss function and the gradient of W and outputs it.If a friend doesn't know what a gradient is, it can be interpreted as a partial derivative, so we know that one of the most traditional derivatives is to use the definition of the derivative, but this method requires updating one parameter at a time, which is very slow, so we can use the calculus to get the gradient, and then we assume the full text uses such a loss function.

Here, delta means that when the prediction score function of the correct class is greater than that of the misplaced class, +1, we recognize that there is no loss in the prediction. This function means that when our first sample makes a prediction, each sample of i is predicted as the sum of the score of the wrong class and the correct class, +1, that is, the loss for the first sample.
So to find the gradient for this function, you just need to derive the corresponding w, and you get,

The above formula is a gradient calculation after misclassification, and 1 () is an exponential function that returns 1 if the parentheses condition is satisfied or 0 if it is not, that is, the gradient for the prediction error class is equal to the pixel value of the corresponding x.Then for the gradient of the correct predicted class, we update the weights in the opposite direction when we update the sum of all the losses predicted in class i. The intuitive understanding is that the larger the absolute skewness of the correct class, the stronger the weights need to be, because the prediction is too inaccurate without increasing the weights.

So this is our core formula for calculating the dW gradient. In fact, it is to determine if the score standard is greater than 0. If more than 0 pairs of error classes return Xi pairs of correct classes return -xi, so let's show you how this code is done.

import numpy as np
from random import shuffle
from past.builtins import xrange

def svm_loss_naive(W, X, y, reg):
  """
  Structured SVM loss function, naive implementation (with loops).

  Inputs have dimension D, there are C classes, and we operate on minibatches
  of N examples.

  Inputs:
  - W: A numpy array of shape (D, C) containing weights.
  - X: A numpy array of shape (N, D) containing a minibatch of data.
  - y: A numpy array of shape (N,) containing training labels; y[i] = c means
    that X[i] has label c, where 0 <= c < C.
  - reg: (float) regularization strength

  Returns a tuple of:
  - loss as single float
  - gradient with respect to weights W; an array of same shape as W
  """
  dW = np.zeros(W.shape) # initialize the gradient as zero

  # compute the loss and the gradient
  num_classes = W.shape[1]
  num_train = X.shape[0]
  loss = 0.0
  for i in xrange(num_train):
    scores = X[i].dot(W)
    correct_class_score = scores[y[i]]
    for j in xrange(num_classes):
      if j == y[i]:
        continue
      margin = scores[j] - correct_class_score + 1 # note delta = 1
      if margin > 0:
        loss += margin
        dW[:,j] += X[i].T
        dW[:,y[i]] -= X[i].T
  # Right now the loss is a sum over all training examples, but we want it
  # to be an average instead so we divide by num_train.
  loss /= num_train
  dW /= num_train
  # Add regularization to the loss.
  loss += reg * np.sum(W * W)
  dW += reg * W
  #############################################################################
  # TODO:                                                                     #
  # Compute the gradient of the loss function and store it dW.                #
  # Rather that first computing the loss and then computing the derivative,   #
  # it may be simpler to compute the derivative at the same time that the     #
  # loss is being computed. As a result you may need to modify some of the    #
  # code above to compute the gradient.                                       #
  #############################################################################


  return loss, dW


def svm_loss_vectorized(W, X, y, reg):
  """
  Structured SVM loss function, vectorized implementation.

  Inputs and outputs are the same as svm_loss_naive.
  """
  loss = 0.0
  dW = np.zeros(W.shape) # initialize the gradient as zero

  #############################################################################
  # TODO:                                                                     #
  # Implement a vectorized version of the structured SVM loss, storing the    #
  # result in loss.                                                           #
  #############################################################################
  pass
  num_train = X.shape[0]
  scores = np.dot(X, W)
  correct_class_score = scores[range(num_train), list(y)].reshape(-1, 1)#Become column
  margin = np.maximum(0, scores - correct_class_score + 1)
  margin[range(num_train), list(y)] = 0
  loss = np.sum(margin)/num_train + reg * np.sum(W*W)
  #############################################################################
  #                             END OF YOUR CODE                              #
  #############################################################################


  #############################################################################
  # TODO:                                                                     #
  # Implement a vectorized version of the gradient for the structured SVM     #
  # loss, storing the result in dW.                                           #
  #                                                                           #
  # Hint: Instead of computing the gradient from scratch, it may be easier    #
  # to reuse some of the intermediate values that you used to compute the     #
  # loss.                                                                     #
  #############################################################################
  pass
  num_class = W.shape[1]
  m = np.zeros((num_train, num_class))
  m[margin > 0] = 1
  m[range(num_train), list(y)] = 0
  m[range(num_train), list(y)] = -np.sum(m, axis=1)
  dW = np.dot(X.T, m)/num_train + reg * W

  #############################################################################
  #                             END OF YOUR CODE                              #
  #############################################################################

  return loss, dW

The code accomplishes three tasks that need to be done in this article. It is worth noting that we need to add a reg regular item to both the calculated dw and loss to avoid overfitting them.
So let's run the job code and see what happens.

# Evaluate the naive implementation of the loss we provided for you:
from cs231n.classifiers.linear_svm import svm_loss_naive
import time

# generate a random SVM weight matrix of small numbers
W = np.random.randn(3073, 10) * 0.0001 

loss, grad = svm_loss_naive(W, X_dev, y_dev, 0.000005)
print('loss: %f' % (loss, ))

The first step is to derive the loss function from the svm's general loop calculation, which does not require us to write anything.

loss: 9.035526

Then running the next code will test if the loss function we write is correct. One criterion is to compare the solution we get from the calculus with the solution defined by the derivative. If the difference is small, the calculation is correct.

# Once you've implemented the gradient, recompute it with the code below
# and gradient check it with the function we provided for you
from cs231n.classifiers.linear_svm import svm_loss_naive
import time
# Compute the loss and its gradient at W.
loss, grad = svm_loss_naive(W, X_dev, y_dev, 0.0)

# Numerically compute the gradient along several randomly chosen dimensions, and
# compare them with your analytically computed gradient. The numbers should match
# almost exactly along all dimensions.
from cs231n.gradient_check import grad_check_sparse
f = lambda w: svm_loss_naive(w, X_dev, y_dev, 0.0)[0]
grad_numerical = grad_check_sparse(f, W, grad)

# do the gradient check once again with regularization turned on
# you didn't forget the regularization gradient did you?
loss, grad = svm_loss_naive(W, X_dev, y_dev, 5e1)
f = lambda w: svm_loss_naive(w, X_dev, y_dev, 5e1)[0]
grad_numerical = grad_check_sparse(f, W, grad)
numerical: -11.134769 analytic: -11.134769, relative error: 7.492919e-12
numerical: -26.201969 analytic: -26.201969, relative error: 2.276806e-11
numerical: 17.007888 analytic: 17.007888, relative error: 8.322863e-12
numerical: 24.357402 analytic: 24.357402, relative error: 3.966234e-12
numerical: 11.060589 analytic: 11.060589, relative error: 1.475823e-11
numerical: 14.618717 analytic: 14.618717, relative error: 6.643635e-12
numerical: -28.965377 analytic: -28.965377, relative error: 1.620809e-12
numerical: -12.924270 analytic: -12.924270, relative error: 1.946887e-12
numerical: -37.095280 analytic: -37.095280, relative error: 3.942003e-12
numerical: 16.260672 analytic: 16.260672, relative error: 4.586001e-13
numerical: -35.859216 analytic: -35.862116, relative error: 4.042911e-05
numerical: -0.298664 analytic: -0.288489, relative error: 1.732885e-02
numerical: 20.924849 analytic: 20.921766, relative error: 7.368654e-05
numerical: 15.027211 analytic: 15.025921, relative error: 4.291960e-05
numerical: 6.215818 analytic: 6.223036, relative error: 5.802678e-04
numerical: 0.161299 analytic: 0.156143, relative error: 1.624406e-02
numerical: -9.542728 analytic: -9.541662, relative error: 5.582052e-05
numerical: 3.184572 analytic: 3.178109, relative error: 1.015709e-03
numerical: 7.230887 analytic: 7.239002, relative error: 5.608088e-04
numerical: 37.075271 analytic: 37.067189, relative error: 1.090062e-04

We can see that the loss of numericla and analytic calculations are very close, so we think there is no problem with programming, but there is another problem. Why the results of the two calculations are still slightly different? This problem will be left for analysis in the next series of thinking questions.
Next, we need to use the matrix operation just written to calculate loss and dw functions. In fact, the idea of matrix operation is a whole operation, which integrates for loop into a whole operation, but the details of implementation may require programming attention.

# Next implement the function svm_loss_vectorized; for now only compute the loss;
# we will implement the gradient in a moment.
tic = time.time()
loss_naive, grad_naive = svm_loss_naive(W, X_dev, y_dev, 0.000005)
toc = time.time()
print('Naive loss: %e computed in %fs' % (loss_naive, toc - tic))
print(loss)
Naive loss: 9.138494e+00 computed in 0.092990s
9.15405317783763
from cs231n.classifiers.linear_svm import svm_loss_vectorized
tic = time.time()
loss_vectorized, _ = svm_loss_vectorized(W, X_dev, y_dev, 0.000005)
toc = time.time()
print('Vectorized loss: %e computed in %fs' % (loss_vectorized, toc - tic))
print(loss_vectorized)
# The losses should match but your vectorized implementation should be much faster.
print(loss_naive - loss_vectorized)
Vectorized loss: 9.138494e+00 computed in 0.001954s
9.138493682409688
2.4868995751603507e-14

You can see that the difference between the result of vector calculation and the result of for loop is very small, which may be caused by the different storage precision.Of course, we compare loss better because we use the F-norm to compare the gradients calculated by the two methods for a single value.

# Complete the implementation of svm_loss_vectorized, and compute the gradient
# of the loss function in a vectorized way.

# The naive implementation and the vectorized implementation should match, but
# the vectorized version should still be much faster.
tic = time.time()
_, grad_naive = svm_loss_naive(W, X_dev, y_dev, 0.000005)
toc = time.time()

print('Naive loss and gradient: computed in %fs' % (toc - tic))

tic = time.time()
_, grad_vectorized = svm_loss_vectorized(W, X_dev, y_dev, 0.000005)
toc = time.time()

print('Vectorized loss and gradient: computed in %fs' % (toc - tic))

# The loss is a single number, so it is easy to compare the values computed
# by the two implementations. The gradient on the other hand is a matrix, so
# we use the Frobenius norm to compare them.
difference = np.linalg.norm(grad_naive - grad_vectorized, ord='fro')
print('difference: %f' % difference)
Vectorized loss and gradient: computed in 0.015625s
difference: 0.000000

You can see that there is also little difference, and vector operations are computed faster.Of course, even if the speed of calculation is faster, when we face a lot of data, if we calculate all the data each time, or book updates the weight through all the data, it will undoubtedly greatly increase the burden of calculation. To solve this problem, we use SGD, which is also called random gradient descent. The difference between this method and gradient descent is random. The main idea is as follows:It's the idea of randomly removing 200 rows of data from 5,000 rows (numpy.random.choice implements this idea) and iterating over local data, then selecting 2,000 times so that you can quickly calculate and update weights with small-scale data, then you'll move on to the train function in another file, linear_classifier.py, filled with random gradientsThe downward core function, and all we need to do is implement random data collection.Then there is one more thing to do to predict the categories of our train data based on the final W.The code is as follows.

from __future__ import print_function

import numpy as np
from cs231n.classifiers.linear_svm import *
from cs231n.classifiers.softmax import *
from past.builtins import xrange


class LinearClassifier(object):

  def __init__(self):
    self.W = None

  def train(self, X, y, learning_rate=1e-3, reg=1e-5, num_iters=100,
            batch_size=200, verbose=False):
    """
    Train this linear classifier using stochastic gradient descent.

    Inputs:
    - X: A numpy array of shape (N, D) containing training data; there are N
      training samples each of dimension D.
    - y: A numpy array of shape (N,) containing training labels; y[i] = c
      means that X[i] has label 0 <= c < C for C classes.
    - learning_rate: (float) learning rate for optimization.
    - reg: (float) regularization strength.
    - num_iters: (integer) number of steps to take when optimizing
    - batch_size: (integer) number of training examples to use at each step.
    - verbose: (boolean) If true, print progress during optimization.

    Outputs:
    A list containing the value of the loss function at each training iteration.
    """
    num_train, dim = X.shape
    num_classes = np.max(y) + 1 # assume y takes values 0...K-1 where K is number of classes
    if self.W is None:
      # lazily initialize W
      self.W = 0.001 * np.random.randn(dim, num_classes)

    # Run stochastic gradient descent to optimize W
    loss_history = []
    for it in xrange(num_iters):
      index = np.random.choice(num_train, batch_size, replace=True)
      X_batch = X[index]
      y_batch = y[index]

      #########################################################################
      # TODO:                                                                 #
      # Sample batch_size elements from the training data and their           #
      # corresponding labels to use in this round of gradient descent.        #
      # Store the data in X_batch and their corresponding labels in           #
      # y_batch; after sampling X_batch should have shape (dim, batch_size)   #
      # and y_batch should have shape (batch_size,)                           #
      #                                                                       #
      # Hint: Use np.random.choice to generate indices. Sampling with         #
      # replacement is faster than sampling without replacement.              #
      #########################################################################
      pass
      #########################################################################
      #                       END OF YOUR CODE                                #
      #########################################################################

      # evaluate loss and gradient
      loss, grad = self.loss(X_batch, y_batch, reg)
      loss_history.append(loss)
      self.W += -learning_rate * grad
      # perform parameter update
      #########################################################################
      # TODO:                                                                 #
      # Update the weights using the gradient and the learning rate.          #
      #########################################################################
      pass
      #########################################################################
      #                       END OF YOUR CODE                                #
      #########################################################################

      if verbose and it % 100 == 0:
        print('iteration %d / %d: loss %f' % (it, num_iters, loss))

    return loss_history

  def predict(self, X):
    """
    Use the trained weights of this linear classifier to predict labels for
    data points.

    Inputs:
    - X: A numpy array of shape (N, D) containing training data; there are N
      training samples each of dimension D.

    Returns:
    - y_pred: Predicted labels for the data in X. y_pred is a 1-dimensional
      array of length N, and each element is an integer giving the predicted
      class.
    """
    y_pred = np.zeros(X.shape[0])
    scores = np.dot(X, self.W)
    y_pred = np.argmax(scores, axis=1)
    ###########################################################################
    # TODO:                                                                   #
    # Implement this method. Store the predicted labels in y_pred.            #
    ###########################################################################
    pass
    ###########################################################################
    #                           END OF YOUR CODE                              #
    ###########################################################################
    return y_pred
  
  def loss(self, X_batch, y_batch, reg):
    """
    Compute the loss function and its derivative. 
    Subclasses will override this.

    Inputs:
    - X_batch: A numpy array of shape (N, D) containing a minibatch of N
      data points; each point has dimension D.
    - y_batch: A numpy array of shape (N,) containing labels for the minibatch.
    - reg: (float) regularization strength.

    Returns: A tuple containing:
    - loss as a single float
    - gradient with respect to self.W; an array of the same shape as W
    """
    pass


class LinearSVM(LinearClassifier):
  """ A subclass that uses the Multiclass SVM loss function """

  def loss(self, X_batch, y_batch, reg):
    return svm_loss_vectorized(self.W, X_batch, y_batch, reg)


class Softmax(LinearClassifier):
  """ A subclass that uses the Softmax + Cross-entropy loss function """

  def loss(self, X_batch, y_batch, reg):
    return softmax_loss_vectorized(self.W, X_batch, y_batch, reg)

Let's run the job code and see the results.

# In the file linear_classifier.py, implement SGD in the function
# LinearClassifier.train() and then run it with the code below.

from cs231n.classifiers import LinearSVM
svm = LinearSVM()
tic = time.time()
loss_hist = svm.train(X_train, y_train, learning_rate=1e-7, reg=2.5e4,
                      num_iters=2000, verbose=True)
toc = time.time()
print('That took %fs' % (toc - tic))
iteration 0 / 2000: loss 785.193662
iteration 100 / 2000: loss 469.951361
iteration 200 / 2000: loss 285.152670
iteration 300 / 2000: loss 173.940086
iteration 400 / 2000: loss 106.734203
iteration 500 / 2000: loss 66.067783
iteration 600 / 2000: loss 41.010264
iteration 700 / 2000: loss 27.581060
iteration 800 / 2000: loss 18.393378
iteration 900 / 2000: loss 12.956286
iteration 1000 / 2000: loss 10.400926
iteration 1100 / 2000: loss 8.312311
iteration 1200 / 2000: loss 6.956938
iteration 1300 / 2000: loss 6.775997
iteration 1400 / 2000: loss 5.950798
iteration 1500 / 2000: loss 5.256139
iteration 1600 / 2000: loss 5.524237
iteration 1700 / 2000: loss 5.749462
iteration 1800 / 2000: loss 6.222899
iteration 1900 / 2000: loss 5.239182
That took 9.655279s

You can see that as the number of iterations increases, the overall loss function decreases, that is, the prediction accuracy increases.Let's take a look at the drawing below.

# A useful debugging strategy is to plot the loss as a function of
# iteration number:
plt.plot(loss_hist)
plt.xlabel('Iteration number')
plt.ylabel('Loss value')
plt.show()


So you can see that as the number of iterations increases, the size of the loss function decreases rapidly and tends to stabilize. Let's see how accurate the predictions for train and val data are.

# Write the LinearSVM.predict function and evaluate the performance on both the
# training and validation set
y_train_pred = svm.predict(X_train)
print('training accuracy: %f' % (np.mean(y_train == y_train_pred), ))
y_val_pred = svm.predict(X_val)
print('validation accuracy: %f' % (np.mean(y_val == y_val_pred), ))
training accuracy: 0.376714
validation accuracy: 0.388000

Accuracy at 0.37 and 0.38 is generally not good, but for linear svm the results are OK.Finally, cross-validation is used to select the best parameters. Here we train train data with different parameters, then validate the data, and select the most accurate set of parameters to be substituted for test to output the final result.
The last task here is to complete the code for cross-validation. The main idea is to cycle through different parameters to make predictions, then keep updating the most accurate parameters.The code is as follows.

# Use the validation set to tune hyperparameters (regularization strength and
# learning rate). You should experiment with different ranges for the learning
# rates and regularization strengths; if you are careful you should be able to
# get a classification accuracy of about 0.4 on the validation set.

learning_rates = [1.4e-7, 1.5e-7, 1.6e-7]
regularization_strengths = [8000.0, 9000.0, 10000.0, 11000.0, 18000.0, 19000.0, 20000.0, 21000.0]

# results is dictionary mapping tuples of the form
# (learning_rate, regularization_strength) to tuples of the form
# (training_accuracy, validation_accuracy). The accuracy is simply the fraction
# of data points that are correctly classified.
results = {}
best_lr = None
best_reg = None
best_val = -1   # The highest validation accuracy that we have seen so far.
best_svm = None # The LinearSVM object that achieved the highest validation rate.
for lr in learning_rates:
    for reg in regularization_strengths:
        svm = LinearSVM()
        loss_history = svm.train(X_train, y_train, learning_rate = lr, reg = reg, num_iters = 5000)
        y_train_pred = svm.predict(X_train)
        accuracy_train = np.mean(y_train_pred == y_train)
        y_val_pred = svm.predict(X_val)
        accuracy_val = np.mean(y_val_pred == y_val)
        if accuracy_val > best_val:
            best_lr = lr
            best_reg = reg
            best_val = accuracy_val
            best_svm = svm
        results[(lr, reg)] = accuracy_train, accuracy_val
        
################################################################################
# TODO:                                                                        #
# Write code that chooses the best hyperparameters by tuning on the validation #
# set. For each combination of hyperparameters, train a linear SVM on the      #
# training set, compute its accuracy on the training and validation sets, and  #
# store these numbers in the results dictionary. In addition, store the best   #
# validation accuracy in best_val and the LinearSVM object that achieves this  #
# accuracy in best_svm.                                                        #
#                                                                              #
# Hint: You should use a small value for num_iters as you develop your         #
# validation code so that the SVMs don't take much time to train; once you are #
# confident that your validation code works, you should rerun the validation   #
# code with a larger value for num_iters.                                      #
################################################################################
pass
################################################################################
#                              END OF YOUR CODE                                #
################################################################################
    
# Print out results.
for lr, reg in sorted(results):
    train_accuracy, val_accuracy = results[(lr, reg)]
    print('lr %e reg %e train accuracy: %f val accuracy: %f' % (
                lr, reg, train_accuracy, val_accuracy))
    
print('best validation accuracy achieved during cross-validation: %f' % best_val)
lr 1.400000e-07 reg 8.000000e+03 train accuracy: 0.399837 val accuracy: 0.402000
lr 1.400000e-07 reg 9.000000e+03 train accuracy: 0.391000 val accuracy: 0.396000
lr 1.400000e-07 reg 1.000000e+04 train accuracy: 0.392224 val accuracy: 0.393000
lr 1.400000e-07 reg 1.100000e+04 train accuracy: 0.391837 val accuracy: 0.392000
lr 1.400000e-07 reg 1.800000e+04 train accuracy: 0.383347 val accuracy: 0.393000
lr 1.400000e-07 reg 1.900000e+04 train accuracy: 0.377673 val accuracy: 0.373000
lr 1.400000e-07 reg 2.000000e+04 train accuracy: 0.383551 val accuracy: 0.386000
lr 1.400000e-07 reg 2.100000e+04 train accuracy: 0.383347 val accuracy: 0.385000
lr 1.500000e-07 reg 8.000000e+03 train accuracy: 0.394367 val accuracy: 0.390000
lr 1.500000e-07 reg 9.000000e+03 train accuracy: 0.392653 val accuracy: 0.393000
lr 1.500000e-07 reg 1.000000e+04 train accuracy: 0.393878 val accuracy: 0.402000
lr 1.500000e-07 reg 1.100000e+04 train accuracy: 0.385612 val accuracy: 0.365000
lr 1.500000e-07 reg 1.800000e+04 train accuracy: 0.382837 val accuracy: 0.393000
lr 1.500000e-07 reg 1.900000e+04 train accuracy: 0.377633 val accuracy: 0.386000
lr 1.500000e-07 reg 2.000000e+04 train accuracy: 0.383571 val accuracy: 0.383000
lr 1.500000e-07 reg 2.100000e+04 train accuracy: 0.384469 val accuracy: 0.382000
lr 1.600000e-07 reg 8.000000e+03 train accuracy: 0.393816 val accuracy: 0.390000
lr 1.600000e-07 reg 9.000000e+03 train accuracy: 0.394673 val accuracy: 0.394000
lr 1.600000e-07 reg 1.000000e+04 train accuracy: 0.392776 val accuracy: 0.392000
lr 1.600000e-07 reg 1.100000e+04 train accuracy: 0.387388 val accuracy: 0.383000
lr 1.600000e-07 reg 1.800000e+04 train accuracy: 0.383143 val accuracy: 0.393000
lr 1.600000e-07 reg 1.900000e+04 train accuracy: 0.381224 val accuracy: 0.390000
lr 1.600000e-07 reg 2.000000e+04 train accuracy: 0.379959 val accuracy: 0.387000
lr 1.600000e-07 reg 2.100000e+04 train accuracy: 0.375388 val accuracy: 0.397000
best validation accuracy achieved during cross-validation: 0.402000
# Visualize the cross-validation results
import math
x_scatter = [math.log10(x[0]) for x in results]
y_scatter = [math.log10(x[1]) for x in results]

# plot training accuracy
marker_size = 100
colors = [results[x][0] for x in results]
plt.subplot(2, 1, 1)
plt.scatter(x_scatter, y_scatter, marker_size, c=colors)
plt.colorbar()
plt.xlabel('log learning rate')
plt.ylabel('log regularization strength')
plt.title('CIFAR-10 training accuracy')

# plot validation accuracy
colors = [results[x][1] for x in results] # default size of markers is 20
plt.subplot(2, 1, 2)
plt.scatter(x_scatter, y_scatter, marker_size, c=colors)
plt.colorbar()
plt.xlabel('log learning rate')
plt.ylabel('log regularization strength')
plt.title('CIFAR-10 validation accuracy')
plt.show()


The fluctuations of prediction accuracy under different parameters are shown. Let's take a look at the prediction accuracy of test using the optimal parameters.

# Evaluate the best svm on test set
y_test_pred = best_svm.predict(X_test)
test_accuracy = np.mean(y_test == y_test_pred)
print('linear SVM on raw pixels final test set accuracy: %f' % test_accuracy)
linear SVM on raw pixels final test set accuracy: 0.377000

The prediction accuracy is 0.37, which is similar to the training set. As to whether the prediction is more accurate or not, we can see from the previous loss function graph that we need to increase a large number of iterations in the future to improve the accuracy slightly, so it is not cost-effective, so we think this prediction result is reasonable.
Finally, we will show the W we trained visually to see what we trained after half a day. The explanation course for W says that it is a template for each category. When a sample is entered and multiplied with the template to get a corresponding score, the high score indicates that you fit all parts of the template well.

# Visualize the learned weights for each class.
# Depending on your choice of learning rate and regularization strength, these may
# or may not be nice to look at.
w = best_svm.W[:-1,:] # strip out the bias
w = w.reshape(32, 32, 3, 10)
w_min, w_max = np.min(w), np.max(w)
classes = ['plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']
for i in range(10):
    plt.subplot(2, 5, i + 1)
      
    # Rescale the weights to be between 0 and 255
    wimg = 255.0 * (w[:, :, :, i].squeeze() - w_min) / (w_max - w_min)
    plt.imshow(wimg.astype('uint8'))
    plt.axis('off')
    plt.title(classes[i])


This picture can show that horse is the two-headed horse that we talked about in the class.

epilogue
This time, we have a preliminary understanding of the classification implementation of linear svm, and we will continue to update this series later.
Thank you for reading.
Reference resources
cs231n svm course assignments red stone

Twenty-four original articles were published. 100 were praised. 4361 were visited
Private letter follow

Posted by cornelombaard on Fri, 13 Mar 2020 18:51:52 -0700