Recommendation Engine for Spark MLlib (2) - Evaluating the Recommendation Model

Keywords: Apache Spark, SQL, Scala

The content and code in this article build on the last article, so we recommend taking a look at that one first~
We implemented a movie recommender in the last article, but are its recommendations reasonable? To answer that, we need to evaluate the model.
For recommendation models, the usual evaluations are based on the mean squared error and the mean average precision at K, and MLlib provides built-in functions for both evaluation methods.

In practice, you need to keep trying different values for the three key parameters of the recommendation model (rank, iterations, and lambda), evaluate the model produced by each combination, and pick the best one, as sketched below.
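For instance, a simple parameter search over a held-out test set might look like the following sketch. This is only an illustration under assumed names: the rmse helper, the 80/20 split, and the candidate values are not from the original code, and it reuses the ratings RDD built later in this article.

import org.apache.spark.mllib.recommendation.{ALS, MatrixFactorizationModel, Rating}
import org.apache.spark.rdd.RDD

//Hypothetical helper: RMSE of a trained model over a held-out data set
def rmse(model: MatrixFactorizationModel, data: RDD[Rating]): Double = {
  val userProducts = data.map(r => (r.user, r.product))
  val predictions = model.predict(userProducts).map(r => ((r.user, r.product), r.rating))
  val actualAndPredicted = data.map(r => ((r.user, r.product), r.rating)).join(predictions)
  math.sqrt(actualAndPredicted.map{ case (_, (actual, predicted)) => math.pow(actual - predicted, 2) }.mean())
}

//Assumed 80/20 train/test split of the ratings RDD; candidate values are arbitrary examples
val Array(training, test) = ratings.randomSplit(Array(0.8, 0.2), seed = 42L)
val results = for {
  rank       <- Seq(10, 50)
  iterations <- Seq(5, 10)
  lambda     <- Seq(0.01, 0.1)
} yield ((rank, iterations, lambda), rmse(ALS.train(training, rank, iterations, lambda), test))

val best = results.minBy(_._2)
println(s"Best (rank, iterations, lambda) = ${best._1}, test RMSE = ${best._2}")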

Here are the two methods for evaluating a recommendation model~

1. Mean Squared Error (MSE) and Root Mean Squared Error (RMSE)
Definition: the MSE is the sum of squared errors divided by the number of observations, i.e. the average of the squared differences between the predicted and the actual ratings: MSE = (1/n) * Σ(actual_i - predicted_i)².
The root mean squared error is also widely used; it is simply the square root of the MSE: RMSE = √MSE~

The evaluation code is:

//Format: (userID, movieID)
val userProducts: RDD[(Int, Int)] = ratings.map(rating => (rating.user, rating.product))
//Ratings predicted by the model, in the format: ((userID, movieID), predicted rating)
val predictions: RDD[((Int, Int), Double)] = model.predict(userProducts).map(rating => ((rating.user, rating.product),rating.rating))
//Format: ((userID, movieID), (actual rating, predicted rating))
val ratingsAndPredictions: RDD[((Int, Int), (Double, Double))] = ratings.map(rating => ((rating.user, rating.product), rating.rating))
                                                                        .join(predictions)
//Mean squared error
val MSE = ratingsAndPredictions.map(rap => math.pow(rap._2._1 - rap._2._2, 2)).reduce(_+_) / ratingsAndPredictions.count()
println("MSE: " + MSE)
//Root mean squared error
val RMSE: Double = math.sqrt(MSE)
println("RMSE: " + RMSE)

That was computed by hand; we can also use the functions built into MLlib:

import org.apache.spark.mllib.evaluation.{RegressionMetrics, RankingMetrics}
val predictedAndTrue: RDD[(Double, Double)] = ratingsAndPredictions.map{ case((userID, product),(actual, predict)) => (actual, predict)}
val regressionMetrics: RegressionMetrics = new RegressionMetrics(predictedAndTrue)
println("MSE: " + regressionMetrics.meanSquaredError)
println("RMSE: " + regressionMetrics.rootMeanSquaredError)

The output is:

MSE: 0.08231947642632852
RMSE: 0.2869137090247319
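Keep in mind that these values are computed on the same ratings the model was trained on, so they measure training error rather than generalization; and since ALS initializes its factors randomly, the exact numbers will vary from run to run.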

2. Mean Average Precision at K (MAPK)
The mean average precision at K (MAPK) is simply the average precision at K (APK) averaged over the entire dataset. APK is a common metric in information retrieval: it measures how relevant, on average, the top K documents returned for a query are.
The more actually relevant documents appear in the results, and the higher they rank, the higher the APK score. Naturally, a recommendation model is better when the items it predicts the highest scores for (and therefore ranks highest in the recommendation list) are actually the most relevant to the user.
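As a quick hand-worked illustration (the item IDs and lists below are invented for the example), APK rewards hits that appear early in the ranked list:

//The user actually interacted with items 10, 20 and 30;
//the model returned the ranked list below (hits at ranks 1 and 3)
val actual    = Seq(10, 20, 30)
val predicted = Seq(10, 99, 20, 98, 97)
val k = 5

var numHits = 0.0
var score   = 0.0
for ((p, i) <- predicted.take(k).zipWithIndex if actual.contains(p)) {
  numHits += 1.0
  score += numHits / (i + 1.0) //each hit adds the precision at its (1-based) rank
}
//Normalize by the best achievable number of hits within the top k
val apk = score / math.min(actual.size, k)
println(apk) //(1/1 + 2/3) / 3 ≈ 0.556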

OK, the MAPK evaluation code is as follows:

package ml

import org.apache.spark.mllib.evaluation.RankingMetrics
import org.apache.spark.mllib.recommendation.{Rating, ALS}
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkContext, SparkConf}
import org.jblas.DoubleMatrix
import sql.StreamingExamples
import scala.collection.Map

object MAPKTest{
  def main(args: Array[String]) {
    StreamingExamples.setStreamingLogLevels()
    val conf = new SparkConf().setAppName("MAPKTest").setMaster("local[*]")
    val sc = new SparkContext(conf)
    /*User movie ratings*/
    val rawData: RDD[String] = sc.textFile("file:///E:/spark/ml-100k/u.data")
    //Drop the timestamp field; rawRatings: RDD[Array[String]]
    val rawRatings = rawData.map(_.split("\\t").take(3))
    //(user, movie, rating)
    val ratings = rawRatings.map{case Array(user, movie, rating) =>{
      Rating(user.toInt, movie.toInt, rating.toDouble)
    }}
    /**
      * Train the model
      * Note: rank = 50 latent factors (the factor dimension), 10 iterations, lambda = 0.01
      */
    val model = ALS.train(ratings, 50, 10, 0.01)

    /*Collect the factor vectors of all items in the model and turn them into a matrix*/
    val itemFactors: Array[Array[Double]] = model.productFeatures.map{case (id, factor) => factor}.collect()
    val itemMatrix: DoubleMatrix = new DoubleMatrix(itemFactors)
//    println(itemMatrix.rows, itemMatrix.columns)

    /*Compute the model's predicted score of every movie for every user*/
    val allRecs = model.userFeatures.map{ case(userId, factor) => {
      val userVector = new DoubleMatrix(factor)
      /**
        * scores is a DoubleMatrix with N rows and 1 column: one predicted score per item.
        * Why does multiplying these two matrices give the scores?
        * Because ALS factorizes the user-item rating matrix into two factor
        * matrices (user factors and item factors), so their product
        * reconstructs the predicted ratings.
        */
      val scores = itemMatrix.mmul(userVector)//Multiply the item matrix by the user vector: this user's score for every item
      //Sort by score in descending order
      val sortedWithId = scores.data.zipWithIndex.sortBy(-_._1)
      //Convert the 0-based matrix index into the 1-based movie ID
      val recommendIds = sortedWithId.map(_._2 + 1).toSeq
      //Return (userId, ranked list of recommended movie IDs)
      (userId, recommendIds)
    }}

    /*Group the movies each user has actually rated, keyed by user ID*/
    val userMoives: RDD[(Int, Iterable[(Int, Int)])] = ratings.map{ case Rating(user, product, rating) => {
      (user, product)
    }}.groupBy(_._1)

    val predictedAndTrueForRanking = allRecs.join(userMoives).map{ case( userId, (predicted, actualWithIds) ) => {
      //IDs of the movies the user actually rated
      val actual = actualWithIds.map(_._2)
      (actual.toArray, predicted.toArray)
    }}
    val rankingMetrics: RankingMetrics[Int] = new RankingMetrics(predictedAndTrueForRanking)
    println("Use built-in computing MAP: " + rankingMetrics.meanAveragePrecision)
  }
}

The output is:

Built-in MAP: 0.0630466936422453
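A note on this number: RankingMetrics.meanAveragePrecision averages the precision over the entire predicted ranking rather than cutting off at K, which is one reason it can differ from a hand-rolled MAPK@10 such as the avgPrecisionK function in the next section. When an explicit cutoff is wanted, RankingMetrics also provides precisionAt(k).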
3. Complete code for the recommendation model
package ml

import org.apache.spark.mllib.evaluation.{RegressionMetrics, RankingMetrics}
import org.apache.spark.mllib.recommendation.{Rating, ALS}
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkContext, SparkConf}
import org.jblas.DoubleMatrix
import sql.StreamingExamples
import scala.collection.Map

/**
  * Recommendation algorithm based on Spark MLlib
  * ALS: alternating least squares
  *
  * @author lwj
  * @date 2018/05/04
  */
object Recommend{
  /**
    * Used for item-based recommendation.
    * Takes two vectors and returns their cosine similarity.
    *
    * @param vec1
    * @param vec2
    * @return
    */
  def cosineSimilarity(vec1: DoubleMatrix, vec2: DoubleMatrix): Double = {
    vec1.dot(vec2) / (vec1.norm2() * vec2.norm2())
  }

  /**
    * Model evaluation:
    * Average Precision at K (APK)
    *
    * @param actual
    * @param predicted
    * @param k
    * @return
    */
  def avgPrecisionK(actual: Seq[Int], predicted: Seq[Int], k: Int) : Double = {
    val predK: Seq[Int] = predicted.take(k)
    var score = 0.0
    var numHits = 0.0
    for ((p, i) <- predK.zipWithIndex){
      if (actual.contains(p)){
        numHits += 1.0
        score += numHits / (i.toDouble + 1.0) //each hit adds the precision at its rank: hits so far divided by the 1-based rank (i + 1)
      }
    }
    if (actual.isEmpty){
      1.0
    }else{
      score / math.min(actual.size, k).toDouble //normalize by min(|actual|, k), the best achievable number of hits in the top k
    }
  }


  def main(args: Array[String]) {
    StreamingExamples.setStreamingLogLevels()
    val conf = new SparkConf().setAppName("recommandTest").setMaster("local[*]")
    val sc = new SparkContext(conf)
    /*User movie ratings*/
    val rawData: RDD[String] = sc.textFile("file:///E:/spark/ml-100k/u.data")
    //Drop the timestamp field; rawRatings: RDD[Array[String]]
    val rawRatings = rawData.map(_.split("\\t").take(3))
    //(user, movie, rating)
    val ratings = rawRatings.map{case Array(user, movie, rating) =>{
      Rating(user.toInt, movie.toInt, rating.toDouble)
    }}
    //Movies
    val movies: RDD[String] = sc.textFile("file:///E:/spark/ml-100k/u.item")
    //(movie ID, movie title)
    val titles: Map[Int, String] = movies.map(_.split("\\|").take(2)).map(array => (array(0).toInt, array(1))).collectAsMap()
    /**
      * Train the model
      * Note: rank = 50 latent factors (the factor dimension), 10 iterations, lambda = 0.01
      */
    val model = ALS.train(ratings, 50, 10, 0.01)

    /**
      * Recommendation based on user
      */
    //Number of user factors
    //  println(model.userFeatures.count())
    //Number of item factors
    //  println(model.productFeatures.count())
    //One user's predicted rating for one item. ALS is initialized randomly, so results may differ between runs
    //  println(model.predict(789, 123))

    //Recommend the top K items for a given user
    val userID = 789
    val K = 10
    val topKRecs: Array[Rating] = model.recommendProducts(userID, K)
    //  println(topKRecs.mkString("\n"))

    //Get the movies rated by the specified user
    val moviesForUser: Seq[Rating] = ratings.keyBy(_.user).lookup(userID)

    //Print out the names and ratings of the top 10 movies rated by a given user
    println("Actual:")
    moviesForUser.sortBy(-_.rating).take(10).map(rating => {
      (titles(rating.product),rating.rating)
    }).foreach(println)

    //Print the names and predicted ratings of the 10 movies recommended to the user, for comparison with the above
    println("Recommended:")
    topKRecs.map(rating => {
      (titles(rating.product),rating.rating)
    }).foreach(println)


    println("\n-----------------------\n")

    /**
      * Item-based recommendation
      */
    /*Find items similar to a given item by its ID*/
    val itemId = 567
    val itemFactor: Array[Double] = model.productFeatures.lookup(itemId).head
    val itemVector: DoubleMatrix = new DoubleMatrix(itemFactor)
    //Get the cosine similarity of each item to the given item
    val sims = model.productFeatures.map{case (id, factor) => {
      val factorVector = new DoubleMatrix(factor)
      val sim = cosineSimilarity(factorVector, itemVector)
      (id, sim)
    }}
    //Take the top K+1 items: the most similar item is the item itself, so it is skipped below
    val topItem: Array[(Int, Double)] = sims.sortBy(-_._2).take(10 + 1)
    println("Items similar to 567:\n" + topItem.mkString("\n") + "\n")

    /*Check by title*/
    println("The given item is: " + titles(itemId))
    println("Similar items are:")
    topItem.slice(1, 11).foreach(item => println(titles(item._1)))


    println("\n-----------------------\n")

    /*Model evaluation*/
    /**
      * Mean squared error evaluation
      * Evaluated over the full dataset
      */
//    val actualRating: Rating = moviesForUser.take(1)(0)
//    val predictedRating: Double = model.predict(789, actualRating.product)
//    println("\n Actual rating:" + actualRating.rating + "  Predicted rating:" + predictedRating)
    //Format: (userID, movieID)
    val userProducts: RDD[(Int, Int)] = ratings.map(rating => (rating.user, rating.product))
    //Ratings predicted by the model, in the format: ((userID, movieID), predicted rating)
    val predictions: RDD[((Int, Int), Double)] = model.predict(userProducts).map(rating => ((rating.user, rating.product),rating.rating))
    //Format: ((userID, movieID), (actual rating, predicted rating))
    val ratingsAndPredictions: RDD[((Int, Int), (Double, Double))] = ratings.map(rating => ((rating.user, rating.product), rating.rating))
                                                                            .join(predictions)
    //Mean squared error
    val MSE = ratingsAndPredictions.map(rap => math.pow(rap._2._1 - rap._2._2, 2)).reduce(_+_) / ratingsAndPredictions.count()
    println("MSE: " + MSE)
    //Root mean squared error
    val RMSE: Double = math.sqrt(MSE)
    println("RMSE: " + RMSE)

    /**
      * Average precision at K evaluation
      * Note: this measures the model's ability to predict and rank the items
      * a user is actually interested in, i.e. it evaluates the model as a recommender for users
      */
    /*Compute the APK metric for the recommendations of a single user*/
    val actualMovies: Seq[Int] = moviesForUser.map(_.product)
    val predictedMovies: Array[Int] = topKRecs.map(_.product)
    val apk10: Double = avgPrecisionK(actualMovies, predictedMovies, 10)
    println("APK for user 789: " + apk10)

    /*Collect the factor vectors of all items in the model and turn them into a matrix*/
    val itemFactors: Array[Array[Double]] = model.productFeatures.map{case (id, factor) => factor}.collect()
    val itemMatrix: DoubleMatrix = new DoubleMatrix(itemFactors)
//    println(itemMatrix.rows, itemMatrix.columns)

    /*Compute the model's predicted score of every movie for every user*/
    val allRecs = model.userFeatures.map{ case(userId, factor) => {
      val userVector = new DoubleMatrix(factor)
      /**
        * scores is a DoubleMatrix with N rows and 1 column: one predicted score per item.
        * Why does multiplying these two matrices give the scores?
        * Because ALS factorizes the user-item rating matrix into two factor
        * matrices (user factors and item factors), so their product
        * reconstructs the predicted ratings.
        */
      val scores = itemMatrix.mmul(userVector)//Multiply the item matrix by the user vector: this user's score for every item
      //Sort by score in descending order
      val sortedWithId = scores.data.zipWithIndex.sortBy(-_._1)
      //Convert the 0-based matrix index into the 1-based movie ID
      val recommendIds = sortedWithId.map(_._2 + 1).toSeq
      //Return (userId, ranked list of recommended movie IDs)
      (userId, recommendIds)
    }}

    /*Group the movies each user has actually rated, keyed by user ID*/
    val userMoives: RDD[(Int, Iterable[(Int, Int)])] = ratings.map{ case Rating(user, product, rating) => {
      (user, product)
    }}.groupBy(_._1)

    val MAPK = allRecs.join(userMoives).map{ case( userId, (predicted, actualWithIds) ) => {
      //IDs of the movies the user actually rated
      val actual = actualWithIds.map(_._2).toSeq
      avgPrecisionK(actual, predicted, 10)
    }}.reduce(_ + _) / allRecs.count

    println("MAPK: " + MAPK)


    println("\n-----------------------\n")

    /**
      * Use MLlib's built-in evaluator
      */
    /*RMSE And MSE*/
    val predictedAndTrue: RDD[(Double, Double)] = ratingsAndPredictions.map{ case((userID, product),(actual, predict)) => (actual, predict)}
    val regressionMetrics: RegressionMetrics = new RegressionMetrics(predictedAndTrue)
    println("Use built-in computing MSE: " + regressionMetrics.meanSquaredError)
    println("Use built-in computing RMSE: " + regressionMetrics.rootMeanSquaredError)

    /*MAPK*/
    val predictedAndTrueForRanking = allRecs.join(userMoives).map{ case( userId, (predicted, actualWithIds) ) => {
      //IDs of the movies the user actually rated
      val actual = actualWithIds.map(_._2)
      (actual.toArray, predicted.toArray)
    }}
    val rankingMetrics: RankingMetrics[Int] = new RankingMetrics(predictedAndTrueForRanking)
    println("Use built-in computing MAP: " + rankingMetrics.meanAveragePrecision)


  }
}
