This article builds on the content and code of the previous article, so it is worth reading that one first ~
In the last article we implemented movie recommendation, but are the recommendations reasonable? To find out, we need to evaluate the model.
Recommendation models are usually evaluated with the mean squared error (MSE) and the mean average precision at K (MAPK), and MLlib provides built-in functions for both of these evaluation methods.
In practice, you repeatedly try different values for the three key parameters of the recommendation model (rank, iterations, and lambda), evaluate the model that each combination produces, and keep the best one.
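The parameter search described above can be sketched in plain Scala. This is only an illustration under assumptions: `Params`, `grid`, `best`, and the toy error function are hypothetical names that do not appear in the original code. In a real Spark job, `errorOf` would train a model with `ALS.train(trainingSet, p.rank, p.iterations, p.lambda)` and return its RMSE, ideally measured on ratings held out from training (the listings in this article evaluate on the full data set instead).

```scala
object GridSearchSketch {
  // One candidate setting of the three ALS parameters named in the text.
  final case class Params(rank: Int, iterations: Int, lambda: Double)

  // Every combination of the candidate values.
  def grid(ranks: Seq[Int], iterations: Seq[Int], lambdas: Seq[Double]): Seq[Params] =
    for (r <- ranks; i <- iterations; l <- lambdas) yield Params(r, i, l)

  // Score every candidate and keep the one with the lowest error.
  // errorOf is a placeholder for "train a model and evaluate it".
  def best(candidates: Seq[Params])(errorOf: Params => Double): (Params, Double) =
    candidates.map(p => (p, errorOf(p))).minBy(_._2)

  def main(args: Array[String]): Unit = {
    val candidates = grid(Seq(10, 50), Seq(10), Seq(0.01, 0.1))
    // Toy stand-in for the real evaluation, only so the sketch runs without Spark.
    val toyError: Params => Double =
      p => math.abs(p.rank - 50) / 100.0 + math.abs(p.lambda - 0.1)
    val (params, error) = best(candidates)(toyError)
    println(s"best = $params, error = $error")
  }
}
```

Passing the evaluation in as a function keeps the search loop independent of how the model is trained, so the same loop can be reused with MSE, RMSE, or MAPK as the criterion.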
Here are the two methods for evaluating a recommendation model ~
1. Mean Squared Error (MSE) and Root Mean Squared Error (RMSE)
Definition: MSE is the sum of the squared errors divided by the total number of ratings, where each error is the difference between the predicted rating and the true rating.
The root mean squared error is also widely used; it is simply the square root of the MSE ~
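In symbols, the two definitions can be written as follows. The notation is an assumption introduced here for illustration: r_{ui} is the true rating of user u for item i, \hat{r}_{ui} is the predicted rating, and N is the number of (user, item) pairs being evaluated.

```latex
\[
\mathrm{MSE} = \frac{1}{N} \sum_{(u,i)} \left( r_{ui} - \hat{r}_{ui} \right)^2,
\qquad
\mathrm{RMSE} = \sqrt{\mathrm{MSE}}
\]
```

Because RMSE is in the same unit as the ratings themselves, it is often the easier of the two to interpret.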
The evaluation code is:
// Format: (userID, movieID)
val userProducts: RDD[(Int, Int)] = ratings.map(rating => (rating.user, rating.product))
// Ratings predicted by the model, in the format ((userID, movieID), predicted rating)
val predictions: RDD[((Int, Int), Double)] = model.predict(userProducts)
  .map(rating => ((rating.user, rating.product), rating.rating))
// Format: ((userID, movieID), (actual rating, predicted rating))
val ratingsAndPredictions: RDD[((Int, Int), (Double, Double))] = ratings
  .map(rating => ((rating.user, rating.product), rating.rating))
  .join(predictions)
// Mean squared error
val MSE = ratingsAndPredictions.map(rap => math.pow(rap._2._1 - rap._2._2, 2)).reduce(_ + _) / ratingsAndPredictions.count()
println("MSE: " + MSE)
// Root mean squared error
val RMSE: Double = math.sqrt(MSE)
println("RMSE: " + RMSE)
That is the calculation done by hand. The same metrics can also be computed with the functions built into MLlib:
import org.apache.spark.mllib.evaluation.{RegressionMetrics, RankingMetrics}

// RegressionMetrics expects (prediction, observation) pairs
val predictedAndTrue: RDD[(Double, Double)] = ratingsAndPredictions.map {
  case ((userID, product), (actual, predicted)) => (predicted, actual)
}
val regressionMetrics: RegressionMetrics = new RegressionMetrics(predictedAndTrue)
println("MSE: " + regressionMetrics.meanSquaredError)
println("RMSE: " + regressionMetrics.rootMeanSquaredError)
The output is:
MSE: 0.08231947642632852
RMSE: 0.2869137090247319
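To make the arithmetic concrete, here is a minimal sketch of the same two metrics on a handful of made-up ratings, in plain Scala without Spark. The object name `MseSketch` and the sample values are assumptions chosen only so the numbers are easy to check by hand.

```scala
object MseSketch {
  // Each pair is (actual rating, predicted rating).
  def mse(pairs: Seq[(Double, Double)]): Double =
    pairs.map { case (actual, predicted) => math.pow(actual - predicted, 2) }.sum / pairs.size

  // RMSE is the square root of MSE.
  def rmse(pairs: Seq[(Double, Double)]): Double = math.sqrt(mse(pairs))

  def main(args: Array[String]): Unit = {
    // Made-up ratings: the squared errors are 0.25, 0.25, 1.0 and 0.0.
    val pairs = Seq((4.0, 3.5), (2.0, 2.5), (5.0, 4.0), (3.0, 3.0))
    println("MSE: " + mse(pairs)) // (0.25 + 0.25 + 1.0 + 0.0) / 4 = 0.375
    println("RMSE: " + rmse(pairs))
  }
}
```

Note how the single error of 1.0 contributes two thirds of the total: squaring penalizes large misses much more than small ones.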
2. Mean Average Precision at K (MAPK)
Mean average precision at K (MAPK) is the mean of the average precision at K (APK) over all users in the dataset. APK is a common metric in information retrieval. It measures the average relevance of the top K documents returned for a query.
The more relevant the returned documents actually are, and the higher the relevant ones are ranked, the higher the APK score. For a recommender, each user plays the role of a query and the recommendation list is the result, so the model is better if the items it scores highest (and therefore ranks highest in the recommendation list) really are the most relevant to the user.
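A small worked example may help here. The sketch below computes APK for one user and MAPK as the mean over users, in plain Scala without Spark. The object name `ApkSketch`, the helper `meanAvgPrecisionK`, and the sample item IDs are assumptions made for illustration.

```scala
object ApkSketch {
  // Average precision at K for one user.
  // actual: items the user really rated; predicted: recommended items, best first.
  def avgPrecisionK(actual: Seq[Int], predicted: Seq[Int], k: Int): Double = {
    val actualSet = actual.toSet
    var hits = 0
    var score = 0.0
    for ((item, i) <- predicted.take(k).zipWithIndex if actualSet.contains(item)) {
      hits += 1
      // Precision at this position: hits so far over the (i + 1) items seen so far.
      score += hits.toDouble / (i + 1)
    }
    // At most min(actual.size, k) hits are possible, so a perfect list scores 1.0.
    if (actual.isEmpty) 1.0 else score / math.min(actual.size, k)
  }

  // MAPK: the mean of APK over all (actual, predicted) pairs, one pair per user.
  def meanAvgPrecisionK(users: Seq[(Seq[Int], Seq[Int])], k: Int): Double =
    users.map { case (actual, predicted) => avgPrecisionK(actual, predicted, k) }.sum / users.size

  def main(args: Array[String]): Unit = {
    // Hits at positions 1 and 3: (1/1 + 2/3) / min(3, 4) = 5/9
    println(avgPrecisionK(Seq(1, 3, 5), Seq(1, 2, 3, 4), 4))
    // Both relevant items come first: (1/1 + 2/2) / min(2, 3) = 1.0
    println(avgPrecisionK(Seq(1, 2), Seq(1, 2, 9), 3))
  }
}
```

Dividing each hit by its position rewards relevant items that appear early, and dividing the total by min(actual.size, k) keeps the score between 0 and 1 even when a user has fewer than K relevant items.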
OK, the MAPK evaluation code is as follows:
package ml

import org.apache.spark.mllib.evaluation.RankingMetrics
import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}
import org.jblas.DoubleMatrix
import sql.StreamingExamples

object MAPKTest {
  def main(args: Array[String]): Unit = {
    // Project-local helper (package sql) that adjusts the log level
    StreamingExamples.setStreamingLogLevels()
    val conf = new SparkConf().setAppName("MAPKTest").setMaster("local[*]")
    val sc = new SparkContext(conf)

    /* User movie ratings */
    val rawData: RDD[String] = sc.textFile("file:///E:/spark/ml-100k/u.data")
    // Drop the timestamp field; each record becomes Array(user, movie, rating)
    val rawRatings = rawData.map(_.split("\\t").take(3))
    val ratings = rawRatings.map { case Array(user, movie, rating) =>
      Rating(user.toInt, movie.toInt, rating.toDouble)
    }

    /**
     * Train the model
     * Note: 50 is the rank, i.e. the number of latent factors per user and per item;
     * 10 is the number of iterations and 0.01 is lambda
     */
    val model = ALS.train(ratings, 50, 10, 0.01)

    /* Collect the factor vector of every item and stack them into a matrix, one row per item */
    val itemFactorsWithIds: Array[(Int, Array[Double])] = model.productFeatures.collect()
    // Row i of the matrix belongs to item itemIds(i); the collected order is not guaranteed to follow the IDs
    val itemIds: Array[Int] = itemFactorsWithIds.map(_._1)
    val itemMatrix: DoubleMatrix = new DoubleMatrix(itemFactorsWithIds.map(_._2))
    // println(itemMatrix.rows, itemMatrix.columns)

    /* For each user, rank all movies by the rating the model predicts */
    val allRecs = model.userFeatures.map { case (userId, factor) =>
      val userVector = new DoubleMatrix(factor)
      /**
       * scores is a DoubleMatrix column vector with one entry per item
       * Why does this product give the scores?
       * ALS factorizes the user-item rating matrix into a user-factor matrix and an item-factor matrix,
       * so the product of an item's factors and a user's factors is the predicted rating
       */
      val scores = itemMatrix.mmul(userVector)
      // Sort by score in descending order: (score, row index)
      val sortedWithIndex = scores.data.zipWithIndex.sortBy(-_._1)
      val recommendIds = sortedWithIndex.map { case (_, index) => itemIds(index) }.toSeq
      // Tuple: (userId, item IDs ordered from highest to lowest predicted score)
      (userId, recommendIds)
    }

    /* For each user, collect the movies the user actually rated */
    val userMovies: RDD[(Int, Iterable[(Int, Int)])] = ratings.map { case Rating(user, product, rating) =>
      (user, product)
    }.groupBy(_._1)

    val predictedAndTrueForRanking = allRecs.join(userMovies).map { case (userId, (predicted, actualWithIds)) =>
      // IDs of the items the user actually rated
      val actual = actualWithIds.map(_._2)
      // RankingMetrics expects (predicted ranking, ground-truth set) pairs
      (predicted.toArray, actual.toArray)
    }
    val rankingMetrics: RankingMetrics[Int] = new RankingMetrics(predictedAndTrueForRanking)
    println("Use built-in computing MAP: " + rankingMetrics.meanAveragePrecision)
  }
}
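The output below was recorded with the pair passed to RankingMetrics as (actual, predicted). RankingMetrics takes (predicted ranking, ground-truth set), so with that order the value will differ:
Use built-in computing MAP: 0.0630466936422453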
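The listing above ranks items by multiplying the item-factor matrix with a user's factor vector. The following sketch shows the same idea on tiny hand-made vectors, in plain Scala without Spark or jblas. The object name `FactorScoringSketch`, the helpers, and all factor values are assumptions for illustration; real factors come from `model.userFeatures` and `model.productFeatures`.

```scala
object FactorScoringSketch {
  // Predicted score = dot product of a user factor vector and an item factor vector.
  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => x * y }.sum

  // Rank all item IDs for one user by predicted score, highest first.
  // Each ID travels with its factor vector, so the result does not depend on input order.
  def rankItems(user: Array[Double], items: Seq[(Int, Array[Double])]): Seq[Int] =
    items.map { case (id, factor) => (id, dot(user, factor)) }
      .sortBy { case (_, score) => -score }
      .map(_._1)

  def main(args: Array[String]): Unit = {
    val user = Array(1.0, 0.5)
    val items = Seq(10 -> Array(0.2, 0.4), 20 -> Array(0.9, 0.1), 30 -> Array(0.1, 0.1))
    // Scores: item 10 -> 0.4, item 20 -> 0.95, item 30 -> 0.15
    println(rankItems(user, items)) // List(20, 10, 30)
  }
}
```

With a rank of 50, each vector would simply have 50 entries instead of 2; the ranking step stays the same.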
3. Complete code for the recommendation model
package ml

import org.apache.spark.mllib.evaluation.{RankingMetrics, RegressionMetrics}
import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}
import org.jblas.DoubleMatrix
import sql.StreamingExamples

import scala.collection.Map

/**
 * Recommendation algorithm based on Spark MLlib
 * ALS: alternating least squares
 *
 * @author lwj
 * @date 2018/05/04
 */
object Recommend {

  /**
   * Used for item-based recommendation
   * Returns the cosine similarity of the two vectors passed in
   *
   * @param vec1
   * @param vec2
   * @return
   */
  def cosineSimilarity(vec1: DoubleMatrix, vec2: DoubleMatrix): Double = {
    vec1.dot(vec2) / (vec1.norm2() * vec2.norm2())
  }

  /**
   * Model evaluation
   * Average precision at K (APK)
   *
   * @param actual
   * @param predicted
   * @param k
   * @return
   */
  def avgPrecisionK(actual: Seq[Int], predicted: Seq[Int], k: Int): Double = {
    val predK: Seq[Int] = predicted.take(k)
    var score = 0.0
    var numHits = 0.0
    for ((p, i) <- predK.zipWithIndex) {
      if (actual.contains(p)) {
        numHits += 1.0
        // i is zero-based, so i + 1 items have been seen so far: this term is the precision at that position
        score += numHits / (i.toDouble + 1.0)
      }
    }
    if (actual.isEmpty) {
      1.0
    } else {
      // At most min(actual.size, k) hits are possible, so a perfect list scores 1.0
      score / math.min(actual.size, k).toDouble
    }
  }

  def main(args: Array[String]): Unit = {
    // Project-local helper (package sql) that adjusts the log level
    StreamingExamples.setStreamingLogLevels()
    val conf = new SparkConf().setAppName("recommandTest").setMaster("local[*]")
    val sc = new SparkContext(conf)

    /* User movie ratings */
    val rawData: RDD[String] = sc.textFile("file:///E:/spark/ml-100k/u.data")
    // Drop the timestamp field; each record becomes Array(user, movie, rating)
    val rawRatings = rawData.map(_.split("\\t").take(3))
    val ratings = rawRatings.map { case Array(user, movie, rating) =>
      Rating(user.toInt, movie.toInt, rating.toDouble)
    }
    // Movies
    val movies: RDD[String] = sc.textFile("file:///E:/spark/ml-100k/u.item")
    // Movie ID -> movie title
    val titles: Map[Int, String] = movies.map(_.split("\\|").take(2)).map(array => (array(0).toInt, array(1))).collectAsMap()

    /**
     * Train the model
     * Note: 50 is the rank, i.e. the number of latent factors per user and per item;
     * 10 is the number of iterations and 0.01 is lambda
     */
    val model = ALS.train(ratings, 50, 10, 0.01)

    /**
     * User-based recommendation
     */
    // Number of user factors
    // println(model.userFeatures.count())
    // Number of item factors
    // println(model.productFeatures.count())
    // Predicted rating of one user for one item. ALS initializes the model randomly, so results may differ between runs
    // println(model.predict(789, 123))

    // Recommend the top K items to the given user
    val userID = 789
    val K = 10
    val topKRecs: Array[Rating] = model.recommendProducts(userID, K)
    // println(topKRecs.mkString("\n"))
    // Movies rated by the given user
    val moviesForUser: Seq[Rating] = ratings.keyBy(_.user).lookup(userID)
    // Print the titles and ratings of the 10 movies the user rated highest
    println("Actual:")
    moviesForUser.sortBy(-_.rating).take(10).map(rating => {
      (titles(rating.product), rating.rating)
    }).foreach(println)
    // Print the titles and predicted ratings of the 10 recommended movies, to compare with the list above
    println("Recommended:")
    topKRecs.map(rating => {
      (titles(rating.product), rating.rating)
    }).foreach(println)

    println("\n-----------------------\n")

    /**
     * Item-based recommendation
     */
    /* Find the items most similar to the item with the given ID */
    val itemId = 567
    val itemFactor: Array[Double] = model.productFeatures.lookup(itemId).head
    val itemVector: DoubleMatrix = new DoubleMatrix(itemFactor)
    // Cosine similarity of every item to the given item
    val sims = model.productFeatures.map { case (id, factor) =>
      val factorVector = new DoubleMatrix(factor)
      val sim = cosineSimilarity(factorVector, itemVector)
      (id, sim)
    }
    // Top N items; take one extra because the most similar item is the given item itself
    val topItem: Array[(Int, Double)] = sims.sortBy(-_._2).take(10 + 1)
    println("Commodities similar to 567:\n" + topItem.mkString("\n") + "\n")
    /* Check the items by title */
    println("Given commodity name is: " + titles(itemId))
    println("Similar product names are:")
    topItem.slice(1, 11).foreach(item => println(titles(item._1)))

    println("\n-----------------------\n")

    /* Model evaluation */
    /**
     * Mean squared error evaluation
     * Evaluates the model on the full data set
     */
    // val actualRating: Rating = moviesForUser.take(1)(0)
    // val predictedRating: Double = model.predict(789, actualRating.product)
    // println("\n True score:" + actualRating.rating + " Forecast score:" + predictedRating)
    // Format: (userID, movieID)
    val userProducts: RDD[(Int, Int)] = ratings.map(rating => (rating.user, rating.product))
    // Ratings predicted by the model, in the format ((userID, movieID), predicted rating)
    val predictions: RDD[((Int, Int), Double)] = model.predict(userProducts)
      .map(rating => ((rating.user, rating.product), rating.rating))
    // Format: ((userID, movieID), (actual rating, predicted rating))
    val ratingsAndPredictions: RDD[((Int, Int), (Double, Double))] = ratings
      .map(rating => ((rating.user, rating.product), rating.rating))
      .join(predictions)
    // Mean squared error
    val MSE = ratingsAndPredictions.map(rap => math.pow(rap._2._1 - rap._2._2, 2)).reduce(_ + _) / ratingsAndPredictions.count()
    println("MSE: " + MSE)
    // Root mean squared error
    val RMSE: Double = math.sqrt(MSE)
    println("RMSE: " + RMSE)

    /**
     * Average precision at K evaluation
     * Note: this measures how well the model predicts the items a user is interested in and will interact with,
     * i.e. it evaluates the user-based recommendations
     */
    /* APK of the recommendations for the single given user */
    val actualMovies: Seq[Int] = moviesForUser.map(_.product)
    val predictedMovies: Array[Int] = topKRecs.map(_.product)
    val apk10: Double = avgPrecisionK(actualMovies, predictedMovies, 10)
    println("APK for user 789: " + apk10)

    /* Collect the factor vector of every item and stack them into a matrix, one row per item */
    val itemFactorsWithIds: Array[(Int, Array[Double])] = model.productFeatures.collect()
    // Row i of the matrix belongs to item itemIds(i); the collected order is not guaranteed to follow the IDs
    val itemIds: Array[Int] = itemFactorsWithIds.map(_._1)
    val itemMatrix: DoubleMatrix = new DoubleMatrix(itemFactorsWithIds.map(_._2))
    // println(itemMatrix.rows, itemMatrix.columns)

    /* For each user, rank all movies by the rating the model predicts */
    val allRecs = model.userFeatures.map { case (userId, factor) =>
      val userVector = new DoubleMatrix(factor)
      /**
       * scores is a DoubleMatrix column vector with one entry per item
       * Why does this product give the scores?
       * ALS factorizes the user-item rating matrix into a user-factor matrix and an item-factor matrix,
       * so the product of an item's factors and a user's factors is the predicted rating
       */
      val scores = itemMatrix.mmul(userVector)
      // Sort by score in descending order: (score, row index)
      val sortedWithIndex = scores.data.zipWithIndex.sortBy(-_._1)
      val recommendIds = sortedWithIndex.map { case (_, index) => itemIds(index) }.toSeq
      // Tuple: (userId, item IDs ordered from highest to lowest predicted score)
      (userId, recommendIds)
    }

    /* For each user, collect the movies the user actually rated */
    val userMovies: RDD[(Int, Iterable[(Int, Int)])] = ratings.map { case Rating(user, product, rating) =>
      (user, product)
    }.groupBy(_._1)

    val MAPK = allRecs.join(userMovies).map { case (userId, (predicted, actualWithIds)) =>
      // IDs of the items the user actually rated
      val actual = actualWithIds.map(_._2).toSeq
      avgPrecisionK(actual, predicted, 10)
    }.reduce(_ + _) / allRecs.count
    println("MAPK: " + MAPK)

    println("\n-----------------------\n")

    /**
     * Use MLlib's built-in evaluators
     */
    /* RMSE and MSE */
    // RegressionMetrics expects (prediction, observation) pairs
    val predictedAndTrue: RDD[(Double, Double)] = ratingsAndPredictions.map {
      case ((userID, product), (actual, predicted)) => (predicted, actual)
    }
    val regressionMetrics: RegressionMetrics = new RegressionMetrics(predictedAndTrue)
    println("Use built-in computing MSE: " + regressionMetrics.meanSquaredError)
    println("Use built-in computing RMSE: " + regressionMetrics.rootMeanSquaredError)

    /* MAPK */
    val predictedAndTrueForRanking = allRecs.join(userMovies).map { case (userId, (predicted, actualWithIds)) =>
      // IDs of the items the user actually rated
      val actual = actualWithIds.map(_._2)
      // RankingMetrics expects (predicted ranking, ground-truth set) pairs
      (predicted.toArray, actual.toArray)
    }
    val rankingMetrics: RankingMetrics[Int] = new RankingMetrics(predictedAndTrueForRanking)
    println("Use built-in computing MAP: " + rankingMetrics.meanAveragePrecision)
  }
}