Comparison of updateStateByKey and mapWithState

Keywords: Big Data Spark network Apache

What is a state management function

The state management functions in Spark Streaming, updateStateByKey and mapWithState, maintain state per key across the whole stream. They group the data in a DStream by key and, as each batch brings new or updated data, fold it into the state accumulated so far. The state can take whatever shape the user wants.
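
To make the idea concrete, here is a minimal sketch in plain Scala with no Spark dependency. It only illustrates folding batches into per-key state; the names StateSketch, foldBatch and run are made up for this illustration and are not Spark APIs.

```scala
object StateSketch {
  // Fold one batch of (key, value) pairs into the state accumulated so far.
  def foldBatch(state: Map[String, Int], batch: Seq[(String, Int)]): Map[String, Int] =
    batch.foldLeft(state) { case (acc, (key, value)) =>
      acc.updated(key, acc.getOrElse(key, 0) + value)
    }

  // Apply the batches in arrival order, starting from an empty state.
  def run(batches: Seq[Seq[(String, Int)]]): Map[String, Int] =
    batches.foldLeft(Map.empty[String, Int])(foldBatch)
}
```

Calling StateSketch.run with two batches, the first holding one "a" and one "b" and the second holding one more "a", leaves "a" at 2 and "b" at 1.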


updateStateByKey maintains the state of every key and returns the state of all keys on each batch interval, whether or not new data arrived. It applies the same update function to every existing key and to each new key. If the update function returns None, the state for that key is deleted. The state can be of any data type.

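// Test input: [root@bda3 ~]# nc -lk 9999
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StatefulWordCountApp {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("StatefulWordCountApp")
    val ssc = new StreamingContext(sparkConf, Seconds(10))
    // Note: updateStateByKey requires a checkpoint directory.
    // "checkpoint" is an assumed path; the original does not show one.
    ssc.checkpoint("checkpoint")
    val lines = ssc.socketTextStream("bda3", 9999)
    // Split each line into words and keep a running count per word
    val counts = lines.flatMap(_.split(" ")).map((_, 1)).updateStateByKey[Int](updateFunction _)
    counts.print()
    ssc.start()  // Required: nothing runs until the context is started
    ssc.awaitTermination()
  }

  /* State update function
  * @param currentValues  values received for the key in the current batch
  * @param preValues      previous state of the key, if any
  * */
  def updateFunction(currentValues: Seq[Int], preValues: Option[Int]): Option[Int] = {
    val curr = currentValues.sum   // Sum of all values in the batch for this key
    val pre = preValues.getOrElse(0)  // Previous state, or 0 for a new key
    Some(curr + pre)
  }
}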
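
The behavior described above (the update function runs for every known key on every batch, and returning None deletes the key) can be imitated without Spark. The following is only a sketch under that assumption: UpdateStateByKeySketch, step and sumOrExpire are hypothetical names, and the real operator works on distributed RDDs rather than an in-memory Map.

```scala
object UpdateStateByKeySketch {
  // Same shape as the update function passed to updateStateByKey.
  type Update = (Seq[Int], Option[Int]) => Option[Int]

  // One batch interval: the update function runs for every key that already
  // has state and for every new key in the batch; a None result drops the key.
  def step(state: Map[String, Int], batch: Seq[(String, Int)], update: Update): Map[String, Int] = {
    val grouped = batch.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }
    val keys = state.keySet ++ grouped.keySet
    keys.flatMap { k =>
      update(grouped.getOrElse(k, Seq.empty), state.get(k)).map(v => k -> v)
    }.toMap
  }

  // Hypothetical update function: keep a running sum, but forget a key
  // as soon as a batch brings no values for it.
  val sumOrExpire: Update = (values, previous) =>
    if (values.isEmpty) None else Some(values.sum + previous.getOrElse(0))
}
```

With sumOrExpire, a key that receives no values in a batch disappears from the state, which is the deletion rule in miniature. The updateFunction in the listing above never returns None, so there every key is kept forever.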


mapWithState also maintains the state of every key, but when no data arrives for a key it does not return that key's previous state. Its output is therefore incremental: only keys that received data in the current batch appear.

import org.apache.spark.SparkConf
import org.apache.spark.streaming._

/**
 * Counts words cumulatively in UTF8 encoded, '\n' delimited text received from the network every
 * second starting with initial value of word count.
 * Usage: StatefulNetworkWordCount <hostname> <port>
 *   <hostname> and <port> describe the TCP server that Spark Streaming would connect to receive
 *   data.
 *
 * To run this on your local machine, you need to first run a Netcat server
 *    `$ nc -lk 9999`
 * and then run the example
 *    `$ bin/run-example
 *      org.apache.spark.examples.streaming.StatefulNetworkWordCount localhost 9999`
 */
object StatefulNetworkWordCount {
  def main(args: Array[String]): Unit = {
    if (args.length < 2) {
      System.err.println("Usage: StatefulNetworkWordCount <hostname> <port>")
      System.exit(1)
    }

    val sparkConf = new SparkConf().setAppName("StatefulNetworkWordCount")
    // Create the context with a 1 second batch size
    val ssc = new StreamingContext(sparkConf, Seconds(1))
    // mapWithState also needs a checkpoint directory
    ssc.checkpoint(".")

    // Initial state RDD for mapWithState operation
    val initialRDD = ssc.sparkContext.parallelize(List(("hello", 1), ("world", 1)))

    // Create a ReceiverInputDStream on target ip:port and count the
    // words in input stream of \n delimited text (e.g. generated by 'nc')
    val lines = ssc.socketTextStream(args(0), args(1).toInt)
    val words = lines.flatMap(_.split(" "))
    val wordDstream = words.map(x => (x, 1))

    // Update the cumulative count using mapWithState
    // This will give a DStream made of state (which is the cumulative count of the words)
    val mappingFunc = (word: String, one: Option[Int], state: State[Int]) => {
      val sum = one.getOrElse(0) + state.getOption.getOrElse(0)
      val output = (word, sum)
      state.update(sum)  // Store the new count as the key's state
      output             // Emitted record for this key
    }

    val stateDstream = wordDstream.mapWithState(
      StateSpec.function(mappingFunc).initialState(initialRDD))
    stateDstream.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
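
The incremental output can be seen in a small Spark-free imitation. This is a sketch only: MapWithStateSketch and step are hypothetical names, the mapping logic is hard-coded to the word-count sum used above, and features of the real operator such as timeouts are left out.

```scala
object MapWithStateSketch {
  // One batch interval: the mapping logic runs only for records present in
  // the batch. Returns the updated state and the records emitted this batch.
  def step(
      state: Map[String, Int],
      batch: Seq[(String, Int)]): (Map[String, Int], Seq[(String, Int)]) =
    batch.foldLeft((state, Vector.empty[(String, Int)])) {
      case ((st, out), (word, one)) =>
        val sum = one + st.getOrElse(word, 0)
        (st.updated(word, sum), out :+ (word -> sum))
    }
}
```

Starting from the same initial state as the example ("hello" and "world" both at 1), a batch that contains only "hello" emits a single record for "hello"; "world" stays in the state but is not emitted. An empty batch emits nothing at all.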

Differences between updateStateByKey and mapWithState

updateStateByKey returns the state of all keys on every batch interval: new, changed, and unchanged alike. Because updateStateByKey requires checkpointing, the checkpoints grow large when the amount of data is large, which hurts performance and makes it inefficient.

mapWithState only returns the values of keys that changed. The advantage is that we only need to care about the keys that changed; when no data arrives for a key, its unchanged state is not returned. As a result, even with a large amount of data, checkpointing does not consume as much storage as with updateStateByKey and is more efficient (mapWithState is the recommended choice in a production environment).

Scenarios applicable

updateStateByKey can be used to compute statistics over historical data, for example the average amount users spend in different time periods, the number of purchases, the total amount spent, or website visits in different time periods.
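
As an illustration of such a statistic, the sketch below keeps a purchase count and a running total per user, from which the average follows. The Stats case class and the SpendingStats object are hypothetical; only the shape of update, (Seq[V], Option[S]) => Option[S], matches what updateStateByKey expects.

```scala
object SpendingStats {
  // State kept per user: number of purchases and total amount spent.
  final case class Stats(count: Long, total: Double) {
    def average: Double = if (count == 0) 0.0 else total / count
  }

  // Same shape as an updateStateByKey update function: fold the amounts of
  // the current batch into the previous state (or a fresh one for a new user).
  def update(amounts: Seq[Double], previous: Option[Stats]): Option[Stats] = {
    val prev = previous.getOrElse(Stats(0L, 0.0))
    Some(Stats(prev.count + amounts.size, prev.total + amounts.sum))
  }
}
```

In a streaming job, a function like this would be passed to updateStateByKey on a DStream of (userId, amount) pairs, assuming that is how the purchase events are keyed.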

mapWithState suits scenarios that demand real-time responses with low latency, such as returning the balance of your account right after you pay for a purchase on an online shopping platform.

Posted by w00kie on Mon, 01 Jun 2020 20:53:52 -0700