Akka (23): Stream: Custom defined stream processing stages

Keywords: Scala Java

Broadly speaking, akka-stream is built from three kinds of stream components: the data source Source, the processing node Flow and the data endpoint Sink. Source and Sink are the two independent endpoints of a stream, while a Flow can be made up of any number of processing nodes between the Source and the Sink. Each node represents a transformation applied to the stream's elements, and the order in which the nodes are linked describes the overall processing pipeline. A complete (runnable) stream must be closed: seen from the outside, its two ends must be a Source and a Sink. Connecting a Sink directly to a Source gives the simplest runnable stream:

  Source(1 to 10).runWith(Sink.foreach(println))

From another point of view, akka-stream consists of the dataflow graph Graph and the Materializer that runs it. A Graph is the blueprint of a computation. The Materializer prepares the runtime environment, places the blueprint into the actor system to execute it, and obtains the result. So every akka-stream program needs a Graph that describes its functions and processing flow. Each Graph can be composed of sub-graphs representing smaller functions, and a runnable stream must be represented by a closed graph, which in turn consists of sub-graphs implementing the individual transformations. Customizing stream processing therefore means defining a custom Graph that meets the functional requirements.
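To make the split between blueprint and execution concrete, here is a minimal sketch, assuming the Akka 2.5-style ActorMaterializer used elsewhere in this article. The names BlueprintDemo, blueprint and runOnce are made up for illustration and are not part of akka-stream.

```scala
import akka.actor.ActorSystem
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.{Keep, RunnableGraph, Sink, Source}
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._

// Hypothetical demo object; not part of akka-stream.
object BlueprintDemo {
  // A closed graph (Source + Sink). Building it runs nothing: it is only a blueprint.
  val blueprint: RunnableGraph[Future[Int]] =
    Source(1 to 10).toMat(Sink.fold[Int, Int](0)(_ + _))(Keep.right)

  // Materializing the blueprint is what actually starts the stream.
  def runOnce(): Int = {
    implicit val sys: ActorSystem = ActorSystem("blueprintDemo")
    implicit val mat: ActorMaterializer = ActorMaterializer()
    try Await.result(blueprint.run(), 5.seconds) // sum of the elements 1..10
    finally sys.terminate()
  }
}
```

Because the blueprint is immutable, the same RunnableGraph value could be run several times, each run producing its own materialized Future.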

A Graph is described by a shape and a GraphStage: the shape describes the number of input and output ports of the Graph, and the GraphStage describes how data is transformed as it flows through. Let's look at shapes first. Their base class is Shape:

/**
 * A Shape describes the inlets and outlets of a [[Graph]]. In keeping with the
 * philosophy that a Graph is a freely reusable blueprint, everything that
 * matters from the outside are the connections that can be made with it,
 * otherwise it is just a black box.
 */
abstract class Shape {
  /**
   * Scala API: get a list of all input ports
   */
  def inlets: immutable.Seq[Inlet[_]]

  /**
   * Scala API: get a list of all output ports
   */
  def outlets: immutable.Seq[Outlet[_]]

  /**
   * Create a copy of this Shape object, returning the same type as the
   * original; this constraint can unfortunately not be expressed in the
   * type system.
   */
  def deepCopy(): Shape
...}

Subclasses of Shape must implement these three abstract methods. akka-stream ships with several basic shapes, including SourceShape, FlowShape and SinkShape:

/**
 * A Source [[Shape]] has exactly one output and no inputs, it models a source
 * of data.
 */
final case class SourceShape[+T](out: Outlet[T @uncheckedVariance]) extends Shape {
  override val inlets: immutable.Seq[Inlet[_]] = EmptyImmutableSeq
  override val outlets: immutable.Seq[Outlet[_]] = out :: Nil

  override def deepCopy(): SourceShape[T] = SourceShape(out.carbonCopy())
}
object SourceShape {
  /** Java API */
  def of[T](outlet: Outlet[T @uncheckedVariance]): SourceShape[T] =
    SourceShape(outlet)
}

/**
 * A Flow [[Shape]] has exactly one input and one output, it looks from the
 * outside like a pipe (but it can be a complex topology of streams within of
 * course).
 */
final case class FlowShape[-I, +O](in: Inlet[I @uncheckedVariance], out: Outlet[O @uncheckedVariance]) extends Shape {
  override val inlets: immutable.Seq[Inlet[_]] = in :: Nil
  override val outlets: immutable.Seq[Outlet[_]] = out :: Nil

  override def deepCopy(): FlowShape[I, O] = FlowShape(in.carbonCopy(), out.carbonCopy())
}
object FlowShape {
  /** Java API */
  def of[I, O](inlet: Inlet[I @uncheckedVariance], outlet: Outlet[O @uncheckedVariance]): FlowShape[I, O] =
    FlowShape(inlet, outlet)
}

There is also a slightly more complex bidirectional flow shape, BidiShape:

//#bidi-shape
/**
 * A bidirectional flow of elements that consequently has two inputs and two
 * outputs, arranged like this:
 *
 * {{{
 *        +------+
 *  In1 ~>|      |~> Out1
 *        | bidi |
 * Out2 <~|      |<~ In2
 *        +------+
 * }}}
 */
final case class BidiShape[-In1, +Out1, -In2, +Out2](
  in1:  Inlet[In1 @uncheckedVariance],
  out1: Outlet[Out1 @uncheckedVariance],
  in2:  Inlet[In2 @uncheckedVariance],
  out2: Outlet[Out2 @uncheckedVariance]) extends Shape {
  //#implementation-details-elided
  override val inlets: immutable.Seq[Inlet[_]] = in1 :: in2 :: Nil
  override val outlets: immutable.Seq[Outlet[_]] = out1 :: out2 :: Nil

  /**
   * Java API for creating from a pair of unidirectional flows.
   */
  def this(top: FlowShape[In1, Out1], bottom: FlowShape[In2, Out2]) = this(top.in, top.out, bottom.in, bottom.out)

  override def deepCopy(): BidiShape[In1, Out1, In2, Out2] =
    BidiShape(in1.carbonCopy(), out1.carbonCopy(), in2.carbonCopy(), out2.carbonCopy())

  //#implementation-details-elided
}
//#bidi-shape
object BidiShape {
  def fromFlows[I1, O1, I2, O2](top: FlowShape[I1, O1], bottom: FlowShape[I2, O2]): BidiShape[I1, O1, I2, O2] =
    BidiShape(top.in, top.out, bottom.in, bottom.out)

  /** Java API */
  def of[In1, Out1, In2, Out2](
    in1:  Inlet[In1 @uncheckedVariance],
    out1: Outlet[Out1 @uncheckedVariance],
    in2:  Inlet[In2 @uncheckedVariance],
    out2: Outlet[Out2 @uncheckedVariance]): BidiShape[In1, Out1, In2, Out2] =
    BidiShape(in1, out1, in2, out2)

}

There are also the one-to-many UniformFanOutShape and the many-to-one UniformFanInShape. Here is a custom many-to-many shape:

  case class TwoThreeShape[I, I2, O, O2, O3](
                                              in1: Inlet[I],
                                              in2: Inlet[I2],
                                              out1: Outlet[O],
                                              out2: Outlet[O2],
                                              out3: Outlet[O3]) extends Shape {

    override def inlets: immutable.Seq[Inlet[_]] = in1 :: in2 :: Nil

    override def outlets: immutable.Seq[Outlet[_]] = out1 :: out2 :: out3 :: Nil

    override def deepCopy(): Shape = TwoThreeShape(
      in1.carbonCopy(),
      in2.carbonCopy(),
      out1.carbonCopy(),
      out2.carbonCopy(),
      out3.carbonCopy()
    )
  }

This is a two-in, three-out shape. We only need to implement inlets, outlets and deepCopy.
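The three members can be observed on any shape. The following is a small sketch that uses the built-in FlowShape for brevity; the object name ShapeDemo and the port names are made up for illustration. It relies on ports being compared by identity, so a deep copy can be recognized by its fresh port instances.

```scala
import akka.stream.{FlowShape, Inlet, Outlet}

// Hypothetical demo object; not part of akka-stream.
object ShapeDemo {
  val shape: FlowShape[Int, String] =
    FlowShape(Inlet[Int]("demo.in"), Outlet[String]("demo.out"))

  // deepCopy() builds a shape of the same type with newly created ports.
  val copy: FlowShape[Int, String] = shape.deepCopy()

  val inletCount: Int = shape.inlets.size   // one input port
  val outletCount: Int = shape.outlets.size // one output port

  // The copied ports are different instances from the original ones.
  val portsAreFresh: Boolean = (copy.in ne shape.in) && (copy.out ne shape.out)
}
```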

A GraphStage describes the behavior of a stream component. What the component does is defined by how elements flow into and out of it and how they are transformed along the way. The following is the type definition of GraphStage:

/**
 * A GraphStage represents a reusable graph stream processing stage. A GraphStage consists of a [[Shape]] which describes
 * its input and output ports and a factory function that creates a [[GraphStageLogic]] which implements the processing
 * logic that ties the ports together.
 */
abstract class GraphStage[S <: Shape] extends GraphStageWithMaterializedValue[S, NotUsed] {
  final override def createLogicAndMaterializedValue(inheritedAttributes: Attributes): (GraphStageLogic, NotUsed) =
    (createLogic(inheritedAttributes), NotUsed)

  @throws(classOf[Exception])
  def createLogic(inheritedAttributes: Attributes): GraphStageLogic
}

Each component supplies its processing logic by implementing createLogic so that it returns a GraphStageLogic, which is defined as follows:

/**
 * Represents the processing logic behind a [[GraphStage]]. Roughly speaking, a subclass of [[GraphStageLogic]] is a
 * collection of the following parts:
 *  * A set of [[InHandler]] and [[OutHandler]] instances and their assignments to the [[Inlet]]s and [[Outlet]]s
 *    of the enclosing [[GraphStage]]
 *  * Possible mutable state, accessible from the [[InHandler]] and [[OutHandler]] callbacks, but not from anywhere
 *    else (as such access would not be thread-safe)
 *  * The lifecycle hooks [[preStart()]] and [[postStop()]]
 *  * Methods for performing stream processing actions, like pulling or pushing elements
 *
 * The stage logic is completed once all its input and output ports have been closed. This can be changed by
 * setting `setKeepGoing` to true.
 *
 * The `postStop` lifecycle hook on the logic itself is called once all ports are closed. This is the only tear down
 * callback that is guaranteed to happen, if the actor system or the materializer is terminated the handlers may never
 * see any callbacks to `onUpstreamFailure`, `onUpstreamFinish` or `onDownstreamFinish`. Therefore stage resource
 * cleanup should always be done in `postStop`.
 */
abstract class GraphStageLogic private[stream] (val inCount: Int, val outCount: Int) {...}

GraphStageLogic controls how elements are transformed and how they move across the ports, mainly by responding to port events through InHandler and OutHandler:

/**
 * Collection of callbacks for an input port of a [[GraphStage]]
 */
trait InHandler {
  /**
   * Called when the input port has a new element available. The actual element can be retrieved via the
   * [[GraphStageLogic.grab()]] method.
   */
  @throws(classOf[Exception])
  def onPush(): Unit

  /**
   * Called when the input port is finished. After this callback no other callbacks will be called for this port.
   */
  @throws(classOf[Exception])
  def onUpstreamFinish(): Unit = GraphInterpreter.currentInterpreter.activeStage.completeStage()

  /**
   * Called when the input port has failed. After this callback no other callbacks will be called for this port.
   */
  @throws(classOf[Exception])
  def onUpstreamFailure(ex: Throwable): Unit = GraphInterpreter.currentInterpreter.activeStage.failStage(ex)
}

/**
 * Collection of callbacks for an output port of a [[GraphStage]]
 */
trait OutHandler {
  /**
   * Called when the output port has received a pull, and therefore ready to emit an element, i.e. [[GraphStageLogic.push()]]
   * is now allowed to be called on this port.
   */
  @throws(classOf[Exception])
  def onPull(): Unit

  /**
   * Called when the output port will no longer accept any new elements. After this callback no other callbacks will
   * be called for this port.
   */
  @throws(classOf[Exception])
  def onDownstreamFinish(): Unit = {
    GraphInterpreter
      .currentInterpreter
      .activeStage
      .completeStage()
  }
}

As you can see, we need to implement InHandler.onPush() and OutHandler.onPull(). akka-stream follows the Reactive Streams specification at every link of a stream, so the InHandler of an input port must respond to the upstream push signal, onPush, and the OutHandler of an output port must respond to the downstream demand signal, onPull. The component itself calls pull(in) on its input port and push(out, elem) on its output port.
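The interplay of the two handlers is easiest to see in a one-to-one stage. The following is a minimal sketch of a map-like Flow, assuming the Akka 2.5-style ActorMaterializer used in this article; MapStage and MapStageDemo are illustrative names, not akka-stream classes.

```scala
import akka.actor.ActorSystem
import akka.stream._
import akka.stream.scaladsl.{Sink, Source}
import akka.stream.stage._
import scala.concurrent.Await
import scala.concurrent.duration._

// Hypothetical stage: applies f to every element, one in, one out.
class MapStage[A, B](f: A => B) extends GraphStage[FlowShape[A, B]] {
  val in: Inlet[A] = Inlet[A]("MapStage.in")
  val out: Outlet[B] = Outlet[B]("MapStage.out")
  override val shape: FlowShape[A, B] = FlowShape(in, out)

  override def createLogic(inheritedAttributes: Attributes): GraphStageLogic =
    new GraphStageLogic(shape) {
      setHandler(in, new InHandler {
        // Upstream pushed an element: transform it and push it downstream.
        override def onPush(): Unit = push(out, f(grab(in)))
      })
      setHandler(out, new OutHandler {
        // Downstream asked for an element: forward the demand upstream.
        override def onPull(): Unit = pull(in)
      })
    }
}

object MapStageDemo {
  def run(): Seq[Int] = {
    implicit val sys: ActorSystem = ActorSystem("mapStageDemo")
    implicit val mat: ActorMaterializer = ActorMaterializer()
    try Await.result(
      Source(1 to 3).via(new MapStage[Int, Int](_ * 2)).runWith(Sink.seq),
      5.seconds)
    finally sys.terminate()
  }
}
```

Each onPull leads to exactly one pull(in), and each onPush to exactly one push(out, ...), so demand from downstream travels all the way to the source.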

Let's demonstrate by designing a Source that emits a given sequence of strings in an endless loop. A Source has only one output port, so we only need to watch for the demand signal from downstream of that port. In this case it is enough to implement an OutHandler:

class AlphaSource(chars: Seq[String]) extends GraphStage[SourceShape[String]] {
  val outport = Outlet[String]("output")
  val shape = SourceShape(outport)

  override def createLogic(inheritedAttributes: Attributes): GraphStageLogic =
    new GraphStageLogic(shape) {
      var pos: Int = 0
      setHandler(outport,new OutHandler {
        override def onPull(): Unit = {
          push(outport,chars(pos))
          pos += 1
          if (pos == chars.length) pos = 0
        }
      })
    }
}

The GraphStage class is a subclass of Graph:

abstract class GraphStage[S <: Shape] extends GraphStageWithMaterializedValue[S, NotUsed] {...}
abstract class GraphStageWithMaterializedValue[+S <: Shape, +M] extends Graph[S, M] {...}

So we can use AlphaSource as Graph and then use Source.fromGraph to build Source components:

  val sourceGraph: Graph[SourceShape[String],NotUsed] = new AlphaSource(Seq("A","B","C","D"))
  val alphaSource = Source.fromGraph(sourceGraph).delay(1.second,DelayOverflowStrategy.backpressure)
  alphaSource.runWith(Sink.foreach(println))

The same goes for a Sink: we only need to watch for the upstream push signal and read the data:

class UppercaseSink extends GraphStage[SinkShape[String]] {
  val inport = Inlet[String]("input")
  val shape = SinkShape(inport)

  override def createLogic(inheritedAttributes: Attributes): GraphStageLogic =
    new GraphStageLogic(shape) with InHandler {

      override def preStart(): Unit = pull(inport)

      override def onPush(): Unit = {
        println(grab(inport).toUpperCase)
        pull(inport)
      }

      setHandler(inport,this)

    }
}

With AlphaSource and UppercaseSink above we have had a first taste of controlling the flow of stream elements, mainly by responding passively to state changes on the output and input ports and operating the ports through push and pull. Here are the common port events and the operations available:

State changes on an output port are captured by the callbacks in OutHandler, which is registered with setHandler(out, outHandler). The operations available on an output port are:

1. push(out, elem): push an element to the port. It may be called only after downstream has requested data with a pull, and not more than once per request.

2. complete(out): close the port normally.

3. fail(out, exception): close the port with a failure.

Output port events:

1. onPull(): downstream is ready to receive data, so push(out, elem) may now be used to send an element to the port.

2. onDownstreamFinish(): downstream has stopped reading. No further onPull events will arrive.

The following functions report the current state of an output port:

1. isAvailable(out): true means push(out, elem) may be called now.

2. isClosed(out): true means the port has been closed. It receives no more events and cannot be pushed to.

Similarly, state changes on an input port are captured by the callbacks in the InHandler registered with setHandler(in, inHandler). The operations available on an input port are:

1. pull(in): request an element from upstream. It may be called again only after upstream has pushed the previously requested element. Calling it twice in a row is not allowed.

2. grab(in): read the current element from the port. It may be called only after upstream has pushed an element, and only once per pushed element.

3. cancel(in): close the input port manually.

Input port events:

1. onPush(): upstream has sent an element to the port. grab(in) can now be used to read it, and pull(in) to request the next one.

2. onUpstreamFinish(): upstream has finished sending data. No further onPush events will arrive, and pull(in) may no longer be called.

3. onUpstreamFailure(): upstream terminated with a failure.

The following functions report the current state of an input port:

1. isAvailable(in): true means grab(in) may be called now to read the current element.

2. hasBeenPulled(in): true means an element has already been requested with pull(in). In this state pull(in) must not be called again.

3. isClosed(in): true means the port has been closed. pull(in) can no longer be called and no more onPush events will arrive.
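As an illustration of push(out, elem) and complete(out) working together, here is a minimal sketch of a finite source that counts down and then closes its port. It assumes the Akka 2.5-style ActorMaterializer used in this article; CountdownSource and CountdownDemo are hypothetical names.

```scala
import akka.actor.ActorSystem
import akka.stream._
import akka.stream.scaladsl.{Sink, Source}
import akka.stream.stage._
import scala.concurrent.Await
import scala.concurrent.duration._

// Hypothetical stage: emits from, from - 1, ..., 1 and then completes.
class CountdownSource(from: Int) extends GraphStage[SourceShape[Int]] {
  val out: Outlet[Int] = Outlet[Int]("CountdownSource.out")
  override val shape: SourceShape[Int] = SourceShape(out)

  override def createLogic(inheritedAttributes: Attributes): GraphStageLogic =
    new GraphStageLogic(shape) {
      // Mutable state is safe here: it is only touched from the callbacks.
      private var next = from

      setHandler(out, new OutHandler {
        override def onPull(): Unit =
          if (next > 0) {
            push(out, next) // exactly one push per onPull
            next -= 1
          } else complete(out) // nothing left: close the port normally
      })
    }
}

object CountdownDemo {
  def run(): Seq[Int] = {
    implicit val sys: ActorSystem = ActorSystem("countdownDemo")
    implicit val mat: ActorMaterializer = ActorMaterializer()
    try Await.result(Source.fromGraph(new CountdownSource(3)).runWith(Sink.seq), 5.seconds)
    finally sys.terminate()
  }
}
```

Completing the only output port closes all ports of the stage, so the stage logic itself finishes and postStop is called.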

From the descriptions of pull(in) and push(out, elem) above, it follows that the two strictly alternate on any connection: a stage may call push(out, elem) only after downstream has pulled, and may call pull(in) again only after upstream has pushed the previously requested element. This is easy to understand, because akka-stream implements Reactive Streams, where upstream and downstream communicate through push and pull. But this is inconvenient in some scenarios, such as flow control. akka-stream therefore also provides a simpler API that lets users operate ports more flexibly. It includes the following functions:

1. emit(out, elem): temporarily replace the OutHandler, send elem to the port, then restore the OutHandler.

2. emitMultiple(out, Iterable(e1, e2, e3, ...)): temporarily replace the OutHandler, send a sequence of elements to the port, then restore the OutHandler.

3. read(in)(andThen): temporarily replace the InHandler, read one element from the port, then restore the InHandler.

4. readN(in, n)(andThen): temporarily replace the InHandler, read n elements from the port, then restore the InHandler.

5. abortEmitting(): cancel an emit on the output port that has not yet completed.

6. abortReading(): cancel a read on the input port that has not yet completed.
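A short sketch of emitMultiple in use: a stage that sends every incoming element downstream twice. It assumes the Akka 2.5-style ActorMaterializer used in this article; Duplicator and DuplicatorDemo are illustrative names.

```scala
import akka.actor.ActorSystem
import akka.stream._
import akka.stream.scaladsl.{Sink, Source}
import akka.stream.stage._
import scala.concurrent.Await
import scala.concurrent.duration._

// Hypothetical stage: emits each incoming element twice.
class Duplicator[A] extends GraphStage[FlowShape[A, A]] {
  val in: Inlet[A] = Inlet[A]("Duplicator.in")
  val out: Outlet[A] = Outlet[A]("Duplicator.out")
  override val shape: FlowShape[A, A] = FlowShape(in, out)

  override def createLogic(inheritedAttributes: Attributes): GraphStageLogic =
    new GraphStageLogic(shape) {
      setHandler(in, new InHandler {
        override def onPush(): Unit = {
          val elem = grab(in)
          // The first copy goes out at once because downstream has pulled.
          // The second waits for the next pull; until then the OutHandler
          // below is temporarily replaced and restored afterwards.
          emitMultiple(out, List(elem, elem))
        }
      })
      setHandler(out, new OutHandler {
        override def onPull(): Unit = pull(in)
      })
    }
}

object DuplicatorDemo {
  def run(): Seq[Int] = {
    implicit val sys: ActorSystem = ActorSystem("duplicatorDemo")
    implicit val mat: ActorMaterializer = ActorMaterializer()
    try Await.result(Source(1 to 3).via(new Duplicator[Int]).runWith(Sink.seq), 5.seconds)
    finally sys.terminate()
  }
}
```

Written with plain push and pull, the same stage would have to keep the pending second copy in a variable and deal with upstream finishing in between; emitMultiple takes care of both.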

This API still honors Reactive Streams backpressure, as the source code of emitMultiple shows:

  /**
   * Emit a sequence of elements through the given outlet and continue with the given thunk
   * afterwards, suspending execution if necessary.
   * This action replaces the [[OutHandler]] for the given outlet if suspension
   * is needed and reinstalls the current handler upon receiving an `onPull()`
   * signal (before invoking the `andThen` function).
   */
  final protected def emitMultiple[T](out: Outlet[T], elems: Iterator[T], andThen: () ⇒ Unit): Unit =
    if (elems.hasNext) {
      if (isAvailable(out)) {
        push(out, elems.next())
        if (elems.hasNext)
          setOrAddEmitting(out, new EmittingIterator(out, elems, getNonEmittingHandler(out), andThen))
        else andThen()
      } else {
        setOrAddEmitting(out, new EmittingIterator(out, elems, getNonEmittingHandler(out), andThen))
      }
    } else andThen()

Next, we build a custom Flow GraphStage that uses the emit API to let a user-defined function control how elements flow and which ones are filtered out. A Flow has to watch both the push signals from upstream on its input port and the demand signals from downstream on its output port.

trait Row
trait Move
case object Stand extends Move
case class Next(rows: Iterable[Row]) extends Move

class FlowValve(controller: Row => Move) extends GraphStage[FlowShape[Row,Row]] {
  val inport = Inlet[Row]("input")
  val outport = Outlet[Row]("output")
  val shape = FlowShape.of(inport,outport)

  override def createLogic(inheritedAttributes: Attributes): GraphStageLogic =
    new GraphStageLogic(shape) with InHandler with OutHandler {
      override def onPush(): Unit = {
        controller(grab(inport)) match {
          case Next(rows) => emitMultiple(outport,rows)
          case _ => pull(inport)
        }
      }
      override def onPull(): Unit = pull(inport)
      setHandlers(inport,outport,this)
    }
}

The FlowValve type above exists to apply a user-defined controller function. Based on the element pushed from upstream, the controller returns either Stand, which skips the current element, or Next(...), which sends one or more elements downstream. When downstream signals that it can accept data, FlowValve passes the pull request straight through to upstream. Here is an example of a user-defined function:

  case class Order(burger: String, qty: Int) extends Row
  case class Burger(msg: String) extends Row

  def orderDeliver: Row => Move = order => {
    order match {
      case Order(name,qty) =>

        if (qty > 0) {
          val burgers: Iterable[Burger] =
            (1 to qty).foldLeft(Iterable[Burger]()) { (b, a) =>
              b ++ Iterable(Burger(s"$name $a of ${qty}"))
            }
          Next(burgers)
        } else Stand
    }
  }


  val flowGraph: Graph[FlowShape[Row,Row],NotUsed] = new FlowValve(orderDeliver)
  val deliverFlow: Flow[Row,Row,NotUsed] = Flow.fromGraph(flowGraph)
  val orders = List(Order("cheeze",2),Order("beef",3),Order("pepper",1),Order("Rice",0)
                    ,Order("plain",1),Order("beef",2))

  Source(orders).via(deliverFlow).to(Sink.foreach(println)).run()

Running it produces the following output:

Burger(cheeze 1 of 2)
Burger(cheeze 2 of 2)
Burger(beef 1 of 3)
Burger(beef 2 of 3)
Burger(beef 3 of 3)
Burger(pepper 1 of 1)
Burger(plain 1 of 1)
Burger(beef 1 of 2)
Burger(beef 2 of 2)

That's exactly what we expected. For one-to-many fan-out and many-to-one fan-in, akka-stream already provides stages based on UniformFanOutShape and UniformFanInShape. Combining them yields many-to-many topologies, so the predefined stages are usually sufficient.
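As a sketch of combining predefined fan-out and fan-in stages, the following wires akka-stream's Broadcast (a UniformFanOutShape) and Merge (a UniformFanInShape) into a single Flow with GraphDSL. The object FanDemo and the two branch functions are made up for illustration, and the Akka 2.5-style ActorMaterializer is assumed as elsewhere in this article.

```scala
import akka.NotUsed
import akka.actor.ActorSystem
import akka.stream.{ActorMaterializer, FlowShape}
import akka.stream.scaladsl.{Broadcast, Flow, GraphDSL, Merge, Sink, Source}
import scala.concurrent.Await
import scala.concurrent.duration._

// Hypothetical demo object; Broadcast and Merge are the predefined stages.
object FanDemo {
  val fanOutIn: Flow[Int, Int, NotUsed] =
    Flow.fromGraph(GraphDSL.create() { implicit b =>
      import GraphDSL.Implicits._
      val bcast = b.add(Broadcast[Int](2)) // one input, two outputs
      val merge = b.add(Merge[Int](2))     // two inputs, one output

      bcast ~> Flow[Int].map(_ * 10) ~> merge
      bcast ~> Flow[Int].map(_ + 1)  ~> merge

      // Expose the remaining open ports as a Flow-shaped graph.
      FlowShape(bcast.in, merge.out)
    })

  def run(): Seq[Int] = {
    implicit val sys: ActorSystem = ActorSystem("fanDemo")
    implicit val mat: ActorMaterializer = ActorMaterializer()
    // Merge does not guarantee ordering across branches, so sort the result.
    try Await.result(Source(1 to 3).via(fanOutIn).runWith(Sink.seq), 5.seconds).sorted
    finally sys.terminate()
  }
}
```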

The following is the source code involved in this demonstration:

import akka.NotUsed
import akka.actor._
import akka.stream.ActorMaterializer
import akka.stream.scaladsl._
import akka.stream.stage._
import akka.stream._
import scala.concurrent.duration._
import scala.collection.immutable.Iterable

class AlphaSource(chars: Seq[String]) extends GraphStage[SourceShape[String]] {
  val outport = Outlet[String]("output")
  val shape = SourceShape(outport)

  override def createLogic(inheritedAttributes: Attributes): GraphStageLogic =
    new GraphStageLogic(shape) {
      var pos: Int = 0
      setHandler(outport,new OutHandler {
        override def onPull(): Unit = {
          push(outport,chars(pos))
          pos += 1
          if (pos == chars.length) pos = 0
        }
      })
    }
}
class UppercaseSink extends GraphStage[SinkShape[String]] {
  val inport = Inlet[String]("input")
  val shape = SinkShape(inport)

  override def createLogic(inheritedAttributes: Attributes): GraphStageLogic =
    new GraphStageLogic(shape) with InHandler {

      override def preStart(): Unit = pull(inport)

      override def onPush(): Unit = {
        println(grab(inport).toUpperCase)
        pull(inport)
      }

      setHandler(inport,this)

    }
}

trait Row
trait Move
case object Stand extends Move
case class Next(rows: Iterable[Row]) extends Move

class FlowValve(controller: Row => Move) extends GraphStage[FlowShape[Row,Row]] {
  val inport = Inlet[Row]("input")
  val outport = Outlet[Row]("output")
  val shape = FlowShape.of(inport,outport)

  override def createLogic(inheritedAttributes: Attributes): GraphStageLogic =
    new GraphStageLogic(shape) with InHandler with OutHandler {
      override def onPush(): Unit = {
        controller(grab(inport)) match {
          case Next(rows) => emitMultiple(outport,rows)
          case _ => pull(inport)
        }
      }
      override def onPull(): Unit = pull(inport)
      setHandlers(inport,outport,this)
    }
}


object GraphStages extends App {
  implicit val sys = ActorSystem("demoSys")
  implicit val ec = sys.dispatcher
  implicit val mat = ActorMaterializer(
    ActorMaterializerSettings(sys)
      .withInputBuffer(initialSize = 16, maxSize = 16)
  )

  val sourceGraph: Graph[SourceShape[String],NotUsed] = new AlphaSource(Seq("a","b","c","d"))
  val alphaSource = Source.fromGraph(sourceGraph).delay(1.second,DelayOverflowStrategy.backpressure)
  // alphaSource.runWith(Sink.foreach(println))

  val sinkGraph: Graph[SinkShape[String],NotUsed] = new UppercaseSink
  val upperSink = Sink.fromGraph(sinkGraph)
  alphaSource.runWith(upperSink)

  case class Order(burger: String, qty: Int) extends Row
  case class Burger(msg: String) extends Row

  def orderDeliver: Row => Move = order => {
    order match {
      case Order(name,qty) =>

        if (qty > 0) {
          val burgers: Iterable[Burger] =
            (1 to qty).foldLeft(Iterable[Burger]()) { (b, a) =>
              b ++ Iterable(Burger(s"$name $a of ${qty}"))
            }
          Next(burgers)
        } else Stand
    }
  }


  val flowGraph: Graph[FlowShape[Row,Row],NotUsed] = new FlowValve(orderDeliver)
  val deliverFlow: Flow[Row,Row,NotUsed] = Flow.fromGraph(flowGraph)
  val orders = List(Order("cheeze",2),Order("beef",3),Order("pepper",1),Order("Rice",0)
                    ,Order("plain",1),Order("beef",2))

  Source(orders).via(deliverFlow).to(Sink.foreach(println)).run()


  // Source(1 to 10).runWith(Sink.foreach(println))

  scala.io.StdIn.readLine()
  sys.terminate()

}

Posted by pbsonawane on Sun, 06 Jan 2019 14:18:09 -0800