Structured Streaming Simple Data Processing - Read CSV and extract column keywords

Keywords: Big Data Spark Python SQL Hadoop


Recently, when Baidu wants to learn Spark's newer Structured Streaming, all of them are monotonous wordcount s, which are quite speechless.You have to figure out for yourself what you can do with the Select and Filter operations of the Dataframe.Because of using Python, using Pandas, and trying to turn Pandas to process, readStream does not support the direct toPandas() method.Finally, when I flip over the official API, I find that there is another powerful operation for Dataframe that can be used in readStream: UDF.

Environmental preparation

  • Hadoop 2.8.5
  • Spark 2.4.3
  • Python 3.7.3
  • jieba (jieba word segmentation tool, provides TF-IDF keyword extraction method, pip install jieba)

The code below the program is executed in an interactive environment, pyspark.

Data preparation

id title_zh content_zh publish_date  

Suppose the CSV data, as shown in the table above, represents the article id, title, content, and publishing time, respectively.

There are requirements for extracting keywords from headings and adding keywords to new columns.(There are also keywords to extract articles, but the principle is the same, so you don't write more.)

Read data

There are two steps to reading a csv file: defining the schema and reading the file according to the schema.

Define schema:

In this case, id is of type Integer, publish_date is of type TimestampType, and the rest is StringType.First introduce partial dependency:

from pyspark.sql.functions import udf
from  pyspark.sql.types import StructType,StringType,IntegerType,TimestampType

Define a StructType:


Set up a listening folder that spark will automatically read to stream when a new CSV file is generated, and the path needs to indicate either hdfs:///or file://


Determine Stream by rcsv.isStreaming

Processing data

Processing data in three steps:

  1. Define a keyword extraction method that selects the most coefficientd words and eliminates pure numbers
  2. Construct the extraction method into Udf
  3. Call Udf using Dataframe's Conversion method

Finally, select the output mode, output destination, and output the results.The following code goes directly to the comment:

#Introducing jieba dependencies
import jieba
import jieba.analyse

def getTopWord(words):
        return ""
    while (words[wordc].isdigit()):
            return words[wordc]
    return words[wordc]

def getKeyword(ctx):
    #The extract_tags method has an optional topK=N parameter to extract N words, but there is another way to exclude pure numbers here, so use its default value
    return word;

#Parameter is (method name, return type), method can be lambda, return type is required.When invoked, udf is called line by line

#Use udf method for the column ['title_zh'] and select to generate a new column, alias() to define the alias, and finally get a new readStream'id','title_zh',getKeyword_udf(rcsv['title_zh']).alias('title_zh_keyword'),'publish_date')

#Set the output mode to update, complete, and Append; output to the command line.I can't tell for more details. I suggest looking at the official documents.


All the above code, in pyspark, just enter the code one by one, the path will be replaced according to your own actual, the data content and type can be adjusted according to your own preferences.

The CSV file for this example is shown in Fig.

The execution results are as follows:


PS: Miss Sister Weibo@is just a short story, invading and deleting ~ (I don't believe anyone will report me anyway)

Posted by bigphpn00b on Wed, 11 Sep 2019 16:56:03 -0700