Read and write operations on MongoDB with Spark SQL (Python version)
1.1 Read MongoDB data
With Python, jobs are submitted either through the pyspark shell or with spark-submit.
- Starting with the pyspark shell:
1.1.1 Start the pyspark command line
```bash
# The locally installed Spark version is 2.3.1; for other versions, change the
# connector version number and the Scala version number accordingly
pyspark --packages org.mongodb.spark:mongo-spark-connector_2.11:2.3.1
```
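Once the shell is up, it predefines a SparkSession named `spark`; a quick sanity check that the running Spark version matches the connector artifact requested above:

```python
# The pyspark shell predefines a SparkSession called `spark`; the version
# should match the connector build requested at startup (2.3.1 here).
print(spark.version)
```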
1.1.2 Enter the following code in the pyspark shell:
```python
# Build the session, pointing the connector's input at the test.user collection
spark = SparkSession \
    .builder \
    .appName('MyApp') \
    .config('spark.mongodb.input.uri', 'mongodb://127.0.0.1/test.user') \
    .getOrCreate()

# Load the collection as a DataFrame and query it through a temp view
df = spark.read.format('com.mongodb.spark.sql.DefaultSource').load()
df.createOrReplaceTempView('user')
resDf = spark.sql('select name,age,sex from user')
resDf.show()

spark.stop()
exit(0)
```
Query output:
The results of the same query in the mongo shell:
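As a side note, the input URI can also be supplied per read rather than in the session config, which is handy for reading a second collection from the same session (before the spark.stop() above). A minimal sketch, where test.other is a hypothetical collection:

```python
# Minimal sketch: supply the input URI per read instead of via the session
# config; 'test.other' is a hypothetical collection used only for illustration.
otherDf = spark.read.format('com.mongodb.spark.sql.DefaultSource') \
    .option('uri', 'mongodb://127.0.0.1/test.other') \
    .load()
otherDf.show()
```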
- Starting with spark-submit:
1.1.3 Write the read_mongo.py script, which is as follows:
```python
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from pyspark.sql import SparkSession

# Started via pyspark: my local Spark is version 2.3.1. For other Spark
# versions the mongo-spark-connector version number differs; see the official
# MongoDB documentation for details.
# pyspark --packages org.mongodb.spark:mongo-spark-connector_2.11:2.3.1
#
# spark-submit submission (I only submit with nohup):
# nohup spark-submit --packages org.mongodb.spark:mongo-spark-connector_2.11:2.3.1 \
#     /Users/zhangzhiqiang/Documents/pythonproject/demo/mongodb-on-spark/read_mongo.py \
#     >> /Users/zhangzhiqiang/Documents/pythonproject/demo/mongodb-on-spark/read_mongo.log &

if __name__ == '__main__':
    # Build the session, pointing the connector's input at the test.user collection
    spark = SparkSession \
        .builder \
        .appName('MyApp') \
        .config('spark.mongodb.input.uri', 'mongodb://127.0.0.1/test.user') \
        .getOrCreate()

    # Load the collection as a DataFrame and query it through a temp view
    df = spark.read.format('com.mongodb.spark.sql.DefaultSource').load()
    df.createOrReplaceTempView('user')
    resDf = spark.sql('select name,age,sex from user')
    resDf.show()

    spark.stop()
    exit(0)
```
1.1.4 Submit using spark-submit
Here I submit with nohup and append the output to a log file.
```bash
nohup spark-submit --packages org.mongodb.spark:mongo-spark-connector_2.11:2.3.1 \
    /Users/zhangzhiqiang/Documents/pythonproject/demo/mongodb-on-spark/read_mongo.py \
    >> /Users/zhangzhiqiang/Documents/pythonproject/demo/mongodb-on-spark/read_mongo.log &
```
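Because of the >> redirection, the table printed by resDf.show() ends up in read_mongo.log rather than in the terminal, so check the log file for the query result once the background job finishes.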
1.2 Read MongoDB data with a schema constraint
1.2.1 Using the pyspark shell
Enter the following code at the command line:
```python
# Import the type classes needed to define the schema
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession \
    .builder \
    .appName('MyApp') \
    .config('spark.mongodb.input.uri', 'mongodb://127.0.0.1/test.user') \
    .getOrCreate()

# If the documents in MongoDB carry too many fields, a schema can also be used
# to filter out the unwanted ones:
# name is read as StringType, age as IntegerType
schema = StructType([
    StructField("name", StringType()),
    StructField("age", IntegerType())
])

df = spark.read.format('com.mongodb.spark.sql.DefaultSource').schema(schema).load()
df.createOrReplaceTempView('user')
resDf = spark.sql('select * from user')
resDf.show()

spark.stop()
exit(0)
```
Output:
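Before the spark.stop() call, df.printSchema() can confirm that only the constrained fields survived the read; the commented lines below are the output this schema should produce, not captured from a run:

```python
# Print the schema that the read applied; with the explicit schema above,
# only the two constrained fields should appear:
df.printSchema()
# root
#  |-- name: string (nullable = true)
#  |-- age: integer (nullable = true)
```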
1.3 Write MongoDB data
1.3.1 Using the pyspark shell
```python
# Import the type classes needed to define the schema
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession \
    .builder \
    .appName('MyApp') \
    .config('spark.mongodb.output.uri', 'mongodb://127.0.0.1/test.user') \
    .getOrCreate()

schema = StructType([
    StructField("name", StringType()),
    StructField("age", IntegerType()),
    StructField("sex", StringType())
])

# Build a small DataFrame and append it to the test.user collection
df = spark.createDataFrame([('caocao', 36, 'male'),
                            ('sunquan', 26, 'male'),
                            ('zhugeliang', 26, 'male')], schema)
df.show()
df.write.format('com.mongodb.spark.sql.DefaultSource').mode("append").save()

spark.stop()
exit(0)
```
Result:
The results of the query in the mongo shell:
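The target database and collection can also be set per write instead of in the session config; a minimal sketch, where the user_backup collection name is only an illustration:

```python
# Minimal sketch: target a specific database/collection for this write; the
# 'user_backup' collection name is hypothetical.
df.write.format('com.mongodb.spark.sql.DefaultSource') \
    .mode('append') \
    .option('database', 'test') \
    .option('collection', 'user_backup') \
    .save()
```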
Reference: Mongo on Spark Python