Reading CSV files with Spark: explanation of the option parameters

Keywords: Scala Spark Machine Learning

 

import com.bean.Yyds1
import org.apache.spark.sql.SparkSession

object TestReadCSV {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CSV Reader")
      .master("local")
      .getOrCreate()
    /**
     * Options can be passed as strings or as the corresponding types, such as boolean.
     * delimiter                  field separator, comma by default
     * nullValue                  string that represents a null value; assign null to use no marker at all
     * quote                      quote character, double quote by default
     * header                     treat the first row as column names instead of data
     * inferSchema                automatically infer the column types
     * ignoreLeadingWhiteSpace    trim leading whitespace
     * ignoreTrailingWhiteSpace   trim trailing whitespace
     * multiLine                  allow a single record to span multiple lines
     * encoding                   character encoding, e.g. GBK / UTF-8 / GB2312
     */

    import spark.implicits._

    val result = spark.read.format("csv")
      .option("delimiter", "\\t")              // "\\t" is the two-character string \t, which Spark interprets as a tab
      .option("encoding", "GB2312")
      .option("enforceSchema", false)
      .option("header", "true")
//      .option("header", false)
      .option("quote", "'")
      .option("nullValue", "\\N")              // treat the literal text \N as null
      .option("ignoreLeadingWhiteSpace", false)
      .option("ignoreTrailingWhiteSpace", false)
//      .option("nullValue", null)             // would override the \N setting above; only one nullValue takes effect
      .option("multiLine", "true")
      .load("G:\\python\\yyds\\yyds_1120_tab.csv").as[Yyds1] // yyds_1120_tab.csv  aa1.csv  yyds_20211120  yyds_1120_tab2_utf-8


    result.map(row => {
      row.ji_check_cnt.toInt                   // convert the string column to Int
    }).foreachPartition(a => a.foreach(println))

  }
}
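The class com.bean.Yyds1 used with .as[Yyds1] is not shown in this post. A minimal sketch of what such a case class could look like, assuming only the field actually referenced above (ji_check_cnt, read as a string and converted with .toInt), is:

package com.bean

// Hypothetical sketch of the case class used with .as[Yyds1] above.
// Only ji_check_cnt is referenced in the example; the real class would
// declare one field per CSV column, named exactly like the header row.
case class Yyds1(
  ji_check_cnt: String
)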

 

pom.xml dependencies

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>org.example</groupId>
    <artifactId>TmLimitPredict</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <log4j.version>1.2.17</log4j.version>
        <slf4j.version>1.7.22</slf4j.version>
        <!--0.8.2-beta   0.8.2.0    0.8.2.1   0.8.2.2   0.9.0.1   0.10.0.0
                        0.10.1.0   0.10.0.1   0.10.2.0   1.0.0   2.8.0-->
        <kafka.version>2.8.0</kafka.version>
        <spark.version>2.2.0</spark.version>
        <scala.version>2.11.8</scala.version>
        <jblas.version>1.2.1</jblas.version>
        <hadoop.version>2.7.3</hadoop.version>
    </properties>


    <dependencies>
        <!--Introducing a common log management tool-->
        <!--        <dependency>-->
        <!--            <groupId>org.slf4j</groupId>-->
        <!--            <artifactId>jcl-over-slf4j</artifactId>-->
        <!--            <version>${slf4j.version}</version>-->
        <!--        </dependency>-->
        <!--        <dependency>-->
        <!--            <groupId>org.slf4j</groupId>-->
        <!--            <artifactId>slf4j-api</artifactId>-->
        <!--            <version>${slf4j.version}</version>-->
        <!--        </dependency>-->
        <!--        <dependency>-->
        <!--            <groupId>org.slf4j</groupId>-->
        <!--            <artifactId>slf4j-log4j12</artifactId>-->
        <!--            <version>${slf4j.version}</version>-->
        <!--        </dependency>-->
        <!--        <dependency>-->
        <!--            <groupId>log4j</groupId>-->
        <!--            <artifactId>log4j</artifactId>-->
        <!--            <version>${log4j.version}</version>-->
        <!--        </dependency>-->
        <!-- Spark and Hadoop dependencies -->

        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-common</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-mapreduce-client-core</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>${hadoop.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-hdfs</artifactId>
            <version>${hadoop.version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>${spark.version}</version>
            <!--<scope>provided</scope>-->
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>${spark.version}</version>
            <exclusions>
                <exclusion>
                    <groupId>com.google.guava</groupId>
                    <artifactId>guava</artifactId>
                </exclusion>
            </exclusions>
            <!--<scope>provided</scope>-->
        </dependency>
        <dependency>
            <groupId>com.google.guava</groupId>
            <artifactId>guava</artifactId>
            <version>15.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-hive_2.11</artifactId>
            <version>${spark.version}</version>
            <!--<scope>provided</scope>-->
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.11</artifactId>
            <version>${spark.version}</version>
            <!--<scope>provided</scope>-->
        </dependency>
        <!-- Introduce Scala -->
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>${scala.version}</version>
            <!--<scope>provided</scope>-->
        </dependency>

        <!--MLlib-->
        <dependency>
            <groupId>org.scalanlp</groupId>
            <artifactId>jblas</artifactId>
            <version>${jblas.version}</version>
            <!--<scope>provided</scope>-->
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-mllib_2.11</artifactId>
            <version>${spark.version}</version>
            <!--<scope>provided</scope>-->
        </dependency>


        <!-- kafka -->
        <dependency>
            <groupId>org.apache.kafka</groupId>
            <artifactId>kafka-clients</artifactId>
            <version>${kafka.version}</version>
            <!--<scope>provided</scope>-->
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
            <version>${spark.version}</version>
            <!--<scope>provided</scope>-->
        </dependency>
        <dependency>
            <groupId>com.sf.kafka</groupId>
            <artifactId>sf-kafka-api-core</artifactId>
            <version>2.4.1</version>
            <!--<scope>provided</scope>-->
        </dependency>

        <!-- Lombok: generates getter/setter methods -->
        <dependency>
            <groupId>org.projectlombok</groupId>
            <artifactId>lombok</artifactId>
            <version>1.16.18</version>
            <scope>provided</scope>
        </dependency>
    </dependencies>


    <build>
        <!--    <sourceDirectory>src/main/scala</sourceDirectory>-->
        <sourceDirectory>src/main/java</sourceDirectory>
        <testSourceDirectory>src/test/scala</testSourceDirectory>

        <resources>
            <resource>
                <directory>src/main/resources</directory>
                <includes>
                    <include>**/*.properties</include>
                    <include>**/*.xml</include>
                </includes>
                <!-- Optionally exclude configuration files: keep this commented out while running in the IDE so the files are read from the classpath, and uncomment it when packaging so the configuration stays external and easy to modify. This can also be configured in maven-jar-plugin below. -->
                <!--<excludes>
                    <exclude>config.properties</exclude>
                </excludes>-->
            </resource>
            <!-- Externalized configuration resources (copied to the conf directory, which is also added to the classpath; configured below) -->
            <!--<resource>
                <directory>src/main/resources</directory>
                <includes>
                    <include>config.properties</include>
                </includes>
                <targetPath>${project.build.directory}/conf</targetPath>
            </resource>-->
        </resources>

        <plugins>
            <!-- Scala compile and packaging plugin -->
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>3.2.2</version>
                <!--                <groupId>org.scala-tools</groupId>-->
                <!--                <artifactId>maven-scala-plugin</artifactId>-->
                <!--                <version>2.15.2</version>-->
                <executions>
                    <execution>
                        <id>scala-compile-first</id>
                        <phase>process-resources</phase>
                        <goals>
                            <goal>add-source</goal>
                            <goal>compile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>

            <!-- Java compile and packaging plugin -->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.1</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                    <encoding>UTF-8</encoding>
                </configuration>
                <executions>
                    <execution>
                        <phase>compile</phase>
                        <goals>
                            <goal>compile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>

            <!--
                Copy dependencies so the project can be shipped as a zip: when deploying, copy the zip to the server and unzip it; it contains the application jar, the dependent libs, and the config files, so the service can be started directly.
            -->
            <plugin>
                <artifactId>maven-dependency-plugin</artifactId>
                <executions>
                    <execution>
                        <phase>process-sources</phase>

                        <goals>
                            <goal>copy-dependencies</goal>
                        </goals>

                        <configuration>
                            <excludeScope>provided</excludeScope>
                            <outputDirectory>${project.build.directory}/lib</outputDirectory>
                        </configuration>

                    </execution>
                </executions>
            </plugin>
            <!--The configuration of maven-jar-plugin-->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-jar-plugin</artifactId>
                <version>2.4</version>
                <!--The configuration of the plugin-->
                <configuration>
                    <!-- Do not package resource files (configuration files separate from dependent packages) -->
                    <excludes>
                        <!--            <exclude>*.properties</exclude>-->
                        <!--            <exclude>*.xml</exclude>-->
                        <exclude>*.txt</exclude>
                    </excludes>
                    <!--Configuration of the archiver-->
                    <archive>
                        <!-- Do not include pom.xml and pom.properties in the generated jar -->
                        <addMavenDescriptor>false</addMavenDescriptor>
                        <!--Manifest specific configuration-->
                        <manifest>
                            <!-- Whether to add third-party jars to the classpath entry in the manifest -->
                            <!--              <addClasspath>true</addClasspath>-->
                            <addClasspath>false</addClasspath>
                            <!-- Classpath prefix in the generated manifest; third-party jars are placed in the lib directory, so the prefix is lib/ -->
                            <classpathPrefix>lib/</classpathPrefix>
                            <!-- Main class of the application -->
                            <!--              <mainClass>com.sf.tmlimit.TmLimitPredStream</mainClass>-->
                            <mainClass>ConnectKafkaTest</mainClass>
                        </manifest>
                        <!-- Add key-value pairs to the manifest to extend the classpath; the conf directory can also be added to the classpath here -->
                        <manifestEntries>
                            <!--              <Class-Path>conf/</Class-Path>-->
                            <Class-Path>lib/</Class-Path>
                        </manifestEntries>
                    </archive>
                    <!-- Filter out jar files that should not be included -->
                    <!-- <excludes>
                         <exclude>${project.basedir}/xml/*</exclude>
                     </excludes>-->
                </configuration>
            </plugin>

            <!--The configuration of maven-assembly-plugin-->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <version>2.4</version>
                <!--The configuration of the plugin-->
                <configuration>
                    <!--Specifies the configuration file of the assembly plugin-->
                    <descriptors>
                        <descriptor>src/main/assembly/assembly.xml</descriptor>
                    </descriptors>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>


</project>

 

 

Parameter explanations
sep: defaults to the comma (,); a single character used to separate fields in each record.
encoding: defaults to UTF-8; decodes the file using the given encoding type.
quote: defaults to the double quote ("); a single character used to quote values in which the separator appears. To turn quoting off, set an empty string rather than null.
escape: defaults to backslash (\); a single character used to escape quotes inside an already quoted value.
charToEscapeQuoteEscaping: defaults to the escape character (above) or \0; a single character used to escape the escape character itself. When the escape character and the quote character are different, the default is the escape character, otherwise \0. A quoting sketch follows below.
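For illustration, a small sketch of quote and escape working together when a value contains the separator (it reuses the spark session built in the code above; the path and data layout are placeholders):

// Sketch: a line such as   1,"Smith, John"   keeps the comma inside the
// quoted value; a quote inside a quoted value would be escaped with \.
val quotedDf = spark.read
  .option("header", "true")
  .option("quote", "\"")        // values may be wrapped in double quotes
  .option("escape", "\\")       // backslash escapes a quote inside a quoted value
  .csv("/path/to/quoted.csv")   // placeholder path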
comment: disabled by default; sets a single character, and lines beginning with that character are skipped.
header: defaults to false; whether to use the first row as column names instead of data.
enforceSchema

Defaults to true. If set to true, the specified or inferred schema is forcibly applied to the data source files, and headers in the CSV files are ignored.

If the option is set to false and header is set to true, the schema is validated against all headers in the CSV files.

Field names in the schema and column names in the CSV headers are checked by their positions, taking spark.sql.caseSensitive into account.

Although the default value is true, it is recommended to disable enforceSchema to avoid incorrect results. See the sketch below.
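A minimal sketch of that recommendation (enforceSchema disabled, header enabled, explicit schema); the schema, column names and path are made up for illustration:

import org.apache.spark.sql.types._

// Sketch: with enforceSchema=false and header=true, Spark validates the
// header names in the file against the supplied schema (respecting
// spark.sql.caseSensitive) instead of silently applying the schema by position.
val schema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("name", StringType)
))

val checkedDf = spark.read
  .option("header", "true")
  .option("enforceSchema", "false")
  .schema(schema)
  .csv("/path/to/people.csv")   // placeholder path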

inferSchema: defaults to false; automatically infers the input schema from the data. Requires one extra pass over the data.
samplingRatio: defaults to 1.0; defines the fraction of rows used for schema inference.
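A small sketch combining the two options above (the path is a placeholder; note that samplingRatio for the CSV source only exists in Spark versions newer than the 2.2.0 declared in the pom above):

// Sketch: let Spark infer the column types, looking at only ~10% of the
// rows while doing so (schema inference needs an extra pass over the data).
val inferredDf = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("samplingRatio", "0.1")
  .csv("/path/to/big_file.csv")   // placeholder path

inferredDf.printSchema()          // shows the inferred column types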
ignoreLeadingWhiteSpace: defaults to false; a flag indicating whether leading whitespace in values being read should be skipped.
ignoreTrailingWhiteSpace: defaults to false; a flag indicating whether trailing whitespace in values being read should be skipped.
nullValue: defaults to the empty string; sets the string representation of a null value. Since 2.0.1 this applies to all supported types, including string types.
emptyValue: defaults to the empty string; sets the string representation of an empty value.
nanValue: defaults to NaN; sets the string representation of a non-number value.
positiveInf: defaults to Inf; sets the string representation of positive infinity.
negativeInf: defaults to -Inf; sets the string representation of negative infinity.
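A sketch combining the null / NaN / infinity markers above (the marker strings and the path are only examples):

// Sketch: the two characters \N are read as null, and the markers below
// are recognized when parsing numeric columns.
val numericDf = spark.read
  .option("header", "true")
  .option("nullValue", "\\N")
  .option("nanValue", "NaN")
  .option("positiveInf", "Inf")
  .option("negativeInf", "-Inf")
  .csv("/path/to/metrics.csv")   // placeholder path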
dateFormat

Defaults to yyyy-MM-dd; sets the string that indicates the date format.

Custom date formats follow java.text.SimpleDateFormat. This applies to date types.

timestampFormat

Defaults to yyyy-MM-dd'T'HH:mm:ss.SSSXXX; sets the string that indicates the timestamp format.

Custom formats also follow java.text.SimpleDateFormat. This applies to timestamp types.
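For example, if a file stores dates as 21/11/2021 and timestamps as 2021-11-21 12:20:33, the two options could be set as below (the column names, formats and path are assumptions for illustration):

import org.apache.spark.sql.types._

// Sketch: parse custom date and timestamp formats while reading the CSV.
// The patterns follow java.text.SimpleDateFormat.
val datedSchema = StructType(Seq(
  StructField("event_date", DateType),
  StructField("event_time", TimestampType)
))

val datedDf = spark.read
  .option("header", "true")
  .option("dateFormat", "dd/MM/yyyy")
  .option("timestampFormat", "yyyy-MM-dd HH:mm:ss")
  .schema(datedSchema)
  .csv("/path/to/events.csv")   // placeholder path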

maxColumns: defaults to 20480; a hard limit on how many columns a record may have.
maxCharsPerColumn: defaults to -1; the maximum number of characters allowed for any single value being read. -1 means unlimited length.
mode

Defaults to PERMISSIVE; controls how corrupt records are handled during parsing. The following case-insensitive modes are supported.

Note that Spark tries to parse only the columns required by the query under column pruning, so which records are considered corrupt can vary with the set of fields requested.

This behaviour can be controlled through spark.sql.csv.parser.columnPruning.enabled (enabled by default).

   
Values of the mode option: ---------------------------------------------------
PERMISSIVE

When a corrupted record is encountered, the malformed string is placed in the field configured by columnNameOfCorruptRecord, and the other fields are set to null.

To keep corrupted records, the user can define a field named by columnNameOfCorruptRecord in a user-defined schema.

DROPMALFORMED: ignores the whole corrupted record.
FAILFAST: throws an exception when a corrupted record is encountered.
---- End of the mode values -------------------------------------------------------
   
columnNameOfCorruptRecord: defaults to the value of spark.sql.columnNameOfCorruptRecord; renames the field that holds the malformed records created by PERMISSIVE mode. Setting this option overrides spark.sql.columnNameOfCorruptRecord. See the sketch below.
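Putting the mode values and columnNameOfCorruptRecord together, a hedged sketch of PERMISSIVE parsing that keeps malformed lines in a dedicated column (schema, column names and path are made up):

import org.apache.spark.sql.types._

// Sketch: in PERMISSIVE mode a malformed line is kept; its raw text goes
// into the _corrupt_record column declared in the schema and the other
// fields become null. DROPMALFORMED would drop the line instead, and
// FAILFAST would throw on the first bad record.
val permissiveSchema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("amount", DoubleType),
  StructField("_corrupt_record", StringType)
))

val parsedDf = spark.read
  .option("header", "true")
  .option("mode", "PERMISSIVE")
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .schema(permissiveSchema)
  .csv("/path/to/orders.csv")   // placeholder path

// The malformed lines can then be inspected separately:
parsedDf.filter(parsedDf("_corrupt_record").isNotNull).show(false)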
multiLine: defaults to false; allows a single record to span multiple lines, as in the sketch below.
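A short sketch of multiLine for files where a quoted value contains embedded line breaks (the path is a placeholder):

// Sketch: multiLine lets one CSV record span several physical lines,
// e.g. a quoted address value that contains a newline character.
val multiLineDf = spark.read
  .option("header", "true")
  .option("quote", "\"")
  .option("escape", "\"")         // a doubled quote escapes a quote inside a value
  .option("multiLine", "true")
  .csv("/path/to/addresses.csv")  // placeholder path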