Fixing the invisible first character when Java reads the first line of a UTF-8 text file

Keywords: encoding Java ascii

This post records a problem I ran into while working on a project.
When Java reads a UTF-8 file, there is a frustrating pitfall: if the file was saved as UTF-8 by UE (UltraEdit), Java reads an invisible character at the very start of the first line.

The test code is as follows:

package test;

import java.io.*;

public class HelloWorld {
    public static void main(String[] args) {
        String filePath = "C:\\Users\\16223\\Desktop\\hahaha2.txt";
        // Detect the file's encoding from its first bytes
        String codeString = codeString(filePath);
        System.out.println(codeString);

        File file = new File(filePath);
        BufferedReader reader = null;
        try {
            reader = new BufferedReader(new InputStreamReader(new FileInputStream(file),codeString));
            String tempchar;
            while ((tempchar = reader.readLine()) != null) {
                if (tempchar.isEmpty()) {
                    continue; // skip empty lines so charAt(0) cannot throw
                }
                // For a UTF-8 file with a BOM, the first character of the first line is the invisible U+FEFF
                char c = tempchar.charAt(0);
                System.out.println(c);
                /*if (c == 65279) {    // 65279 == U+FEFF, the byte order mark (BOM)
                    System.out.println("The first character is empty");
                    tempchar = tempchar.substring(1);
                }*/
                System.out.println(tempchar);
                System.out.println(tempchar.startsWith("create table"));
            }
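        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            // Close the reader so the file handle is released
            try {
                if (reader != null) {
                    reader.close();
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }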
    }

    public static String codeString(String filePath) {
        String encoding = null;
        File file = new File(filePath);
        BufferedInputStream bis = null;
        try {
            bis = new BufferedInputStream(new FileInputStream(file));
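            // Combine the first two bytes of the file and compare them with known BOM prefixes
            int p = (bis.read() << 8) + bis.read();
            switch (p) {
                case 0xefbb:    // EF BB (BF): UTF-8 BOM
                    encoding = "UTF-8";
                    break;
                case 0xfffe:    // FF FE: UTF-16 little-endian BOM
                    encoding = "UTF-16";
                    break;
                case 0xfeff:    // FE FF: UTF-16 big-endian BOM, not UTF-8
                    encoding = "UTF-16";
                    break;
                case 0x5c75:    // a backslash followed by the letter u, both ASCII
                    encoding = "ASCII";
                    break;
                default:        // no recognized BOM: fall back to GBK
                    encoding = "GBK";
            }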
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            try {
                if (bis != null) {
                    bis.close();
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
        return encoding;
    }
}

The results printed on the console are as follows:

UTF-8

create table aaa 'Ha ha ha';
false

The results show that the second line of output is blank. That line is printed by System.out.println(c), so c, the first character of the line, is an invisible character. As a consequence, startsWith("create table") returns false even though the line appears to start with that text.

The cause is how Java handles the BOM (Byte Order Mark). A UTF-8 file may begin with the three bytes EF BB BF, which mark the file as UTF-8 encoded; the codeString method above relies on these bytes to detect the encoding. Java's UTF-8 decoder does not discard them, though: it decodes them as the single character U+FEFF (65279 in decimal), and that is the invisible character at the start of the first line.
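
The effect can be reproduced without a file. The following is a minimal sketch (the class name BomDecodeDemo and the sample bytes are made up for illustration): it decodes the three BOM bytes followed by some ASCII text with Java's standard UTF-8 charset and shows that the BOM survives as one extra character.

```java
import java.nio.charset.StandardCharsets;

class BomDecodeDemo {
    // Decodes raw bytes as UTF-8, the same charset the test code passes to InputStreamReader.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // EF BB BF is the UTF-8 BOM; the remaining bytes are the ASCII text "create".
        byte[] bytes = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'c', 'r', 'e', 'a', 't', 'e'};
        String text = decode(bytes);
        System.out.println((int) text.charAt(0));      // prints 65279, i.e. U+FEFF
        System.out.println(text.length());             // prints 7: the BOM plus six letters
        System.out.println(text.startsWith("create")); // prints false
    }
}
```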

Solutions:
1. Save the file without a BOM.

GBK
c
create table aaa 'Crucible�';
true

The console output above is for the file saved without a BOM. The Chinese text is garbled, however, because the file was simply re-saved in a different format: without a BOM, codeString falls back to GBK, which presumably no longer matches the file's actual encoding.
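
If the editor cannot be reconfigured, the BOM can also be removed from an existing file's bytes programmatically. The following is a rough sketch and not part of the original test code; BomStripper and stripUtf8Bom are hypothetical names chosen for illustration.

```java
import java.util.Arrays;

class BomStripper {
    // Returns the data without a leading EF BB BF sequence; other input is returned unchanged.
    static byte[] stripUtf8Bom(byte[] data) {
        boolean hasBom = data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
        return hasBom ? Arrays.copyOfRange(data, 3, data.length) : data;
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'o', 'k'};
        System.out.println(stripUtf8Bom(withBom).length); // prints 2
    }
}
```

To apply this to a file, one could read it with Files.readAllBytes, pass the bytes through stripUtf8Bom, and write the result back with Files.write.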

2. If you cannot control how the file is saved, check the content after reading it: test whether the first character is the BOM character (65279, i.e. U+FEFF) and, if so, cut it off.

  // For a UTF-8 file with a BOM, the first character of the first line is U+FEFF
  char c = tempchar.charAt(0);
  System.out.println(c);
  if (c == 65279) {    // 65279 == U+FEFF, the byte order mark (BOM)
      System.out.println("The first character is empty");
      tempchar = tempchar.substring(1);
  }
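
The same check can be done once when the file is opened instead of on every line. The following is a minimal sketch under the assumption that the input is UTF-8; the class BomSkippingReader and its open method are hypothetical names, not an existing library API. It uses BufferedReader's mark and reset to peek at the first character and consume it only when it is U+FEFF.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

class BomSkippingReader {
    // Wraps the stream in a UTF-8 reader and consumes a leading U+FEFF if one is present.
    static BufferedReader open(InputStream in) throws IOException {
        BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        reader.mark(1);
        if (reader.read() != '\uFEFF') {
            reader.reset(); // not a BOM (or an empty stream): rewind to the start
        }
        return reader;
    }

    public static void main(String[] args) throws IOException {
        // Simulated file content: a BOM followed by two lines of text.
        byte[] bytes = "\uFEFFcreate table aaa;\nsecond line".getBytes(StandardCharsets.UTF_8);
        try (BufferedReader reader = open(new ByteArrayInputStream(bytes))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line.startsWith("create table") + " " + line);
            }
        }
    }
}
```

With the BOM consumed up front, the first line now passes the startsWith("create table") check, and the reading loop needs no special case.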

Posted by mayanktalwar1988 on Wed, 15 Jan 2020 00:14:46 -0800
