C2 A0 -> NO-BREAK SPACE: a special space in UTF-8 encoding

Keywords: Java encoding Database

Tracking down the abnormal data

I recently ran into a problem with abnormal field values in a database. In this business scenario the string field must not contain spaces, yet some records still had "spaces" in them. After repeated checks, I confirmed that the code I wrote really does call trim to strip spaces, and repeated debugging showed nothing wrong with the existing code. So how did this data slip past the business validation code?

Getting ready to crack the case

Could the "spaces" I see with the naked eye be something other than the spaces we normally see and understand?

With this question in mind, I searched for related problems and, sure enough, many people have run into the invisible character C2 A0. So what exactly is it?
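One way to check whether a "space" really is a space is to dump the UTF-8 bytes of the suspicious value as hex. The following is only a minimal sketch: the class name InvisibleCharDump, the helper utf8Hex, and the sample values are made up for illustration and are not part of the original code.

```java
import java.nio.charset.StandardCharsets;

class InvisibleCharDump {

    // Returns the UTF-8 bytes of s as space-separated upper-case hex, e.g. "61 C2 A0".
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask with 0xFF so negative bytes such as (byte) 0xC2 print as two hex digits.
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical field values: one ends in an ordinary space, one in U+00A0.
        String normal = "lingyejun ";
        String suspicious = "lingyejun\u00A0";
        System.out.println(utf8Hex(normal));     // last byte is 20
        System.out.println(utf8Hex(suspicious)); // last two bytes are C2 A0
    }
}
```

If the dump of a value ends in C2 A0 rather than 20, the trailing character is not the ordinary space that trim removes.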

Open a UTF-8 encoding table, for example https://www.utf8-chartable.de/unicode-utf8-table.pl?utf8=dec, and find the corresponding character.

First, let's pin down what the byte sequence C2 A0 represents. Converting the two hexadecimal bytes to decimal gives C2 = 194 and A0 = 160, which in the table corresponds to NO-BREAK SPACE (U+00A0).

An ordinary space, by contrast, is encoded as 32 (0x20).

So let's reproduce these two characters in code.

A normal space has Unicode code point U+0020, or 32.

The C2 A0 space (NO-BREAK SPACE) has Unicode code point U+00A0, or 160; C2 A0 is its two-byte UTF-8 encoding.
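To see why the two bytes C2 A0 map to code point 160, it helps to decode them by hand. A two-byte UTF-8 sequence has the bit layout 110xxxxx 10yyyyyy, and the code point is the x bits followed by the y bits. The sketch below is illustrative only: the class TwoByteUtf8Decode and the method decodeTwoByte are hypothetical names, and the check does not reject overlong forms the way a full decoder would.

```java
import java.nio.charset.StandardCharsets;

class TwoByteUtf8Decode {

    // Decodes a two-byte UTF-8 sequence (110xxxxx 10yyyyyy) into its code point.
    static int decodeTwoByte(int b1, int b2) {
        if ((b1 & 0xE0) != 0xC0 || (b2 & 0xC0) != 0x80) {
            throw new IllegalArgumentException("not a two-byte UTF-8 sequence");
        }
        // Keep the low 5 bits of the lead byte and the low 6 bits of the continuation byte.
        return ((b1 & 0x1F) << 6) | (b2 & 0x3F);
    }

    public static void main(String[] args) {
        // 0xC2 & 0x1F = 2, shifted left by 6 = 128; 0xA0 & 0x3F = 32; 128 | 32 = 160.
        int codePoint = decodeTwoByte(0xC2, 0xA0);
        System.out.println("U+00" + Integer.toHexString(codePoint).toUpperCase() + " = " + codePoint);

        // Cross-check against the JDK's own UTF-8 decoder.
        String decoded = new String(new byte[]{(byte) 0xC2, (byte) 0xA0}, StandardCharsets.UTF_8);
        System.out.println(decoded.codePointAt(0) == codePoint);
    }
}
```

The ordinary space needs no such arithmetic: code points below 128 are encoded as a single byte equal to the code point, so U+0020 is simply the byte 0x20.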

Now that we know the cause, let's try to get rid of this C2 A0 space.

The source code is below.

package com.lingyejun.dating.chap11;

import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SpecialSpace {

    public static void main(String[] args) {
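        // A string ending in an ordinary space (U+0020, decimal 32)
        String str1 = "lingyejun ";
        // Name the charset explicitly so the bytes do not depend on the platform default
        byte[] str1Bytes = str1.getBytes(StandardCharsets.UTF_8);
        String space = new String(str1Bytes, StandardCharsets.UTF_8);
        System.out.println("String with a 32 -> SPACE: " + space);
        System.out.println("After trim removes the 32 -> SPACE: " + space.trim());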

        // Copy the 10 bytes of str1 into an 11-byte array, then overwrite the
        // trailing space at index 9 with the two-byte sequence C2 A0
        byte[] str2Bytes = new byte[11];
        System.arraycopy(str1Bytes, 0, str2Bytes, 0, str1Bytes.length);
        str2Bytes[9] = (byte) 0xC2;
        str2Bytes[10] = (byte) 0xA0;
        String noBreakSpace = new String(str2Bytes, StandardCharsets.UTF_8);
        System.out.println("String with a C2 A0 -> NO-BREAK SPACE: " + noBreakSpace);
        System.out.println("trim cannot remove the C2 A0 -> NO-BREAK SPACE: " + noBreakSpace.trim());

        // 32 (0x20) is the ordinary space we usually mean -> SPACE
        byte[] bytes1 = new byte[]{(byte) 0x20};
        String space1 = new String(bytes1, StandardCharsets.UTF_8);
        System.out.println("UTF-8 byte 32 -> 0x20 output: " + space1);

        // 0xC2=194 0xA0=160  -> NO-BREAK SPACE
        byte[] bytes2 = new byte[]{(byte) 0xC2, (byte) 0xA0};
        String space2 = new String(bytes2, StandardCharsets.UTF_8);
        System.out.println("UTF-8 bytes 194 -> 0xC2, 160 -> 0xA0 output: " + space2);

        // Build a regex from the NO-BREAK SPACE character and remove every match
        byte[] bytes3 = new byte[]{(byte) 0xC2, (byte) 0xA0};
        String c2a0Space = new String(bytes3, StandardCharsets.UTF_8);
        Pattern p = Pattern.compile(c2a0Space);
        Matcher m = p.matcher(noBreakSpace);
        noBreakSpace = m.replaceAll("");
        System.out.println("After regex removal of the C2 A0 -> NO-BREAK SPACE: " + noBreakSpace);
    }
}
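The regex removal in the listing targets exactly one character, U+00A0. If other Unicode space characters might also show up in the field, a slightly more general trim may be worth considering. The following is a rough sketch under that assumption, not part of the original solution: the class SpaceTrimmer and its methods are hypothetical names, and it relies on Character.isSpaceChar, which returns true for U+00A0, whereas trim only strips characters up to U+0020.

```java
class SpaceTrimmer {

    // True for anything trim() would strip (<= U+0020), plus characters the JDK
    // classifies as whitespace or as a Unicode space separator such as U+00A0.
    static boolean isTrimmable(int cp) {
        return cp <= ' ' || Character.isWhitespace(cp) || Character.isSpaceChar(cp);
    }

    // Strips trimmable code points from both ends of s, leaving the middle untouched.
    static String trimAllSpaces(String s) {
        int start = 0;
        int end = s.length();
        while (start < end) {
            int cp = s.codePointAt(start);
            if (!isTrimmable(cp)) {
                break;
            }
            start += Character.charCount(cp);
        }
        while (end > start) {
            int cp = s.codePointBefore(end);
            if (!isTrimmable(cp)) {
                break;
            }
            end -= Character.charCount(cp);
        }
        return s.substring(start, end);
    }

    // Removes every NO-BREAK SPACE anywhere in s, without going through a regex.
    static String removeNoBreakSpaces(String s) {
        return s.replace("\u00A0", "");
    }

    public static void main(String[] args) {
        String value = "\u00A0 lingyejun \u00A0";
        System.out.println("[" + value.trim() + "]");         // NO-BREAK SPACEs survive
        System.out.println("[" + trimAllSpaces(value) + "]"); // both kinds of space removed
    }
}
```

Which variant fits depends on the business rule: removeNoBreakSpaces mirrors what the regex does, while trimAllSpaces only touches the ends of the string.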

If this helped you, please don't forget to give Ling Ye Jun a like.

Posted by 990805 on Sat, 06 Jun 2020 10:12:56 -0700