New APIs in ES6: String

Keywords: JavaScript Front-end ECMAScript

This article introduces several features that ES6 added for working with Unicode strings.

  1. String.prototype.codePointAt
    Function type:

    (index?: number)=> number|undefined

    codePointAt is a prototype method that returns the code point of the character at the given index. charCodeAt only returns a single 2-byte UTF-16 code unit, so it only handles Basic Multilingual Plane (BMP) characters correctly. codePointAt also recognizes surrogate pairs, that is, characters that take 4 bytes in UTF-16. In addition, when the index is out of bounds, codePointAt returns undefined, whereas charCodeAt returns NaN.

    Apart from these two differences, codePointAt and charCodeAt behave the same:

    • The default value of the index parameter is 0
    • When the character is in the basic plane character set, the results returned by the two are the same.

      const str = 'abc'; //The character 'a' is in the basic plane character set
      console.log(str.codePointAt(0));//97
      //The default value of index is 0
      console.log(str.codePointAt());//97
      //When the index is out of bounds, undefined is returned
      console.log(str.codePointAt(5));//undefined
      console.log(str.charCodeAt(0));//97
      //The default value of index is 0
      console.log(str.charCodeAt());//97
      //NaN is returned when index is out of bounds
      console.log(str.charCodeAt(5));//NaN
    • When the character is in a supplementary plane, codePointAt recognizes it and returns its code point. charCodeAt cannot: it only returns the 2-byte code unit at the given position.

    For example, the treble clef 𝄞 (U+1D11E), a supplementary-plane character, is represented in UTF-16 by two 2-byte code units, 0xd834 and 0xdd1e. When we call charCodeAt on 𝄞, we can only get the code unit at the corresponding position.

    const str = '\ud834\udd1e'; //Supplementary-plane character: treble clef 𝄞
    console.log(str.charCodeAt(0).toString(16)); //d834 
    console.log(str.charCodeAt(1).toString(16)); //dd1e

    When we use codePointAt, we can get the code point 0x1d11e of 𝄞.

    console.log(str.codePointAt(0).toString(16)); //1d11e
    //At index 1 the code unit '\udd1e' is a trailing (low) surrogate with nothing after it to pair with, so it is treated as a lone 2-byte code unit: its own value is returned instead of the code point of '\ud834\udd1e'
    console.log(str.codePointAt(1).toString(16)); //dd1e
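
    Because a supplementary-plane character occupies two index positions, walking a string with codePointAt means skipping the second half of each pair. The following is a minimal sketch of that idea; the helper name toCodePoints is hypothetical and not part of the standard library.

```javascript
// Hypothetical helper (an illustration, not a built-in): collect the code
// points of a string by walking it with codePointAt.
function toCodePoints(str) {
  const result = [];
  let i = 0;
  while (i < str.length) {
    const cp = str.codePointAt(i);
    result.push(cp);
    // A code point above 0xffff occupies two UTF-16 code units,
    // so step over both of them.
    i += cp > 0xffff ? 2 : 1;
  }
  return result;
}

const hex = toCodePoints('a\ud834\udd1eb').map((cp) => cp.toString(16));
console.log(hex); // [ '61', '1d11e', '62' ]
```

    Stepping by two after a large code point is what keeps the loop from returning the lone trailing surrogate seen at index 1 above.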
  2. String.fromCodePoint

    Function type:

    (...codePoints: number[])=> string

    The static function fromCodePoint returns the string that corresponds to the Unicode code points passed in. Unlike fromCharCode, it accepts supplementary-plane code points directly. Taking the treble clef 𝄞 as an example, fromCodePoint can be given the code point 0x1d11e directly, while fromCharCode has to be given the two code units 0xd834 and 0xdd1e.

    console.log(String.fromCodePoint(0x1d11e)); //𝄞
    console.log(String.fromCodePoint(0xd834, 0xdd1e)); //𝄞
    console.log(String.fromCharCode(0x1d11e)); //턞 wrong character: the value is truncated to 16 bits (0xd11e)
    console.log(String.fromCharCode(0xd834, 0xdd1e)); //𝄞

    For basic plane characters, the results of fromCodePoint and fromCharCode are the same.

    console.log(String.fromCodePoint(97)); //'a'
    console.log(String.fromCodePoint(97, 98)); //'ab'
    console.log(String.fromCodePoint()); //''
    console.log(String.fromCharCode(97)); //'a'
    console.log(String.fromCharCode(97, 98)); //'ab'
    console.log(String.fromCharCode()); //''
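
    The relationship between the code point 0x1d11e and the code units 0xd834 and 0xdd1e follows from the UTF-16 encoding arithmetic. Here is a small sketch of that calculation; the helper name toSurrogatePair is hypothetical and is only meant to illustrate what fromCodePoint does for us.

```javascript
// Hypothetical helper (an illustration, not a built-in): split a
// supplementary-plane code point into its UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;
  const high = 0xd800 + (offset >> 10);  // top 10 bits of the offset
  const low = 0xdc00 + (offset & 0x3ff); // bottom 10 bits of the offset
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1d11e);
console.log(high.toString(16), low.toString(16)); // d834 dd1e
console.log(String.fromCharCode(high, low) === String.fromCodePoint(0x1d11e)); // true

// fromCharCode keeps only the low 16 bits of each argument, so 0x1d11e
// becomes 0xd11e, which is a different character.
console.log(String.fromCharCode(0x1d11e) === String.fromCharCode(0xd11e)); // true
```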
  3. String.prototype.normalize

    Function type:

    (form?:'NFC'|'NFD'|'NFKC'|'NFKD')=>string

    The prototype function normalize accepts a normalization form (NFC, NFD, NFKC or NFKD) and returns the normalized string. The default value of the form parameter is 'NFC' (Normalization Form Canonical Composition: the string is decomposed by canonical equivalence and then recomposed by canonical equivalence).

    Unicode provides two ways to represent composite characters (letters that carry an additional mark such as a tone or accent). One is a single precomposed code point. The other combines the base letter with a combining mark and uses two code points. For example, ń is a composite character: it can be written either as the single code point 0x0144 or as the two code points 0x006e and 0x0301.

    const str1 = '\u0144'; //ń
    const str2 = '\u006e\u0301'; //ń
    console.log({
        str1,
        str2,
    });//{ str1: 'ń', str2: 'ń' }

    These two representations are visually and semantically identical; they are canonically equivalent. At the code level, however, they differ: str1 is one code point and str2 is two, which can cause problems.

    console.log(str1.length, str2.length);//1 2
    console.log(str1 === str2);//false

    The normalize function solves this problem. Once both strings have been normalized with normalize, the mismatch no longer occurs.

    let str1 = '\u0144'; //ń
    let str2 = '\u006e\u0301'; //ń
    //Normalization
    str1 = str1.normalize();
    str2 = str2.normalize();
    console.log({
        str1,
        str2,
    }); //{ str1: 'ń', str2: 'ń' }
    
    console.log(str1.length, str2.length); //1 1
    console.log(str1 === str2); //true
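
    The other forms work the same way: 'NFD' decomposes where 'NFC' composes. As a minimal sketch, a comparison that ignores the composed/decomposed difference could look like the following; the helper name canonicallyEqual is hypothetical.

```javascript
const composed = '\u0144';         // one code point
const decomposed = '\u006e\u0301'; // letter n plus combining acute accent

// NFD decomposes, NFC (the default) composes.
console.log(composed.normalize('NFD').length);   // 2
console.log(decomposed.normalize('NFC').length); // 1

// Hypothetical helper (an illustration, not a built-in): compare two
// strings by canonical equivalence.
function canonicallyEqual(a, b) {
  return a.normalize('NFC') === b.normalize('NFC');
}

console.log(canonicallyEqual(composed, decomposed)); // true
console.log(canonicallyEqual('n', composed));        // false
```

    Normalizing both sides inside the comparison leaves the original strings untouched, which may be preferable when the input has to be stored or echoed back unchanged.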
  4. New Unicode escape notation

    Previously, a Unicode character could be written as \u followed by four hexadecimal digits (\uXXXX). ES6 adds a new notation: \u{code point}.

    The difference between the two is straightforward: \u{code point} can express the 4-byte code points of the supplementary planes, while \uXXXX can only express the 2-byte code points of the basic plane.

    //For 2-byte code points in the basic plane, there is no difference between the two
    const str1 = '\u{0144}';
    const str2 = '\u0144';
    console.log(str1 === str2); //true
    //Treble clef
    const str3 = '\u{1d11e}';
    //Wrong notation: this is parsed as two characters, \u1d11 and e
    const str4 = '\u1d11e';
    console.log(str4,str3===str4); //ᴑe false
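
    The new notation only changes how the literal is written, not how the string is stored. A short sketch of what that means in practice follows; the variable name clef is just for illustration, and the spread syntax relies on the ES6 string iterator, which this article does not otherwise cover.

```javascript
const clef = '\u{1d11e}';

// The braces form and the explicit surrogate pair produce the same string.
console.log(clef === '\ud834\udd1e'); // true

// length counts UTF-16 code units, so the single character reports 2.
console.log(clef.length); // 2

// The ES6 string iterator (used by spread and for...of) walks code points.
console.log([...clef].length); // 1
```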

    Unicode really is a headache. If you are not very familiar with Unicode, leave a message in the comment area and I will write another article covering Unicode and JS in detail.

Posted by thetechgeek on Mon, 29 Nov 2021 04:43:24 -0800