This article introduces several new ES6 functions for working with Unicode strings.
String.prototype.codePointAt
Function type:
(index?: number) => number | undefined
codePointAt is a prototype function that returns the code point of the character at the given index of the string. It recognizes 4-byte characters in UTF-16 (surrogate pairs), so it covers a wider range than the prototype function charCodeAt, which only handles the 2-byte characters of the basic plane (BMP). In addition, when the index is out of bounds, codePointAt returns undefined, whereas charCodeAt returns NaN.
Apart from these two points, codePointAt and charCodeAt behave the same:
- The default value of the index parameter is 0.
- When the character is in the basic plane, both return the same result.
    const str = 'abc';
    // The character 'a' is in the basic plane
    console.log(str.codePointAt(0)); // 97
    // The default value of index is 0
    console.log(str.codePointAt()); // 97
    // undefined is returned when the index is out of bounds
    console.log(str.codePointAt(5)); // undefined
    console.log(str.charCodeAt(0)); // 97
    // The default value of index is 0
    console.log(str.charCodeAt()); // 97
    // NaN is returned when the index is out of bounds
    console.log(str.charCodeAt(5)); // NaN
When the character is in a supplementary plane, codePointAt recognizes it and returns its code point. charCodeAt cannot: it only returns the 2-byte code unit at the given position.
For example, the supplementary-plane treble clef character 𝄞 is represented by two 2-byte code units, 0xd834 and 0xdd1e. When we call charCodeAt on 𝄞, we only get the code unit at the corresponding position.
    const str = '\ud834\udd1e'; // Supplementary-plane treble clef character 𝄞
    console.log(str.charCodeAt(0).toString(16)); // d834
    console.log(str.charCodeAt(1).toString(16)); // dd1e
When we use codePointAt, we get the code point 0x1d11e of 𝄞.
    console.log(str.codePointAt(0).toString(16)); // 1d11e
    // At index 1, '\udd1e' is a low (trailing) surrogate with no code unit after it,
    // so it is treated as a lone 2-byte code unit rather than the start of a pair:
    // only the value of '\udd1e' is returned, not the code point of '\ud834\udd1e'
    console.log(str.codePointAt(1).toString(16)); // dd1e
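Because a supplementary-plane character occupies two positions in the string, walking a string by code point means skipping the second half of each pair. The following is a minimal sketch of that idea; the helper name codePoints is hypothetical and not part of ES6.

```javascript
// Hypothetical helper (not a built-in): collect the code points of a string.
function codePoints(str) {
  const result = [];
  let i = 0;
  while (i < str.length) {
    const cp = str.codePointAt(i);
    result.push(cp);
    // A code point above 0xFFFF occupies two UTF-16 code units, so skip both.
    i += cp > 0xffff ? 2 : 1;
  }
  return result;
}

console.log(codePoints('a\ud834\udd1eb').map((cp) => cp.toString(16)));
// [ '61', '1d11e', '62' ]
```

Without the two-step skip, the loop would also report the lone low surrogate dd1e as if it were a separate character.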
String.fromCodePoint
Function type:
(...codePoints: number[])=> string
The static function fromCodePoint returns the string corresponding to the Unicode code points passed in. Compared with fromCharCode, it accepts supplementary-plane code points directly. Taking the treble clef symbol 𝄞 as an example, fromCodePoint can be given the code point 0x1d11e directly, while fromCharCode has to be given the two code units 0xd834 and 0xdd1e.
    console.log(String.fromCodePoint(0x1d11e)); // 𝄞
    console.log(String.fromCodePoint(0xd834, 0xdd1e)); // 𝄞
    // fromCharCode truncates 0x1d11e to 16 bits (0xd11e), so the wrong character is printed
    console.log(String.fromCharCode(0x1d11e)); // the character U+D11E, not 𝄞
    console.log(String.fromCharCode(0xd834, 0xdd1e)); // 𝄞
For basic plane characters, the results of fromCodePoint and fromCharCode are the same.
    console.log(String.fromCodePoint(97)); // 'a'
    console.log(String.fromCodePoint(97, 98)); // 'ab'
    console.log(String.fromCodePoint()); // ''
    console.log(String.fromCharCode(97)); // 'a'
    console.log(String.fromCharCode(97, 98)); // 'ab'
    console.log(String.fromCharCode()); // ''
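To see where the two code units 0xd834 and 0xdd1e come from, the UTF-16 surrogate arithmetic can be written out by hand. This is only an illustrative sketch: the helper name toSurrogatePair is hypothetical, and it assumes the input is a supplementary-plane code point (greater than 0xFFFF).

```javascript
// Hypothetical helper (not a built-in): split a supplementary-plane code point
// into its UTF-16 high and low surrogates. Assumes codePoint > 0xFFFF.
function toSurrogatePair(codePoint) {
  const offset = codePoint - 0x10000;      // 20-bit value
  const high = 0xd800 + (offset >> 10);    // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff);   // bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x1d11e);
console.log(high.toString(16), low.toString(16)); // d834 dd1e
console.log(String.fromCharCode(high, low) === String.fromCodePoint(0x1d11e)); // true
```

In practice fromCodePoint does this splitting for you, which is why it is the more convenient of the two functions for supplementary-plane characters.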
String.prototype.normalize
Function type:
(form?: 'NFC' | 'NFD' | 'NFKC' | 'NFKD') => string
The prototype function normalize accepts a normalization form as the form parameter and returns the normalized string. The default value of form is 'NFC' (Normalization Form Canonical Composition: the string is decomposed by canonical equivalence and then recomposed by canonical equivalence).
Unicode provides two ways to represent composite characters (letters carrying an additional mark, such as an accent). One is a single code point; the other combines the base letter with the additional mark, using two code points. For example, ń is a composite character: it can be written either as the single code point 0x0144 or as the two code points 0x006e and 0x0301.
    const str1 = '\u0144'; // ń
    const str2 = '\u006e\u0301'; // ń
    console.log({ str1, str2 }); // { str1: 'ń', str2: 'ń' }
These two representations look the same and mean the same; they are canonically equivalent. At the code level, however, they differ: str1 is one code point and str2 is two, which can lead to problems.
    console.log(str1.length, str2.length); // 1 2
    console.log(str1 === str2); // false
The normalize function exists to solve this problem. Once both strings have been normalized with normalize, the mismatch no longer occurs.
    let str1 = '\u0144'; // ń
    let str2 = '\u006e\u0301'; // ń
    // Normalization
    str1 = str1.normalize();
    str2 = str2.normalize();
    console.log({ str1, str2 }); // { str1: 'ń', str2: 'ń' }
    console.log(str1.length, str2.length); // 1 1
    console.log(str1 === str2); // true
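A common use of this is comparing strings without caring which representation they arrived in. The sketch below wraps that comparison in a helper; the name canonicallyEqual is hypothetical, and choosing 'NFC' for both sides is just one reasonable option (any single form applied to both strings would work).

```javascript
// Hypothetical helper (not a built-in): compare two strings after bringing
// both to the same normalization form.
function canonicallyEqual(a, b) {
  return a.normalize('NFC') === b.normalize('NFC');
}

const composed = '\u0144';         // one code point
const decomposed = '\u006e\u0301'; // base letter + combining acute accent

console.log(composed === decomposed);                 // false
console.log(canonicallyEqual(composed, decomposed));  // true
// 'NFD' goes the other way and splits the composed form into two code points
console.log(composed.normalize('NFD').length);        // 2
console.log(decomposed.normalize('NFC').length);      // 1
```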
New Unicode representation
Previously, a Unicode character could be written as \u followed by a four-digit hexadecimal code point (\uXXXX). ES6 adds a new representation, \u{code point}.
The difference between the two is easy to guess: \u{code point} can express the 4-byte code points of the supplementary planes, while \uXXXX only supports the 2-byte code points of the basic plane.
    // For 2-byte code points in the basic plane, there is no difference between the two
    const str1 = '\u{0144}';
    const str2 = '\u0144';
    console.log(str1 === str2); // true
    // Treble clef symbol
    const str3 = '\u{1d11e}';
    // Wrong representation: it is parsed as two characters, \u1d11 and e
    const str4 = '\u1d11e';
    console.log(str4, str3 === str4); // ᴑe false
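As a small illustration of how the new syntax relates to the old one, the sketch below checks that \u{1d11e} is the same string as the surrogate pair written with two old-style escapes, and prints a string back out in \u{...} form. The helper name toCodePointEscapes is hypothetical, and it relies on for...of iterating a string by code point, which is ES6 behavior not covered in the text above.

```javascript
// The new escape and the old-style surrogate pair produce the same string.
console.log('\u{1d11e}' === '\ud834\udd1e'); // true
console.log('\u{1d11e}'.length);             // 2

// Hypothetical helper (not a built-in): render each character as a \u{...} escape.
function toCodePointEscapes(str) {
  let out = '';
  // for...of walks the string by code point, so a surrogate pair is one step.
  for (const ch of str) {
    out += '\\u{' + ch.codePointAt(0).toString(16) + '}';
  }
  return out;
}

console.log(toCodePointEscapes('a\u{1d11e}')); // \u{61}\u{1d11e}
```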
Unicode really is a headache. If you are not familiar with Unicode, leave a message in the comment area and I will write another article covering Unicode and JS in detail.