Name: proposal-regexp-unicode-sequence-properties
Owner: Ecma TC39
Description: Proposal to add support for sequence properties in Unicode property escapes to ECMAScript regular expressions.
Created: 2018-02-18 17:12:55.0
Updated: 2018-05-24 14:13:40.0
Pushed: 2018-05-24 14:16:15.0
Homepage: https://tc39.github.io/proposal-regexp-unicode-sequence-properties/
Size: 42
Language: HTML
This proposal is at stage 1 of the TC39 process.
The Unicode Standard assigns various properties and property values to every symbol. For example, to get the set of symbols that are used exclusively in the Greek script, search the Unicode database for symbols whose Script property is set to Greek.
Unicode property escapes enable JavaScript developers to access these Unicode character properties natively in ECMAScript regular expressions.
const regexGreekSymbol = /\p{Script=Greek}/u;
regexGreekSymbol.test('π');
// → true
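As a small runnable sketch of how such an escape behaves in engines that already support Unicode property escapes (ES2018 and later): the helper name `isGreek` and the sample strings below are illustrative assumptions, not part of the proposal.

```javascript
// Matches any single code point whose Script property is Greek.
const regexGreekSymbol = /\p{Script=Greek}/u;

// Hypothetical helper: true only if every code point in `text` is Greek.
const isGreek = (text) => /^\p{Script=Greek}+$/u.test(text);

console.log(regexGreekSymbol.test('\u03C0')); // U+03C0 GREEK SMALL LETTER PI
console.log(regexGreekSymbol.test('p'));      // Latin letter, not Greek
console.log(isGreek('\u03B1\u03B2\u03B3'));   // alpha, beta, gamma
console.log(isGreek('abc'));
```

Note that the `u` flag is required; without it, `\p` is not interpreted as a property escape.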
The Unicode properties and values that are currently supported in Unicode property escapes have something in common: they all expand to a list of code points. Such escapes can be transpiled as a character class containing the list of code points they match individually. For example, \p{ASCII_Hex_Digit} is equivalent to [0-9A-Fa-f]: it only ever matches a single Unicode symbol at a time.
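The equivalence between the property escape and its character-class expansion can be checked by brute force. This is only a sketch; the variable names are assumptions chosen for illustration.

```javascript
// The property escape and its hand-written character-class expansion.
const viaProperty = /^\p{ASCII_Hex_Digit}$/u;
const viaClass = /^[0-9A-Fa-f]$/u;

// Compare both patterns against every Unicode code point.
let mismatches = 0;
for (let codePoint = 0; codePoint <= 0x10FFFF; codePoint++) {
  const symbol = String.fromCodePoint(codePoint);
  if (viaProperty.test(symbol) !== viaClass.test(symbol)) {
    mismatches++;
  }
}

console.log(mismatches); // 0 when the two patterns agree everywhere
```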
However, the Unicode Standard defines properties that instead expand to a list of sequences of code points. In regular expressions, such properties translate to a set of alternatives. To illustrate this, imagine a Unicode property that expands to the Unicode code point sequences 'a', 'mn', and 'xyz'. This property translates to the following regular expression pattern: a|mn|xyz. Note how, unlike existing Unicode property escapes, this pattern can match multiple Unicode symbols.
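The difference between a set of alternatives and a character class can be seen with the imaginary property from the paragraph above. The names `reSequences` and `reSingleSymbol` are hypothetical.

```javascript
// Expansion of the imaginary sequence property: a set of alternatives.
const reSequences = /^(?:a|mn|xyz)$/u;

// A character class over the same symbols matches one symbol at a time.
const reSingleSymbol = /^[amnxyz]$/u;

console.log(reSequences.test('mn'));    // the whole two-symbol sequence matches
console.log(reSingleSymbol.test('mn')); // a character class cannot match two symbols
console.log('xyz'.match(/a|mn|xyz/u)[0].length); // length of the matched sequence
```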
A minimal actual example of such a Unicode property is Emoji_Keycap_Sequence. To represent this property in a regular expression, one could use the pattern \x23\uFE0F\u20E3|\x2A\uFE0F\u20E3|\x30\uFE0F\u20E3|\x31\uFE0F\u20E3|\x32\uFE0F\u20E3|\x33\uFE0F\u20E3|\x34\uFE0F\u20E3|\x35\uFE0F\u20E3|\x36\uFE0F\u20E3|\x37\uFE0F\u20E3|\x38\uFE0F\u20E3|\x39\uFE0F\u20E3. Regular expressions for these properties suffer from the same issues that Unicode property escapes solve: they're hard to write or maintain manually, they tend to be large, and they're unreadable. (The Emoji_Keycap_Sequence pattern in particular can be simplified as [\x23\x2A0-9]\uFE0F\u20E3, but even in that form it's hard to decipher.)
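To make the hand-written form concrete, the following sketch builds the twelve keycap sequences and checks them against the simplified pattern. The variable names are assumptions for illustration; the pattern itself is the one given above, anchored with ^ and $.

```javascript
// Simplified hand-written pattern for Emoji_Keycap_Sequence, anchored.
const regexKeycap = /^[\x23\x2A0-9]\uFE0F\u20E3$/u;

// Each keycap sequence is a base character (#, *, 0-9) followed by
// U+FE0F VARIATION SELECTOR-16 and U+20E3 COMBINING ENCLOSING KEYCAP.
const keycapSequences = [...'#*0123456789'].map(
  (base) => base + '\uFE0F\u20E3'
);

console.log(keycapSequences.length);                           // 12
console.log(keycapSequences.every((s) => regexKeycap.test(s))); // all sequences match
console.log(regexKeycap.test('4'));                             // a bare digit is not a keycap
```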
We propose the addition of Unicode sequence properties to the existing Unicode property escapes syntax.
With this feature, the above regular expression could be written as:
const regexEmojiKeycap = /\p{Emoji_Keycap_Sequence}/u;
regexEmojiKeycap.test('4\uFE0F\u20E3');
// → true
We propose to support the following Unicode sequence properties defined in Unicode TR51:
Emoji_Combining_Sequence
Emoji_Flag_Sequence
Emoji_Keycap_Sequence
Emoji_Modifier_Sequence
Emoji_Tag_Sequence
Emoji_ZWJ_Sequence
Re-using the existing Unicode property escapes syntax for this new functionality seems appropriate:
\p{UnicodeSequencePropertyName}
The negated \P{…} form is not supported for sequence properties as it would be a footgun. It's not generally useful, and is better expressed as a negative lookahead. Compare the unsupported /\P{UnicodeSequenceProperty}/u (what should it do?) with /(?!\p{UnicodeSequenceProperty})/u (clear what it does).
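The negative-lookahead form can be tried today by substituting a hand-written pattern for the sequence property. In this sketch, `keycapSource` stands in for \p{Emoji_Keycap_Sequence}; both names are assumptions made for illustration.

```javascript
// Hand-written stand-in for \p{Emoji_Keycap_Sequence}.
const keycapSource = '[\\x23\\x2A0-9]\\uFE0F\\u20E3';

// Succeeds only if the input does NOT start with a keycap sequence.
const reNoKeycapAhead = new RegExp(`^(?!${keycapSource})`, 'u');

console.log(reNoKeycapAhead.test('4\uFE0F\u20E3')); // starts with a keycap sequence
console.log(reNoKeycapAhead.test('42'));            // plain digits, no keycap
```

The lookahead consumes no input, so its meaning stays clear regardless of how many code points the sequence spans.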
Given that UnicodeSequencePropertyName expands to a list of sequences of Unicode code points, the proposal includes a static restriction that bans such properties within character classes.
Unicode property escapes for unsupported Unicode properties throw an early SyntaxError. As such, we can add support for new properties in a backwards-compatible way, as long as we re-use the existing syntax.
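Because unsupported properties are a SyntaxError, support for a given property can be detected at run time. The function name `supportsPropertyEscape` is hypothetical, and the sketch uses the RegExp constructor so that the error is thrown when the function runs rather than when the script is parsed.

```javascript
// Hypothetical helper: reports whether the engine accepts \p{<property>}
// in a regular expression with the `u` flag.
function supportsPropertyEscape(property) {
  try {
    new RegExp(`\\p{${property}}`, 'u');
    return true;
  } catch (error) {
    if (error instanceof SyntaxError) {
      return false;
    }
    throw error;
  }
}

console.log(supportsPropertyEscape('Script=Greek'));
console.log(supportsPropertyEscape('Definitely_Not_A_Property'));
// Result depends on whether the engine implements this proposal:
console.log(supportsPropertyEscape('Emoji_Keycap_Sequence'));
```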
Currently, each property escape expands to a list of code points. As such, their meaning is clear and unambiguous, even within a character class. For example, the following regular expression matches either a Letter, a Number, or an underscore:
const re = /[\p{Letter}\p{Number}_]/u;
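A quick sketch of the single-code-point behavior of such a character class; the anchored variant `reSingle` and the sample inputs are assumptions chosen for illustration.

```javascript
// Anchored so that the class must account for the entire input.
const reSingle = /^[\p{Letter}\p{Number}_]$/u;

console.log(reSingle.test('\u00E9'));    // U+00E9, a Letter
console.log(reSingle.test('\u0663'));    // U+0663 ARABIC-INDIC DIGIT THREE, a Number
console.log(reSingle.test('_'));         // the underscore
console.log(reSingle.test('\u{1D400}')); // one astral code point, two UTF-16 code units
console.log(reSingle.test('ab'));        // two code points: the class matches only one
```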
For the new properties introduced by this proposal, the expected behavior within character classes is unclear. A character class, when matched, always produces only a single character. Allowing sequence properties within character classes would change that, for no good reason.
const re = /[\p{Emoji_Flag_Sequence}_a-z]/u;
// What should this do?
// If the goal is to match either `\p{Emoji_Flag_Sequence}` or `_` or
// `[a-z]`, one could still use `|`:
const re = /\p{Emoji_Flag_Sequence}|[a-z_]/u;
To avoid confusion, the proposal throws a SyntaxError exception when sequence properties are used within character classes.
Per UTR51 ED-26, the term "emoji sequences" refers to emoji flag sequences, emoji tag sequences, and emoji ZWJ sequences. With this proposal, emoji sequences can be represented as a RegExp pattern in JavaScript:
const reEmojiSequence = /\p{Emoji_Flag_Sequence}|\p{Emoji_Tag_Sequence}|\p{Emoji_ZWJ_Sequence}/u;
This proposal makes it possible to match all emoji, regardless of whether they consist of sequences or not:
const reEmoji = /\p{Emoji_Modifier_Base}\p{Emoji_Modifier}?|\p{Emoji_Presentation}|\p{Emoji}\uFE0F|\p{Emoji_Flag_Sequence}|\p{Emoji_Tag_Sequence}|\p{Emoji_ZWJ_Sequence}/gu;
This regular expression matches, from left to right:
emoji with optional modifiers (\p{Emoji_Modifier_Base}\p{Emoji_Modifier}? per ED-13);
emoji with default emoji presentation (\p{Emoji_Presentation} per ED-6);
emoji whose emoji presentation is requested with the variation selector U+FE0F (\p{Emoji}\uFE0F per ED-9a);
emoji sequences (\p{Emoji_Flag_Sequence}|\p{Emoji_Tag_Sequence}|\p{Emoji_ZWJ_Sequence}, as discussed above).
An equivalent regular expression without the use of property escapes is ~7 KB in size. With property escapes, but without sequence property support, the size is still ~4.5 KB. The abovementioned regular expression with sequence properties takes up 155 bytes.
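The first three alternatives only use properties that expand to code points, so they already run in engines with ES2018 property escapes. The sketch below keeps only those alternatives (the sequence properties are omitted because they would be a SyntaxError without this proposal) to show what is missing: a flag is found as two separate matches instead of one. The variable names are assumptions for illustration.

```javascript
// The non-sequence part of the emoji pattern.
const reEmojiWithoutSequences =
  /\p{Emoji_Modifier_Base}\p{Emoji_Modifier}?|\p{Emoji_Presentation}|\p{Emoji}\uFE0F/gu;

// U+1F44D THUMBS UP SIGN followed by U+1F3FD (a skin tone modifier).
const thumbsUpMediumSkin = '\u{1F44D}\u{1F3FD}';
// Regional indicators N and L, which together form a flag sequence.
const flagNL = '\u{1F1F3}\u{1F1F1}';

// Base plus modifier is matched as one unit by the first alternative.
console.log(thumbsUpMediumSkin.match(reEmojiWithoutSequences).length);
// Each regional indicator matches on its own, so the flag is split in two.
console.log(flagNL.match(reEmojiWithoutSequences).length);
```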
Unicode® Standard Annex #31 defines hashtag identifiers in two forms.
The Default Hashtag Identifier Syntax (UAX31-D2) translates to the following JavaScript regular expression:
const reHashtag = /[#\uFF03]\p{XID_Continue}+/u;
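As a usage sketch of the default syntax, which runs in engines with ES2018 property escapes. The name `reDefaultHashtag` and the sample strings are assumptions.

```javascript
// Default Hashtag Identifier Syntax: '#' or U+FF03 FULLWIDTH NUMBER SIGN,
// followed by one or more XID_Continue code points.
const reDefaultHashtag = /[#\uFF03]\p{XID_Continue}+/u;

// The match stops at the first code point that is not XID_Continue.
console.log('Learn #JavaScript today'.match(reDefaultHashtag)[0]);
// The fullwidth number sign and non-Latin letters are accepted too.
console.log('\uFF03\u65E5\u672C\u8A9E'.match(reDefaultHashtag)[0]);
// A lone number sign is not a hashtag.
console.log('# alone'.match(reDefaultHashtag));
```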
However, the Extended Hashtag Identifier Syntax (UAX31-R8) currently cannot trivially be expressed as a JavaScript regular expression, as it includes emoji. An approximation without emoji sequence support would be:
// This matches *some* emoji, but not those consisting of sequences.
const reHashtag = /[#\uFF03][\p{XID_Continue}_\p{Emoji}]+/u;
The above pattern matches some emoji, but not those consisting of sequences. With the proposed feature, however, fully implementing the UAX31-R8 syntax becomes feasible:
const reHashtag = /[#\uFF03](?:[\p{XID_Continue}_\p{Emoji}]|\p{Emoji_Flag_Sequence}|\p{Emoji_Tag_Sequence}|\p{Emoji_ZWJ_Sequence})+/u;
An equivalent regular expression without the use of property escapes is ~12 KB in size. With property escapes, but without sequence property support, the size is still ~3 KB. The abovementioned regular expression with sequence properties takes up 115 bytes.