tc39/proposal-intl-segmenter

Name: proposal-intl-segmenter

Owner: Ecma TC39

Description: Unicode text segmentation for ECMAScript

Created: 2016-08-26 05:40:33.0

Updated: 2018-05-01 15:38:25.0

Pushed: 2018-05-01 15:39:14.0

Homepage: https://tc39.github.io/proposal-intl-segmenter/

Size: 45

Language: HTML

README

Intl.Segmenter: Unicode segmentation in JavaScript

Stage 2 proposal, champion Daniel Ehrenberg (Igalia)

Motivation

A code point is not a “letter” or a displayed unit on the screen. That designation goes to the grapheme, which can consist of multiple code points (e.g., including accent marks, conjoining Korean characters). Unicode defines a grapheme segmentation algorithm to find the boundaries between graphemes. This may be useful in implementing advanced editors/input methods, or other forms of text processing.
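
The gap between code units, code points and displayed units can be checked directly in JavaScript. The snippet below is a minimal sketch; the sample strings and the helper names codeUnits and codePoints are chosen here for illustration. Each sample is displayed as a single unit on screen, yet neither count reflects that.

```javascript
// "e" followed by U+0301 COMBINING ACUTE ACCENT: two code points, displayed as one accented letter.
const accented = "e\u0301";

// Conjoining Korean jamo (U+1112, U+1161, U+11AB): three code points, displayed as one syllable block.
const hangul = "\u1112\u1161\u11AB";

// U+1F600 lies outside the Basic Multilingual Plane: one code point, two UTF-16 code units.
const emoji = "\u{1F600}";

// String.prototype.length counts UTF-16 code units.
const codeUnits = (s) => s.length;

// The string iterator yields one entry per code point.
const codePoints = (s) => [...s].length;

for (const s of [accented, hangul, emoji]) {
  console.log(`${JSON.stringify(s)}: ${codeUnits(s)} code units, ${codePoints(s)} code points`);
}
```

Counting displayed units correctly requires the grapheme segmentation algorithm described here, not either of these counts.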

Unicode also defines an algorithm for finding breaks between words and sentences, which CLDR tailors per locale. These boundaries may be useful, for example, in implementing a text editor which has commands for jumping or highlighting words and sentences. There is an analogous algorithm for opportunities for line breaking.

Grapheme, word and sentence segmentation are defined in UAX 29. Line breaking is defined in UAX 14. Web browsers need an implementation of both kinds of segmentation to function, and exposing it to JavaScript saves memory and network bandwidth compared with expecting developers to implement it themselves in JavaScript.

Chrome has been shipping its own nonstandard segmentation API called Intl.v8BreakIterator for a few years. However, for a few reasons, this API does not seem suitable for standardization. This explainer outlines a new API which attempts to be more in accordance with modern, post-ES2015 JavaScript API design.

Example
// Create a segmenter in your locale
let segmenter = new Intl.Segmenter("fr", {granularity: "word"});

// Get an iterator over a string
let iterator = segmenter.segment("Ceci n'est pas une pipe");

// Iterate over it!
for (let {segment, breakType} of iterator) {
  console.log(`segment: ${segment} breakType: ${breakType}`);
  break;
}

// Logs the following to the console:
// segment: Ceci breakType: letter

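
For comparison, here is a sketch of the same walk written against the form of Intl.Segmenter that was later standardized in ECMA-402. It assumes a runtime that ships that version. The iteration results there carry segment, index and isWordLike instead of the breakType field described in this explainer, so this is not the Stage 2 API itself.

```javascript
// Assumption: the runtime ships the standardized Intl.Segmenter (ECMA-402),
// whose iteration results differ from the Stage 2 draft described here.
const segmenter = new Intl.Segmenter("fr", { granularity: "word" });
const input = "Ceci n'est pas une pipe";

// segment() returns an iterable; each item carries the substring and its start index.
const segments = [...segmenter.segment(input)];

for (const { segment, index, isWordLike } of segments) {
  console.log(`segment: ${JSON.stringify(segment)} index: ${index} isWordLike: ${isWordLike}`);
}
```

Joining the segment fields back together reproduces the input string, since the segments cover it without gaps or overlap.
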
API

A polyfill exists for a historical snapshot of this proposal.

new Intl.Segmenter(locale, options)

Interpretation of options:

Intl.Segmenter.prototype.segment(string)

This method creates a new %SegmentIterator% over the input string, which will lazily find breaks, starting at position 0.

%SegmentIterator%

This class iterates over segment boundaries of a particular string.

Methods on %SegmentIterator%:
%SegmentIterator%.prototype.next()

The next method finds the next boundary and returns an IterationResult, where the value is an object with fields segment and breakType. The segment contains the substring between the previous break location and the newly found break location; the breakType describes which sort of segment it is (TODO: define possible values, not part of UTS). This method provides iteration protocol support for SegmentIterators, and is present for convenience; other methods expose a richer API.
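
The shape of next() can be illustrated with a toy model. The class below is a sketch only: ToySegmentIterator is a hypothetical name, boundaries are found with a naive whitespace rule instead of the UAX 29 algorithm, and the breakType labels are placeholders, since the possible values are still a TODO in this proposal.

```javascript
// Toy model only: boundaries fall between runs of whitespace and runs of
// non-whitespace. A real implementation would follow UAX 29 and CLDR.
class ToySegmentIterator {
  constructor(string) {
    this.string = string;
    this.position = 0; // index of the most recently found break
  }

  // Finds the next boundary and returns an IterationResult.
  next() {
    const start = this.position;
    if (start >= this.string.length) {
      return { value: undefined, done: true };
    }
    const isSpace = (ch) => /\s/.test(ch);
    const startsWithSpace = isSpace(this.string[start]);
    let end = start + 1;
    while (end < this.string.length && isSpace(this.string[end]) === startsWithSpace) {
      end++;
    }
    this.position = end;
    return {
      value: {
        segment: this.string.slice(start, end),
        // Hypothetical labels; the proposal has not defined the possible values.
        breakType: startsWithSpace ? "none" : "letter",
      },
      done: false,
    };
  }

  // Returning `this` makes the iterator usable in for...of loops and spreads.
  [Symbol.iterator]() {
    return this;
  }
}

const results = [...new ToySegmentIterator("Ceci n'est pas")];
console.log(results.map(({ segment, breakType }) => `${JSON.stringify(segment)}: ${breakType}`));
```

Because next() follows the iteration protocol, the same object works with for...of, spread syntax and destructuring without any extra adapter.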

%SegmentIterator%.prototype.following(index)

Move the iterator to the next break position after the code unit index given by the index argument, or if no index is provided, after its current position. Returns true if the end of the string was reached.

%SegmentIterator%.prototype.preceding(index)

Move the iterator to the previous break position before the code unit index given by the index argument, or if no index is provided, before its current position. Returns true if the beginning of the string was reached.

get %SegmentIterator%.prototype.position

Return the index of the most recently discovered break position, as an offset from the beginning of the string. Initially the position is 0.

get %SegmentIterator%.prototype.breakType

The breakType of the most recently discovered segment. If there is no current segment (e.g., a just-instantiated SegmentIterator, or one which has reached the end), or if the break type is “grapheme”, then this will be undefined.
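
The navigation semantics of following, preceding and position can be sketched with a toy cursor. This is an illustration under stated assumptions: ToyBreakCursor is a hypothetical class, the break positions are supplied by hand instead of being computed from UAX 29, breakType is left out, and "reached the end" is interpreted as "the new position equals the string length".

```javascript
// Toy model: break positions are supplied up front instead of being computed
// from UAX 29 rules, so only the navigation semantics are illustrated.
class ToyBreakCursor {
  constructor(string, breaks) {
    this.string = string;
    this.breaks = breaks; // sorted code unit offsets, including 0 and string.length
    this.current = 0;     // initially the position is 0
  }

  get position() {
    return this.current;
  }

  // Moves to the first break strictly after `index` (default: current position).
  // Returns true if the end of the string was reached.
  following(index = this.current) {
    const next = this.breaks.find((b) => b > index);
    this.current = next === undefined ? this.string.length : next;
    return this.current >= this.string.length;
  }

  // Moves to the last break strictly before `index` (default: current position).
  // Returns true if the beginning of the string was reached.
  preceding(index = this.current) {
    const earlier = this.breaks.filter((b) => b < index);
    this.current = earlier.length === 0 ? 0 : earlier[earlier.length - 1];
    return this.current === 0;
  }
}

const text = "Ceci n'est pas";
const cursor = new ToyBreakCursor(text, [0, 4, 5, 10, 11, 14]);

const reachedEndFirst = cursor.following();    // moves from 0 to 4
const positionAfterFirst = cursor.position;
const reachedEndJump = cursor.following(11);   // jumps to the break after offset 11
const positionAfterJump = cursor.position;
const reachedStart = cursor.preceding(5);      // moves back to the break before offset 5
const positionAfterBack = cursor.position;
console.log({ positionAfterFirst, positionAfterJump, positionAfterBack });
```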

For most programmers, the most important differences may be

FAQ

Q: Why should we pass a locale and options bag for grapheme breaks? Isn't there just one way to do it?

A: The situation is a little more complicated, e.g., for Indic scripts. Work is ongoing to support grapheme break options for these scripts better; see this bug, and in particular this CLDR wiki page. It seems that CLDR/ICU do not support this yet, but support is planned.

Q: Shouldn't we be putting new APIs in built-in modules?

A: If built-in modules are available before this proposal reaches Stage 3, that sounds like a good option. However, so far the idea in TC39 has been not to block either thing on the other. Built-in modules still have some big questions to resolve, e.g., how/whether polyfills should interact with them.

Q: Why is hyphenation not included?

A: Hyphenation is expected to have a different sort of API shape for various reasons:

Q: Why is this API stateful?

A: It would be possible to make a stateless API without a SegmentIterator, where instead a Segmenter has two methods, each taking a string and an offset, for finding the next break before or after that offset. Each method would return an object {breakType, position} similar to what next() returns in this API. However, there are a few downsides to this approach:

It is easy to create a stateless API based on this stateful one, or vice versa, in user JavaScript code.
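
As a rough sketch of the stateful-to-stateless direction, the wrappers below create a throwaway iterator on every call. Everything named here is hypothetical: makeToyIterator stands in for the iterator that segmenter.segment(string) would return, with hand-written break positions, and the wrappers return only a position where the stateless design discussed above would also return a breakType.

```javascript
// Stand-in for the stateful iterator: a hypothetical factory with fixed break
// positions, exposing only following(), preceding() and position.
function makeToyIterator(breaks) {
  const first = breaks[0];
  const last = breaks[breaks.length - 1];
  let position = first;
  return {
    get position() {
      return position;
    },
    following(index = position) {
      const next = breaks.find((b) => b > index);
      position = next === undefined ? last : next;
      return position === last;
    },
    preceding(index = position) {
      const earlier = breaks.filter((b) => b < index);
      position = earlier.length === 0 ? first : earlier[earlier.length - 1];
      return position === first;
    },
  };
}

// Stateless wrappers: each call builds a fresh iterator, so no state
// survives between calls.
function breakAfter(createIterator, offset) {
  const iterator = createIterator();
  iterator.following(offset);
  return iterator.position;
}

function breakBefore(createIterator, offset) {
  const iterator = createIterator();
  iterator.preceding(offset);
  return iterator.position;
}

// Break positions for "Ceci n'est pas", written out by hand.
const createIterator = () => makeToyIterator([0, 4, 5, 10, 11, 14]);
const after = breakAfter(createIterator, 6);
const before = breakBefore(createIterator, 6);
console.log({ after, before });
```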

Q: Why is this an Intl API instead of String methods?

A: All of these break types are actually locale-dependent, and some allow complex options. The result of the segment method is a SegmentIterator. For many non-trivial cases like this, analogous APIs are put in ECMA-402's Intl object. This allows for the work that happens on each instantiation to be shared, improving performance. We could make a convenience method on String as a follow-on proposal.
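
The construct-once, reuse-many pattern can be seen in an existing ECMA-402 API. The sketch below uses Intl.Collator as the analogous case; it does not use the Segmenter proposal itself, and the word list is sample data chosen for illustration.

```javascript
// Intl.Collator is an existing ECMA-402 API that follows the same pattern:
// locale and options are resolved once, at construction time.
const collator = new Intl.Collator("fr", { sensitivity: "base" });

// The same instance is then reused for every comparison made by the sort.
const words = ["pipe", "Ceci", "une", "pas"];
const sorted = [...words].sort(collator.compare);
console.log(sorted);
```

A String convenience method would need to resolve the locale and options on every call unless it cached them, which is the cost the Intl object design avoids.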

