Intl.Segmenter: Unicode segmentation in JavaScript

Translator's preface



This is a translation of the explainer for the Intl.Segmenter proposal, which is likely to be added to an upcoming edition of the ECMAScript specification.



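The proposal is already implemented in V8 and is available without a flag starting with version 8.7 (more precisely, 8.7.38 and later). It can therefore be tested in Google Chrome Canary (version 87.0.4252.0 and later) or in Node.js V8 Canary (version v15.0.0-v8-canary202009025a2ca762b8 and later; Windows binaries are available starting with v15.0.0-v8-canary202009173b56586162).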



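If you test in earlier versions with the --harmony-intl-segmenter flag, be careful: the specification has changed, and the implementation behind the flag may be outdated. Check your results against the output shown in the code examples.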



After the translation, there is a list of links to material on the causes of the problems this proposal addresses.






Intl.Segmenter: Unicode segmentation in JavaScript



The proposal is at Stage 3 and is championed by Richard Gibson.



Motivation



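A code point is not a «letter» or a unit displayed on the screen. That role belongs to the grapheme, which can consist of several code points (for example, with accent marks or conjoining Korean characters). Unicode defines a grapheme segmentation algorithm for finding the boundaries between graphemes. This can be useful when implementing advanced editors or input methods, or in other kinds of text processing.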



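Unicode also defines an algorithm for finding the boundaries between words and sentences, which CLDR (Common Locale Data Repository) tailors per locale. These boundaries can be useful, for example, when implementing a text editor that has commands for jumping between, or highlighting, words and sentences.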



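Grapheme, word, and sentence segmentation is defined in UAX 29. Web browsers need an implementation of this kind of segmentation in order to function, and exposing it to JavaScript saves memory and network bandwidth compared with expecting developers to implement it themselves.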



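Chrome has shipped its own nonstandard segmentation API, Intl.v8BreakIterator, for several years. For a number of reasons, however, that API does not seem suitable for standardization. This explainer outlines a new API that tries to follow modern JavaScript API design, as practiced since ES2015.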







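Examples



Objects returned by the segment() method of an Intl.Segmenter instance find boundaries and expose the segments between them through the Iterable interface.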



// Create a locale-specific word segmenter.
let segmenter = new Intl.Segmenter("fr", {granularity: "word"});

// Use it to get the segments of a string.
let input = "Moi?  N'est-ce pas.";
let segments = segmenter.segment(input);

// Use that for segmentation!
for (let {segment, index, isWordLike} of segments) {
  console.log("segment at code units [%d, %d): «%s»%s",
    index, index + segment.length,
    segment,
    isWordLike ? " (word-like)" : ""
  );
}

// console.log output:
// segment at code units [0, 3): «Moi» (word-like)
// segment at code units [3, 4): «?»
// segment at code units [4, 6): «  »
// segment at code units [6, 11): «N'est» (word-like)
// segment at code units [11, 12): «-»
// segment at code units [12, 14): «ce» (word-like)
// segment at code units [14, 15): « »
// segment at code units [15, 18): «pas» (word-like)
// segment at code units [18, 19): «.»
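
As a further illustration of iteration, the following sketch collects only the word-like segments of the same string. The helper name extractWords is hypothetical and is not part of the proposal; it only combines the segment() iteration and the isWordLike property shown above.

```javascript
// Hypothetical helper (not part of the proposal): return the word-like
// segments of a text, skipping whitespace and punctuation.
function extractWords(text, locale) {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  const words = [];
  for (const { segment, isWordLike } of segmenter.segment(text)) {
    if (isWordLike) words.push(segment);
  }
  return words;
}

const words = extractWords("Moi?  N'est-ce pas.", "fr");
console.log(words.join(" | ")); // Moi | N'est | ce | pas
console.log(words.length);      // 4
```

Counting or indexing words this way avoids splitting on whitespace by hand, which would leave punctuation attached to the words.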


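In addition, the API provides direct access to information about the segment that contains a given code unit index.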



// ┃0 1 2 3 4 5┃6┃7┃8┃9
// ┃A l l o n s┃-┃y┃!┃
let input = "Allons-y!";

let segmenter = new Intl.Segmenter("fr", {granularity: "word"});
let segments = segmenter.segment(input);
let current = undefined;

current = segments.containing(0)
// → { index: 0, segment: "Allons", isWordLike: true }

current = segments.containing(5)
// → { index: 0, segment: "Allons", isWordLike: true }

current = segments.containing(6)
// → { index: 6, segment: "-", isWordLike: false }

current = segments.containing(current.index + current.segment.length)
// → { index: 7, segment: "y", isWordLike: true }

current = segments.containing(current.index + current.segment.length)
// → { index: 8, segment: "!", isWordLike: false }

current = segments.containing(current.index + current.segment.length)
// → undefined
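
One possible use of random access is finding the word under a cursor position. The sketch below assumes a hypothetical helper named wordAt, which is not part of the proposal and relies only on containing() and the segment data shown above.

```javascript
// Hypothetical helper (not part of the proposal): return the word that
// covers the given code unit position, or null if there is none.
function wordAt(text, position, locale = "fr") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  const hit = segmenter.segment(text).containing(position);
  // containing() returns undefined for out-of-bounds positions;
  // punctuation and whitespace segments are not word-like.
  if (hit === undefined || !hit.isWordLike) return null;
  return {
    word: hit.segment,
    start: hit.index,
    end: hit.index + hit.segment.length,
  };
}

console.log(wordAt("Allons-y!", 2)); // word "Allons", start 0, end 6
console.log(wordAt("Allons-y!", 6)); // null (the hyphen is not word-like)
console.log(wordAt("Allons-y!", 9)); // null (out of bounds)
```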


API






new Intl.Segmenter(locale, options)



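Creates a new locale-dependent Segmenter. If options is provided, it is treated as an object whose granularity property specifies the segmenter granularity: "grapheme", "word", or "sentence". The default is "grapheme".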
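
A minimal sketch of how the granularity option changes the result. The helper countSegments and the sample strings are illustrative assumptions, not part of the proposal.

```javascript
// Hypothetical helper: count the segments of a text at a given granularity.
function countSegments(text, granularity, locale = "en") {
  const segmenter = new Intl.Segmenter(locale, { granularity });
  return [...segmenter.segment(text)].length;
}

// "e" followed by a combining acute accent: two code points, one grapheme.
const accented = "cafe\u0301";
console.log(accented.length);                      // 5 code units
console.log(countSegments(accented, "grapheme"));  // 4 graphemes

// Three emoji joined by zero-width joiners form a single grapheme.
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}";
console.log(family.length);                        // 8 code units
console.log(countSegments(family, "grapheme"));    // 1 grapheme

console.log(countSegments("Hello world. How are you?", "sentence")); // 2

// Without options, the granularity defaults to "grapheme".
const defaultGranularity = new Intl.Segmenter("en").resolvedOptions().granularity;
console.log(defaultGranularity); // grapheme
```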



Intl.Segmenter.prototype.segment(string)



Creates a new Iterable %Segments% instance for the input string, using the Segmenter's locale and granularity.





Segments are described by plain objects with the following data properties:



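  • segment: the string segment.
  • index: the code unit index in the string at which the segment begins.
  • input: the string being segmented.
  • isWordLike: true when the granularity is "word" and the segment is word-like (consisting of letters/numbers/ideographs/etc.); false when the granularity is "word" and the segment is not word-like (consisting of spaces/punctuation/etc.); undefined when the granularity is not "word".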
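
The properties listed above can be inspected directly. This is a small sketch with an arbitrary sample string; the variable names are illustrative only.

```javascript
const sample = "Hi!";

// First segment at "word" granularity.
const [firstWord] = new Intl.Segmenter("en", { granularity: "word" }).segment(sample);
console.log(firstWord.segment);    // Hi
console.log(firstWord.index);      // 0
console.log(firstWord.input);      // Hi!
console.log(firstWord.isWordLike); // true

// First segment at "grapheme" granularity: isWordLike is undefined.
const [firstGrapheme] = new Intl.Segmenter("en", { granularity: "grapheme" }).segment(sample);
console.log(firstGrapheme.segment);    // H
console.log(firstGrapheme.isWordLike); // undefined
```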


Methods of %Segments%.prototype:



%Segments%.prototype.containing(index)



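Returns a segment data object describing the segment in the string that includes the code unit at the specified index, or undefined if the index is out of bounds.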
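
A short sketch of the boundary behavior of containing(), using an arbitrary string that contains a surrogate pair. The variable names are illustrative.

```javascript
// Code units: "a" at 0, the heart emoji at 1-2 (a surrogate pair), "b" at 3.
const text = "a\u{1F499}b";
const graphemes = new Intl.Segmenter("en", { granularity: "grapheme" }).segment(text);

// Any code unit inside a segment resolves to that segment,
// including the second half of a surrogate pair.
console.log(graphemes.containing(1).index);   // 1
console.log(graphemes.containing(2).index);   // 1
console.log(graphemes.containing(3).segment); // b

// Indexes outside the string yield undefined.
console.log(graphemes.containing(-1));          // undefined
console.log(graphemes.containing(text.length)); // undefined
```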



%Segments%.prototype[Symbol.iterator]



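Creates a new %SegmentIterator% instance that finds segments in the input string lazily (on demand), using the Segmenter's locale and granularity, and keeps track of its position within the string.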



Methods of %SegmentIterator%.prototype:



%SegmentIterator%.prototype.next()



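The next() method implements the Iterator interface: it finds the next segment and returns a corresponding IteratorResult object whose value property is a segment data object as described above.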
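
For illustration, the iterator can also be driven by hand instead of through for...of. This sketch uses an arbitrary English string; the variable names are assumptions.

```javascript
const parts = new Intl.Segmenter("en", { granularity: "word" }).segment("Hi there");
const iterator = parts[Symbol.iterator]();

const step1 = iterator.next(); // value describes "Hi", done is false
const step2 = iterator.next(); // value describes " "
const step3 = iterator.next(); // value describes "there"
const step4 = iterator.next(); // value is undefined, done is true

console.log(step1.value.segment, step1.done); // Hi false
console.log(step3.value.index);               // 3
console.log(step4.value, step4.done);         // undefined true

// Each call to [Symbol.iterator]() returns a fresh iterator
// that starts again from the beginning of the string.
const restarted = parts[Symbol.iterator]().next();
console.log(restarted.value.index); // 0
```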



FAQ



? ?



— , . . . CLDR. , CLDR/ICU , .



API ?



, 3- , . TC39 . ; , , .



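Why is line breaking not included?



Line breaking was provided in an earlier version of this API, but it was removed because a line-breaking API would be incomplete: line breaking is typically used when laying out text, and text layout requires a larger set of APIs (for example, determining the width of a rendered string). For this reason, we recommend continuing development of the text layout APIs in CSS Houdini.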



?



API:



  • .
  • .
  • , (.. Web API (Web Platform), ECMAScript).
  • , . CLDR ICU . CSS, . . , , , ; .


?



%SegmentIterator%.prototype, (, seek([inclusiveStartIndex = thisIterator.index + 1]) seekBefore([exclusiveLastIndex = thisIterator.index]), . ECMA-262 ( ). , , .



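Why is this an Intl API instead of String methods?



All of these break types are locale-dependent, and some allow complex options. The segment() method returns an object that produces a SegmentIterator. For many non-trivial cases like this, analogous APIs are placed in the Intl object of ECMA-402. This allows the work done on each instantiation to be shared, which improves performance. A convenience method on String could still be added later, as a follow-on proposal.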
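
A minimal sketch of reusing one Segmenter instance for many strings, assuming that the benefit in question is that construction work (such as locale and option resolution) happens once. The names wordSegmenter and countWords are hypothetical.

```javascript
// Construct the segmenter once and reuse it for every input.
const wordSegmenter = new Intl.Segmenter("en", { granularity: "word" });

// Hypothetical helper: count the word-like segments of a text.
function countWords(text) {
  let count = 0;
  for (const { isWordLike } of wordSegmenter.segment(text)) {
    if (isWordLike) count += 1;
  }
  return count;
}

const counts = ["One two", "three", ""].map(countWords);
console.log(counts.join(", ")); // 2, 1, 0
```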



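What exactly does an index refer to?



An index n refers to a code unit index that could be the start of a segment. For example, when the string "Hello, world\u{1F499}" is segmented by word in English, segments start at indexes 0, 5, 6, 7, and 12. That is, the string is segmented as ┃Hello┃,┃ ┃world┃\u{1F499}┃, where the final segment is a surrogate pair of two code units that encode a single code point.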
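
The indexes in this example can be reproduced with a few lines. This is a sketch and the variable names are illustrative.

```javascript
const greeting = "Hello, world\u{1F499}";
const byWord = new Intl.Segmenter("en", { granularity: "word" });

// Start index (in code units) of every word-granularity segment.
const starts = [...byWord.segment(greeting)].map((s) => s.index);
console.log(starts.join(", ")); // 0, 5, 6, 7, 12

// Indexes count code units, not code points:
console.log(greeting.length);      // 14 code units
console.log([...greeting].length); // 13 code points
```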



?



, next().



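What happens if the index passed for random access is not an integer?



It sounds like you should be working in QA ;)



Index arguments are coerced to Number: null becomes 0, booleans become 0 or 1, strings are parsed as numeric literals, objects are converted to primitives, and Symbol, BigInt, undefined, and NaN values fail*. Fractional components are truncated, and infinite values are accepted as they are (although they are always out of bounds).



* Translator's note: it is not clear what "fail" means here. In Chrome Canary, Symbol and BigInt values throw a TypeError, while undefined and NaN are treated as 0.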
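
The coercion of the index argument can be probed with a small sketch. The helper probe is hypothetical; it reports the segment found, "undefined" for no segment, or the name of the error thrown. The results in the comments follow the ToInteger-style conversion described in the note above and may differ in outdated flagged implementations.

```javascript
const phrase = new Intl.Segmenter("fr", { granularity: "word" }).segment("Allons-y!");

// Hypothetical helper: describe what containing() does for a given argument.
function probe(index) {
  try {
    const hit = phrase.containing(index);
    return hit === undefined ? "undefined" : hit.segment;
  } catch (error) {
    return error.constructor.name;
  }
}

console.log(probe(null));      // Allons (null is coerced to 0)
console.log(probe(true));      // Allons (true is coerced to 1)
console.log(probe("6"));       // -      (the string is parsed as 6)
console.log(probe(7.9));       // y      (the fraction is truncated to 7)
console.log(probe(Infinity));  // undefined (always out of bounds)
console.log(probe(undefined)); // Allons (treated as 0)
console.log(probe(NaN));       // Allons (treated as 0)
console.log(probe(Symbol()));  // TypeError
console.log(probe(10n));       // TypeError
```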








Further reading on Unicode problems in JavaScript.



  1. Joel Spolsky. The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
  2. Dmitri Pavlutin. What every JavaScript developer should know about Unicode
  3. Dr. Axel Rauschmayer. JavaScript for impatient programmers: 17. Unicode – a brief introduction
  4. Dr. Axel Rauschmayer. JavaScript for impatient programmers: 18.6. Atoms of text: Unicode characters, JavaScript characters, grapheme clusters
  5. Jonathan New. "\u{1F4A9}".length === 2
  6. Nicolás Bevacqua. ES6 Strings (and Unicode, ❤) in Depth
  7. Mathias Bynens. JavaScript has a Unicode problem
  8. Mathias Bynens. Unicode-aware regular expressions in ECMAScript 6
  9. Mathias Bynens. Unicode property escapes in JavaScript regular expressions
  10. Mathias Bynens. Unicode sequence property escapes
  11. Awesome Unicode: a curated list of delightful Unicode tidbits, packages and resources


