⑩ Unicode Characters You Should Know About as a 👩‍💻




@JanLelis / November 2017 / RubyConf

idiosyncratic-ruby.com

Documenting lesser-known features in Ruby

① U+0E01, U+0E33

กำ

THAI CHARACTER KO KAI,
THAI CHARACTER SARA AM

Unicode provides a unique number
for every character,

no matter what the platform,

no matter what the program,

no matter what the language.

Before There Was Unicode…

What is a Character in Unicode?

Codepoint: A number mapped to some meaning
(often a single character)

Codepoints range from 0 to 0x10FFFF (= 1114111)

A character can be one or multiple codepoints

A character is officially called a grapheme cluster

Grapheme Clusters

= a user-perceived character

= a single unit


Unicode® Standard Annex #29
Unicode Text Segmentation

Grapheme Clusters ✔

Regexes can match for grapheme clusters with /\X/

"abกำcd".scan(/./) #=> ["a", "b", "ก", "ำ", "c", "d"]
"abกำcd".scan(/\X/) #=> ["a", "b", "กำ", "c", "d"]

Ruby 2.5: String#grapheme_clusters

"abกำcd".grapheme_clusters #=> ["a", "b", "กำ", "c", "d"]
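To make the codepoint vs. grapheme cluster distinction concrete, here is a minimal sketch using only core String methods. The Thai example is written with escapes so both codepoints are visible; String#grapheme_clusters assumes Ruby 2.5 or newer.

```ruby
# THAI CHARACTER KO KAI + THAI CHARACTER SARA AM
text = "\u0E01\u0E33"

codepoints = text.codepoints          # the raw numbers behind the string
chars      = text.chars               # one entry per codepoint
clusters   = text.grapheme_clusters   # one entry per user-perceived character

puts codepoints.map { |cp| format("U+%04X", cp) }.join(", ")
puts "codepoints: #{chars.size}, grapheme clusters: #{clusters.size}"
```

String#length counts codepoints, so anything that should count or cut "characters" as users see them is better served by the cluster-based view.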

② U+0041, U+0308

Ä

LATIN CAPITAL LETTER A,
COMBINING DIAERESIS

Combined Character?

Letter Ä Codepoint Analysis

Normalization

Unicode® Standard Annex #15:
Unicode Normalization Forms

Turns characters into canonical versions

Normalization ✔

Part of standard library: unicode_normalize

Is automatically required by Ruby

"Ä".unicode_normalize == "Ä".unicode_normalize # => true
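The two forms of Ä look identical in source code, so the following sketch spells them out with escapes. It assumes nothing beyond String#unicode_normalize from the standard library and its :nfc / :nfd forms.

```ruby
decomposed  = "A\u0308"  # LATIN CAPITAL LETTER A + COMBINING DIAERESIS
precomposed = "\u00C4"   # LATIN CAPITAL LETTER A WITH DIAERESIS

# Without normalization the strings differ, although they render the same
equal_raw = decomposed == precomposed

nfc = decomposed.unicode_normalize(:nfc)   # composes into one codepoint
nfd = precomposed.unicode_normalize(:nfd)  # decomposes into two codepoints

puts "raw equal:  #{equal_raw}"
puts "NFC length: #{nfc.length}, NFD length: #{nfd.length}"
puts "normalized equal: #{nfc == precomposed.unicode_normalize(:nfc)}"
```

Normalizing both sides to the same form before comparing or storing is the usual approach; which form to pick depends on the application.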

③ U+006F

ο

LATIN SMALL LETTER O

U+006F?
U+03BF!

ο

GREEK SMALL LETTER OMICRON

U+006F

o

LATIN SMALL LETTER O
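A quick way to tell the two letters apart is to look at their codepoints and scripts. This is only a sketch for inspecting single characters, not a confusable detector; the script properties \p{Latin} and \p{Greek} are the only regex features it relies on.

```ruby
latin_o       = "\u006F"  # LATIN SMALL LETTER O
greek_omicron = "\u03BF"  # GREEK SMALL LETTER OMICRON

[latin_o, greek_omicron].each do |char|
  script = if char.match?(/\p{Latin}/) then "Latin"
           elsif char.match?(/\p{Greek}/) then "Greek"
           else "other"
           end
  puts format("%s  U+%04X  %s", char, char.ord, script)
end

puts "equal? #{latin_o == greek_omicron}"
```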

Letter o Codepoint Analysis

The letter o has 75 confusables:

ం ಂ ം ං ० ੦ ૦ ௦ ౦ ೦ ൦ ๐ ໐ ၀ ‎٥‎ ۵ o ℴ 𝐨 𝑜 𝒐 𝓸 𝔬 𝕠 𝖔 𝗈 𝗼 𝘰 𝙤 𝚘 ᴏ ᴑ ꬽ ο 𝛐 𝜊 𝝄 𝝾 𝞸 σ 𝛔 𝜎 𝝈 𝞂 𝞼 ⲟ о ჿ օ ‎ס‎ ‎ه‎ ‎𞸤‎ ‎𞹤‎ ‎𞺄‎ ‎ﻫ‎ ‎ﻬ‎ ﻪ‎ ‎ﻩ‎ ‎ھ‎ ‎ﮬ‎ ‎ﮭ‎ ‎ﮫ‎ ‎ﮪ‎ ‎ہ‎ ‎ﮨ‎ ‎ﮩ‎ ‎ﮧ‎ ‎ﮦ‎ ‎ە‎ ഠ ဝ 𐓪 𑣈 𑣗 𐐬

Confusables: More Examples

?? and ⁇

1 and l

C and C

Confusables ✔

Unicode® Technical Standard #39 to the rescue


unicode-confusable gem

Unicode::Confusable.confusable? "ℜ𝘂ᖯʏ", "Ruby" # => true

④ U+0130

İ

LATIN CAPITAL LETTER I WITH DOT ABOVE

Case-Mapping

i → I

Seems correct…


…for most languages, but what about Turkic languages?

i → İ

Language-dependent!

Case-Mapping

"i".upcase #=> "I"


Before Ruby 2.4:

"ä".upcase #=> "ä"


Since Ruby 2.4:

"ä".upcase #=> "Ä"

"ä".upcase(:ascii) #=> "ä"

"i".upcase(:turkic) #=> "İ"

Case-Mapping ✔

Your toolkit: String#upcase, String#downcase, String#swapcase, and String#capitalize take a language context parameter:

:ascii

:turkic

:lithuanian
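The options can be tried side by side. A small sketch, assuming Ruby 2.4 or newer; results are written as escapes so the dotted and dotless forms cannot be mixed up.

```ruby
upper_default = "i".upcase              # plain "I"
upper_turkic  = "i".upcase(:turkic)     # U+0130, capital I with dot above
lower_turkic  = "I".downcase(:turkic)   # U+0131, small dotless i
upper_ascii   = "\u00E4".upcase(:ascii) # ä is left alone in ASCII-only mode

{ "i upcase" => upper_default, "i upcase turkic" => upper_turkic,
  "I downcase turkic" => lower_turkic, "ä upcase ascii" => upper_ascii }.each do |label, str|
  puts format("%-18s %s (U+%04X)", label, str, str.ord)
end
```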

Case-Folding

Special way of downcasing a String

Meant for comparing strings

Uses a different mapping algorithm

"ẞ".downcase #=> "ß"
"ẞ".downcase(:fold) #=> "ss"

Caveats ✗

:lithuanian not working yet

What about Dutch? ijsland → IJsland

String#casecmp? uses case-folding
String#casecmp only uses ASCII
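The difference between the two comparison methods shows up as soon as a non-ASCII letter is involved. A minimal sketch with a German example word (the word itself is just an illustration):

```ruby
street_mixed = "Stra\u00DFe"  # "Straße", with LATIN SMALL LETTER SHARP S
street_upper = "STRASSE"

# Case-folding maps ß to "ss", so the folded strings are identical
folded_equal = street_mixed.downcase(:fold) == street_upper.downcase(:fold)

unicode_match = street_mixed.casecmp?(street_upper)  # uses case-folding
ascii_compare = street_mixed.casecmp(street_upper)   # ASCII-only, so non-zero here

puts "folded equal: #{folded_equal}"
puts "casecmp?:     #{unicode_match}"
puts "casecmp:      #{ascii_compare}"
```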

⑤ U+0085

␤

NEXT LINE

Next Line Control Character

Collection of ways to create new lines:

U+000A # (Line Feed)
U+000D # (Carriage Return)
U+000D, U+000A # (Carriage Return + Line Feed)

To solve this confusion, U+0085 was introduced

U+0085 # (Next Line)
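When input may use any of these conventions, one option is to split on all of them at once. The pattern below is a sketch that covers only the four forms listed above; the variable names are made up for the example.

```ruby
# CRLF has to come first so it is not split into two separate breaks
any_newline = /\r\n|[\n\r\u0085]/

text  = "one\ntwo\r\nthree\rfour\u0085five"
lines = text.split(any_newline)

p lines
```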

Control Characters

C0: U+0000..U+001F

␀ ␁ ␂ ␃ ␄ ␅ ␆ ␇ ␈ ␉ ␊ ␋ ␌ ␍ ␎ ␏
␐ ␑ ␒ ␓ ␔ ␕ ␖ ␗ ␘ ␙ ␚ ␛ ␜ ␝ ␞ ␟

Also C0: U+007F (Delete)

␡

C1: U+0080..U+009F

PAD HOP BPH NBH IND NEL SSA ESA HTS HTJ VTS PLD PLU RI SS2 SS3
DCS PU1 PU2 STS CCH MW SPA EPA SOS SGC SCI CSI ST OSC PM APC

Control Characters ✔

␤ is included in /\p{space}/ matches

Control Characters are matched with /\p{Cc}/

Or use characteristics gem

Characteristics.create("\u{80}").c0? #=> false
Characteristics.create("\u{80}").c1? #=> true
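Without any gem, the \p{Cc} property alone is enough to find or strip control characters. A small sketch with a made-up sample string containing one C0 character, NEXT LINE from C1, and U+007F:

```ruby
text = "a\u0000b\u0085c\u007Fd"

controls = text.scan(/\p{Cc}/)      # every control character in the string
cleaned  = text.gsub(/\p{Cc}/, "")  # the string with all of them removed

# NEXT LINE is a control character that also counts as whitespace
nel_is_space = "\u0085".match?(/\p{space}/)

puts controls.map { |c| format("U+%04X", c.ord) }.join(", ")
puts cleaned
puts "U+0085 matches \\p{space}: #{nel_is_space}"
```

Note that stripping all of \p{Cc} also removes tabs and newlines, so real input usually needs an allow-list for those.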

⑥ U+2E3B

⸻

THREE-EM DASH

Three EM Dash on the Terminal

Three EM Dash on Twitter

Other Wide Characters


𒐫 ﷽

‱     𒈙

Character Width

Not well defined

Troublesome in fixed-width environments

Character Width: East Asian Width

EastAsianWidth.txt assigns double width to many Asian characters

However, one character category is "ambiguous"

Ambiguous characters can be single or double-width

Character Width ✔

unicode-display_width gem

Unicode::DisplayWidth.of("⚀") # => 1
Unicode::DisplayWidth.of("一") # => 2
Unicode::DisplayWidth.of("·", 1) # => 1
Unicode::DisplayWidth.of("·", 2) # => 2

⑦ U+D800

�

<surrogate-D800>

Invalid Codepoints


1) UTF-16 Surrogates

Any codepoint in the UTF-16 Surrogates Area: U+D800..U+DFFF
—
Even though UTF-8 and UTF-32 could represent them

2) Too Large

Any codepoint > U+10FFFF (= 1 114 111)
—
UTF-32 maximum is U+FFFFFFFF (= 4 294 967 295)
UTF-8 maximum, 4 bytes is U+1FFFFF (= 2 097 151)

Invalid Codepoints

Ruby does not let you create these from literals

"\u{D800}"
SyntaxError: (irb):52: invalid Unicode codepoint
"\u{110000}"
SyntaxError: (irb):54: invalid Unicode codepoint (too large)

If you really need to…

[0xD800].pack("U") #=> "\xED\xA0\x80"
[0x110000].pack("U") #=> "\xF4\x90\x80\x80"

Invalid Codepoints ✔


String#valid_encoding?
Returns false if the string contains invalid codepoints/bytes for its encoding


String#scrub
Replaces invalid codepoints/bytes with U+FFFD (�)
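Both methods in action, as a minimal sketch. The broken input is a stray 0xFF byte, which can never occur in UTF-8, plus the packed surrogate from the earlier slide.

```ruby
broken = "abc\xFF"  # UTF-8 string ending in an invalid byte

valid_before = broken.valid_encoding?  # false
repaired     = broken.scrub            # invalid byte becomes U+FFFD
custom       = broken.scrub("?")       # or any replacement you pass in

surrogate       = [0xD800].pack("U")   # encoded surrogate, invalid in UTF-8
surrogate_valid = surrogate.valid_encoding?
surrogate_fixed = surrogate.scrub

puts "valid before scrub: #{valid_before}"
puts "after scrub:        #{repaired.inspect}"
puts "custom replacement: #{custom.inspect}"
puts "surrogate valid:    #{surrogate_valid}, after scrub: #{surrogate_fixed.valid_encoding?}"
```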

⑧ U+10FFFF

n/c

<noncharacter-10FFFF>

Unassigned Codepoints


1) Non-Characters 66

Range of U+FDD0..U+FDEF and the last two codepoints of each plane: U+XFFFE, U+XFFFF


2) Private-Use Codepoints 137 468

 (U+F0FF)

 (U+F8FF)


3) Not-Yet Assigned Codepoints 837 775

Codepoint Distribution

Unassigned Codepoints ✔


1) Non-Characters

/\p{non character codepoint}/


2) Private-Use Codepoints

/\p{private use}/


3) Not-Yet Assigned Codepoints

/\p{unassigned}(?<!\p{non character codepoint})/
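The three properties can be combined into a small classifier. This is a sketch: classify is a hypothetical helper name, and whether a codepoint counts as unassigned depends on the Unicode version your Ruby ships with (U+0378 is assumed to still be unassigned).

```ruby
# Hypothetical helper: sorts a single character into one of the groups above.
# Non-characters are checked first because they also match \p{Unassigned}.
def classify(char)
  case char
  when /\A\p{Noncharacter_Code_Point}\z/ then :noncharacter
  when /\A\p{Private_Use}\z/             then :private_use
  when /\A\p{Unassigned}\z/              then :unassigned
  else                                        :assigned
  end
end

["\u{FDD0}", "\u{10FFFF}", "\u{E000}", "\u0378", "a"].each do |char|
  puts format("U+%04X  %s", char.ord, classify(char))
end
```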

⑨ U+00A0

] [

NO-BREAK SPACE

Invisible Characters

There are tons of non-visible characters in Unicode

Only some are considered whitespace

Invisible Characters

More examples for "blank" codepoints

U+180E

]᠎[

MONGOLIAN VOWEL SEPARATOR

U+2003

] [

EM SPACE

U+200C

]‌[

ZERO WIDTH NON-JOINER

U+2064

]⁤[

INVISIBLE PLUS

U+2800

]⠀[

BRAILLE PATTERN BLANK

U+3000

] [

IDEOGRAPHIC SPACE

U+1D159

]𝅙[

MUSICAL SYMBOL NULL NOTEHEAD

Invisible: Ignorables

Group of characters which render nothing

The whole range of U+E0000..U+E0FFF is ignorable!

Invisible: Variation Selectors

U+180B..U+180D
U+FE00..U+FE0F
U+E0100..U+E01EF

259 invisible and ignorable codepoints

Used to trigger a visual variation on preceding character

U+FE0F makes some text-based emoji switch to image-based ones

U+FE0E the other way around
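A variation selector is an extra codepoint that stays attached to the character before it. The sketch below uses HEAVY BLACK HEART (U+2764) as the base, with U+FE0F requesting emoji presentation and U+FE0E requesting text presentation; how each one actually renders depends on the font and platform.

```ruby
heart       = "\u2764"
emoji_style = heart + "\uFE0F"  # base + emoji presentation selector
text_style  = heart + "\uFE0E"  # base + text presentation selector

[heart, emoji_style, text_style].each do |str|
  hex = str.codepoints.map { |cp| format("U+%04X", cp) }.join(" ")
  puts "#{str}  #{hex}  clusters: #{str.grapheme_clusters.size}"
end

selector_ignorable = "\uFE0F".match?(/\p{Default_Ignorable_Code_Point}/)
puts "selector is ignorable: #{selector_ignorable}"
```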

Invisible: Tags

U+E0001
U+E0020..U+E007F

Allow creation of language tag sequences,
which are deprecated

Invisible ✔

Not all invisible characters are whitespace

Some are matched by \s

Some more are matched by \p{space}

Ignorables are matched by
/\p{default ignorable code point}/

Some are not matched by any property

characteristics gem

Characteristics.create(" ").blank? # => true
Characteristics.create(" ").ignorable? # => false
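The layers listed above can be checked with plain regexes. A sketch over a handful of the blank codepoints from the earlier slides (the selection and the hash layout are just for illustration):

```ruby
samples = {
  "CHARACTER TABULATION"  => "\t",
  "NO-BREAK SPACE"        => "\u00A0",
  "EM SPACE"              => "\u2003",
  "ZERO WIDTH JOINER"     => "\u200D",
  "BRAILLE PATTERN BLANK" => "\u2800"
}

report = samples.transform_values do |char|
  {
    backslash_s: char.match?(/\s/),          # ASCII whitespace only
    space:       char.match?(/\p{space}/),   # Unicode whitespace
    ignorable:   char.match?(/\p{Default_Ignorable_Code_Point}/)
  }
end

report.each { |name, flags| puts format("%-22s %s", name, flags) }
```

BRAILLE PATTERN BLANK falls through all three checks, which is the "not matched by any property" case.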

⑩ U+1F468, U+1F3FB, U+200D, U+1F373

👨🏻‍🍳


MAN COOK: LIGHT SKIN TONE

U+1F468, U+1F3FB, U+200D, U+1F373 (on Twitter)

MAN COOK: LIGHT SKIN TONE

Emoji

Seven (!) different concepts of Emoji creation

  1. Single-codepoint Emoji
  2. Single-codepoint Emoji with Emoji Presentation Selector (U+FE0F)
  3. Base Emoji with a Skin-Tone Modifier
  4. Keycap sequences
  5. Zero-Width Joiner sequences
  6. Region flags
  7. Sub-region flags

Emoji: ZWJ Sequences

Allows you to combine multiple emoji using U+200D

It is valid to join all kinds of emoji…

…but only the ones from the Emoji Standard are recommended
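The man-cook emoji from the title slide can be assembled by hand, which shows how a ZWJ sequence is put together. A sketch; the number of grapheme clusters reported depends on the Unicode version of your Ruby, so it is printed rather than assumed.

```ruby
man        = "\u{1F468}"  # MAN
light_skin = "\u{1F3FB}"  # EMOJI MODIFIER FITZPATRICK TYPE-1-2
zwj        = "\u200D"     # ZERO WIDTH JOINER
cooking    = "\u{1F373}"  # COOKING

man_cook = man + light_skin + zwj + cooking

puts man_cook
puts "codepoints: #{man_cook.codepoints.size}"
puts "bytes:      #{man_cook.bytesize}"
puts "clusters:   #{man_cook.grapheme_clusters.size}"
```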

Emoji: Region Flags

"Portugal" 🇵🇹
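Region flags are two regional indicator symbols, one for each letter of the two-letter region code. A minimal sketch; region_flag is a hypothetical helper and does no validation of the code it is given.

```ruby
# U+1F1E6 is REGIONAL INDICATOR SYMBOL LETTER A; the rest follow alphabetically
REGIONAL_INDICATOR_A = 0x1F1E6

# Hypothetical helper: maps each ASCII letter to its regional indicator
def region_flag(code)
  code.upcase.each_char.map { |letter| REGIONAL_INDICATOR_A + (letter.ord - "A".ord) }.pack("U*")
end

portugal = region_flag("PT")
puts portugal
puts portugal.codepoints.map { |cp| format("U+%04X", cp) }.join(", ")
```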

Emoji: Sub-Region Flags

Tag Sequences are deprecated

…but let's use Tag Characters for Scottish flags!

Emoji: Sub-Region Flags

"Scotland" 🏴󠁧󠁢󠁳󠁣󠁴󠁿
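A sub-region flag is WAVING BLACK FLAG (U+1F3F4), followed by the region code spelled in tag characters, closed by CANCEL TAG (U+E007F). A sketch with a hypothetical subregion_flag helper; "gbsct" is the code used for Scotland.

```ruby
# Hypothetical helper: tag characters mirror ASCII at an offset of 0xE0000
def subregion_flag(code)
  tags = code.downcase.each_char.map { |char| 0xE0000 + char.ord }
  ([0x1F3F4] + tags + [0xE007F]).pack("U*")
end

scotland = subregion_flag("gbsct")
puts scotland
puts scotland.codepoints.map { |cp| format("U+%04X", cp) }.join(" ")
```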

Emoji ✔

unicode-emoji gem

"String with Emoji like 🛌🏽 or 🤾🏽‍♀️ or 🏴󠁧󠁢󠁳󠁣󠁴󠁿".scan \
Unicode::Emoji::REGEX # => ["🛌🏽", "🤾🏽‍♀️", "🏴󠁧󠁢󠁳󠁣󠁴󠁿"]
Unicode::Emoji.list
# => {"Smileys & People"
#   =>{"face-positive"
#      =>["😀", "😁", "😂" …

Summary

① Graphemes
\X or String#each_grapheme_cluster


② Normalization
unicode_normalize (stdlib)


③ Confusables
unicode-confusable (gem)


Summary

④ Case-Mapping
Built in, optional :turkic option


⑤ Control Characters
characteristics (gem)


⑥ Display Width
unicode-display_width (gem)


Summary

⑦ Invalid
String#valid_encoding?, String#scrub


⑧ Unassigned
\p{unassigned}, \p{private use}


⑨ Invisible
characteristics (gem)


⑩ Emoji
unicode-emoji (gem)

Thanks!

@JanLelis / idiosyncratic-ruby.com