Documenting lesser-known features in Ruby
กำ
THAI CHARACTER KO KAI,
THAI CHARACTER SARA AM
Codepoint: A number mapped to some meaning
(often a single character)
Codepoints range from 0
to 0x10FFFF
(= 1114111)
A character can be one or multiple codepoints
A character is officially called a grapheme cluster
= a user-perceived character
= a single unit
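A minimal sketch of the difference between codepoints and grapheme clusters, using only core String methods (String#grapheme_clusters needs Ruby 2.5 or later). The variable names are illustrative, and the Thai pair is written with escapes so the codepoints stay visible:

```ruby
# KO KAI (U+0E01) followed by SARA AM (U+0E33)
text = "ab\u0E01\u0E33cd"

codepoint_count = text.codepoints.size         # every codepoint counts
cluster_count   = text.grapheme_clusters.size  # user-perceived characters

puts "codepoints: #{codepoint_count}, grapheme clusters: #{cluster_count}"

# Reversing cluster by cluster keeps SARA AM attached to its base letter.
reversed = text.grapheme_clusters.reverse.join
puts reversed
```

Operations such as reversing or truncating are usually safer on grapheme clusters than on codepoints.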
Unicode® Standard Annex #29
Unicode Text Segmentation
Regexes can match for grapheme clusters with /\X/
"abกำcd".scan(/./) #=> ["a", "b", "ก", "ำ", "c", "d"]
"abกำcd".scan(/\X/) #=> ["a", "b", "กำ", "c", "d"]
Ruby 2.5: String#grapheme_clusters
"abกำcd".grapheme_clusters #=> ["a", "b", "กำ", "c", "d"]
Ä
LATIN CAPITAL LETTER A,
COMBINING DIAERESIS
Unicode® Standard Annex #15:
Unicode Normalization Forms
Turns characters into canonical versions
Part of standard library: unicode_normalize
Is automatically required by Ruby
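A small sketch of why normalization matters, assuming the two spellings of Ä are the precomposed codepoint and the letter plus combining mark. Variable names are illustrative:

```ruby
precomposed = "\u00C4"   # LATIN CAPITAL LETTER A WITH DIAERESIS
decomposed  = "A\u0308"  # LATIN CAPITAL LETTER A + COMBINING DIAERESIS

# Both render the same, but the raw strings differ.
same_raw = precomposed == decomposed

# After normalizing both sides to the same form (NFC here) they compare equal.
same_nfc = precomposed.unicode_normalize(:nfc) == decomposed.unicode_normalize(:nfc)

# NFD goes the other way and splits the precomposed character.
nfd_codepoints = precomposed.unicode_normalize(:nfd).codepoints

puts same_raw, same_nfc, nfd_codepoints.inspect
```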
"Ä".unicode_normalize == "Ä".unicode_normalize # => true
ο
LATIN SMALL LETTER O
ο
GREEK SMALL LETTER OMICRON
o
LATIN SMALL LETTER O
??
and ⁇
1
and l
C
and C
Unicode® Technical Standard #39 to the rescue
unicode-confusable gem
Unicode::Confusable.confusable? "ℜ𝘂ᖯʏ", "Ruby" # => true
İ
LATIN CAPITAL LETTER I WITH DOT ABOVE
i → I
Seems correct…
…for most languages, but what about Turkic languages?
i → İ
Language-dependent!
"i".upcase #=> "I"
Before Ruby 2.4:
"ä".upcase #=> "ä"
Since Ruby 2.4:
"ä".upcase #=> "Ä"
"ä".upcase(:ascii) #=> "ä"
"i".upcase(:turkic) #=> "İ"
Your toolkit: String#upcase, String#downcase, String#swapcase, and String#capitalize take a language context parameter:
:ascii
:turkic
:lithuanian
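The effect of the :ascii and :turkic options can be sketched like this (Ruby 2.4 or later; variable names are illustrative and the non-ASCII letters are written as escapes):

```ruby
default_upcase  = "i".upcase               # language-independent mapping
turkic_upcase   = "i".upcase(:turkic)      # capital I with dot above (U+0130)
turkic_downcase = "I".downcase(:turkic)    # dotless small i (U+0131)
ascii_only      = "\u00E4".upcase(:ascii)  # non-ASCII letters stay untouched

puts [default_upcase, turkic_upcase, turkic_downcase, ascii_only].inspect
```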
:fold
Special way of downcasing a String
Meant for comparing strings
Uses a different mapping algorithm
"ẞ".downcase #=> "ß"
"ẞ".downcase(:fold) #=> "ss"
:lithuanian
not working yet
What about Dutch? ijsland → IJsland
String#casecmp? uses case-folding
String#casecmp only uses ASCII case mapping
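A short sketch contrasting case folding with plain downcasing, and String#casecmp? with String#casecmp. The sample words are illustrative:

```ruby
word = "Stra\u00DFe"  # contains LATIN SMALL LETTER SHARP S

folded = word.downcase(:fold)  # sharp s is folded to "ss"
plain  = word.downcase         # sharp s is already lowercase and stays

# casecmp? compares the case-folded versions of both strings.
unicode_equal = "stra\u00DFe".casecmp?("STRASSE")

# casecmp only folds A-Z/a-z, so the two umlauts differ (non-zero result).
ascii_result = "\u00E4".casecmp("\u00C4")

puts folded, plain, unicode_equal, ascii_result
```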

NEXT LINE
Collection of ways to create new lines:
U+000A (Line Feed)
U+000D (Carriage Return)
U+000D, U+000A (Carriage Return + Line Feed)
To solve this confusion, U+0085 (Next Line) was introduced
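One way to cope with the different line endings is Ruby's /\R/ regex escape, which matches a generic line break, including CR+LF as one unit and U+0085. A minimal sketch with made-up sample text:

```ruby
text = "one\ntwo\rthree\r\nfour\u0085five"

# Split on any kind of line break.
lines = text.split(/\R/)

# Or normalize everything to a plain Line Feed.
normalized = text.gsub(/\R/, "\n")

puts lines.inspect
```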
C0: U+0000..U+001F
␀ ␁ ␂ ␃ ␄ ␅ ␆ ␇ ␈ ␉ ␊ ␋ ␌ ␍ ␎ ␏
␐ ␑ ␒ ␓ ␔ ␕ ␖ ␗ ␘ ␙ ␚ ␛ ␜ ␝ ␞ ␟
Also C0: U+007F
(Delete)
␡
C1: U+0080..U+009F
PAD HOP BPH NBH IND NEL SSA ESA HTS HTJ VTS PLD PLU RI SS2 SS3 DCS PU1 PU2 STS CCH MW SPA EPA SOS SGC SCI CSI ST OSC PM APC
U+0085 is included in /\p{space}/
Control Characters are matched with /\p{Cc}/
Or use characteristics gem
Characteristics.create("\u{80}").c0? #=> false
Characteristics.create("\u{80}").c1? #=> true
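Without any gem, the regex properties mentioned above can be checked directly. This is only a sketch using core Ruby (String#match? needs Ruby 2.4 or later); variable names are illustrative:

```ruby
nel = "\u0085"  # NEXT LINE, a C1 control character

matches_backslash_s = nel.match?(/\s/)          # \s only covers ASCII whitespace
matches_space       = nel.match?(/\p{space}/)   # Unicode whitespace property
matches_control     = nel.match?(/\p{Cc}/)      # general category "Control"

# Count control characters in C0, U+007F and C1 via \p{Cc}.
control_count = (0x00..0x9F).count { |cp| [cp].pack("U").match?(/\p{Cc}/) }

puts matches_backslash_s, matches_space, matches_control, control_count
```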
⸻
THREE-EM DASH
𒐫
﷽
‱
𒈙
Display width is not well defined
Troublesome in fixed-width environments
EastAsianWidth.txt
assigns double width to many Asian characters
However, one character category is "ambiguous"
Ambiguous characters can be single or double-width
unicode-display_width gem
Unicode::DisplayWidth.of("⚀") # => 1
Unicode::DisplayWidth.of("一") # => 2
Unicode::DisplayWidth.of("·", 1) # => 1
Unicode::DisplayWidth.of("·", 2) # => 2
�
<surrogate-D800>
1) UTF-16 Surrogates
U+D800..U+DFFF
2) Too Large
Anything above U+10FFFF (= 1 114 111), for example U+1FFFFF (= 2 097 151) or U+FFFFFFFF (= 4 294 967 295)
Ruby does not let you create these from literals
"\u{D800}"
SyntaxError: (irb):52: invalid Unicode codepoint
"\u{110000}"
SyntaxError: (irb):54: invalid Unicode codepoint (too large)
If you really need to…
[0xD800].pack("U") #=> "\xED\xA0\x80"
[0x110000].pack("U") #=> "\xF4\x90\x80\x80"
String#valid_encoding?
Checks whether the string is valid in its encoding (false if it contains invalid codepoints/bytes)
String#scrub
Replaces invalid codepoints/bytes with U+FFFD
(�)
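A minimal sketch of both methods on a string with one stray byte, plus the surrogate built with pack. The sample strings are made up for illustration:

```ruby
broken = "abc\xFFdef"  # 0xFF can never appear in valid UTF-8

valid_before = broken.valid_encoding?
scrubbed     = broken.scrub        # default replacement is U+FFFD
custom       = broken.scrub("?")   # or pass your own replacement

# A surrogate forced into a UTF-8 string is reported as invalid, too.
surrogate_valid = [0xD800].pack("U").valid_encoding?

puts valid_before, scrubbed, custom, surrogate_valid
```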
n/c
<noncharacter-10FFFF>
1) Non-Characters 66
Range of U+FDD0..U+FDEF
and the last two codepoints of each plane: U+XFFFE, U+XFFFF
2) Private-Use Codepoints 137 468
(U+F0FF)
(U+F8FF)
3) Not-Yet Assigned Codepoints 837 775
1) Non-Characters
/\p{non character codepoint}/
2) Private-Use Codepoints
/\p{private use}/
3) Not-Yet Assigned Codepoints
/\p{unassigned}(?<!\p{non character codepoint})/
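The counts for non-characters and private-use codepoints can be re-derived from the ranges alone. The helper methods below are hypothetical, written for this sketch; the private-use ranges (U+E000..U+F8FF plus planes 15 and 16 without their final two codepoints) are the standard ones, not taken from the slides:

```ruby
# Hypothetical helper: U+FDD0..U+FDEF plus the last two codepoints of each plane.
def noncharacter?(codepoint)
  (0xFDD0..0xFDEF).cover?(codepoint) || (codepoint & 0xFFFE) == 0xFFFE
end

# Hypothetical helper: the three private-use ranges.
def private_use?(codepoint)
  [0xE000..0xF8FF, 0xF0000..0xFFFFD, 0x100000..0x10FFFD].any? { |range| range.cover?(codepoint) }
end

all_codepoints     = (0..0x10FFFF)
noncharacter_count = all_codepoints.count { |cp| noncharacter?(cp) }
private_use_count  = all_codepoints.count { |cp| private_use?(cp) }

# Spot check against the regex properties shown above.
regex_agrees = "\u{FDD0}".match?(/\p{non character codepoint}/) &&
               "\u{F8FF}".match?(/\p{private use}/)

puts noncharacter_count, private_use_count, regex_agrees
```

The number of not-yet-assigned codepoints is left out on purpose, because it changes with every Unicode version.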
] [
NO-BREAK SPACE
There are tons of non-visible characters in Unicode
Only some are considered as whitespace
More examples for "blank" codepoints
][
MONGOLIAN VOWEL SEPARATOR
] [
EM SPACE
][
ZERO WIDTH JOINER
][
INVISIBLE PLUS
]⠀[
BRAILLE PATTERN BLANK
] [
IDEOGRAPHIC SPACE
]𝅙[
MUSICAL SYMBOL NULL NOTEHEAD
Group of characters which render nothing
The whole range of U+E0000..U+E0FFF is ignorable!
U+180B..U+180D
U+FE00..U+FE0F
U+E0100..U+E01EF
259 invisible and ignorable codepoints
Used to trigger a visual variation on preceding character
U+FE0F
makes some text-based emoji switch to image-based ones
U+FE0E
the other way around
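A small sketch of how a variation selector attaches to the preceding character. HEAVY BLACK HEART (U+2764) is an assumed example, not taken from the slides; how the two variants actually render depends on the font and terminal:

```ruby
heart = "\u2764"  # HEAVY BLACK HEART (assumed example character)

text_style  = heart + "\uFE0E"  # requests text presentation
emoji_style = heart + "\uFE0F"  # requests emoji presentation

# The selector is a separate codepoint but part of the same grapheme cluster.
codepoint_count = emoji_style.codepoints.size
cluster_count   = emoji_style.grapheme_clusters.size

# The three ranges listed above add up to the stated total.
selector_count = (0x180B..0x180D).size + (0xFE00..0xFE0F).size + (0xE0100..0xE01EF).size

puts text_style, emoji_style, codepoint_count, cluster_count, selector_count
```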
U+E0001
U+E0020..U+E007F
Allow creation of language tag sequences,
which are deprecated
Not all invisible characters are whitespace
Some are matched by \s
Some more are matched by \p{space}
Ignorables are matched by /\p{default ignorable code point}/
Some are not matched by any property
characteristics gem
Characteristics.create(" ").blank? # => true
Characteristics.create(" ").ignorable? # => false
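Without the gem, the three regex checks named above can be run over a few of the "blank" codepoints. A sketch with illustrative names; the hash keys are labels chosen for this example:

```ruby
samples = {
  "NO-BREAK SPACE"        => "\u00A0",
  "EM SPACE"              => "\u2003",
  "ZERO WIDTH JOINER"     => "\u200D",
  "INVISIBLE PLUS"        => "\u2064",
  "BRAILLE PATTERN BLANK" => "\u2800",
  "IDEOGRAPHIC SPACE"     => "\u3000",
}

# For every sample, record which of the three patterns match.
report = samples.map { |name, char|
  [name, {
    backslash_s: char.match?(/\s/),
    space:       char.match?(/\p{space}/),
    ignorable:   char.match?(/\p{default ignorable code point}/),
  }]
}.to_h

report.each { |name, flags| puts "#{name}: #{flags}" }
```

BRAILLE PATTERN BLANK is an example of a codepoint that none of the three patterns catch.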
👨🏻🍳
MAN COOK: LIGHT SKIN TONE
Seven (!) different concepts of Emoji creation
(U+FE0E)
Allows you to combine multiple emoji using U+200D (ZERO WIDTH JOINER)
It is valid to join all kinds of emoji…
…but only the ones from the Emoji Standard are recommended
"Portugal" 🇵🇹
Tag Sequences are deprecated
…but let's use Tag Characters for Scottish flags!
"Scotland" 🏴
unicode-emoji gem
"String with Emoji like 🛌🏽 or 🤾🏽♀️ or 🏴".scan \
Unicode::Emoji::REGEX # => ["🛌🏽", "🤾🏽♀️", "🏴"]
Unicode::Emoji.list
# => {"Smileys & People"
# =>{"face-positive"
# =>["😀", "😁", "😂" …
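Three of the emoji construction mechanisms mentioned above (regional indicator flags, tag sequences, and joined sequences) can be assembled by hand from codepoints. This sketch uses only Array#pack; the lambda and variable names are hypothetical, and the codepoint values are the standard ones rather than something shown on the slides:

```ruby
# Flags: two REGIONAL INDICATOR SYMBOL letters (U+1F1E6 is the letter A).
regional_indicator_a = 0x1F1E6
flag_for = lambda do |country_code|
  country_code.upcase.each_char
              .map { |letter| regional_indicator_a + (letter.ord - "A".ord) }
              .pack("U*")
end
portugal = flag_for.call("PT")

# Tag sequence: WAVING BLACK FLAG, tag characters spelling "gbsct", CANCEL TAG.
scotland = [0x1F3F4, *"gbsct".bytes.map { |byte| 0xE0000 + byte }, 0xE007F].pack("U*")

# Joined sequence: MAN, light skin tone modifier, ZERO WIDTH JOINER, COOKING.
man_cook = [0x1F468, 0x1F3FB, 0x200D, 0x1F373].pack("U*")

puts portugal, scotland, man_cook
```

Whether these show up as a single glyph depends on the font; the strings themselves are just ordinary codepoint sequences.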
① Graphemes
\X or String#each_grapheme_cluster
② Normalization
unicode_normalize (stdlib)
③ Confusables
unicode-confusable (gem)
④ Case-Mapping
Built in, optional :turkic option
⑤ Control Characters
characteristics (gem)
⑥ Display Width
unicode-display_width (gem)
⑦ Invalid
String#valid_encoding?, String#scrub
⑧ Unassigned
\p{unassigned}, \p{private use}
⑨ Invisible
characteristics (gem)
⑩ Emoji
unicode-emoji (gem)