Awesome Unicode

Repository: jagracey/Awesome-Unicode (git.io/Awesome-Unicode)

A curated list of delightful Unicode tidbits, packages and resources.

Please read the contribution guidelines before contributing. Key Unicode terminology is defined in the glossary.



Foreword

Unicode is Awesome! Prior to Unicode, international communication was grueling: everyone had defined their own extended character set in the upper half of the 8-bit range, above ASCII (called code pages), and these sets conflicted with one another. Just think of German speakers coordinating with Korean speakers over which 128-character code page to use. Thankfully the Unicode standard caught on and unified communication. Unicode 8.0 standardizes over 120,000 characters from 129 scripts, some modern, some ancient, and some still undeciphered. Unicode handles left-to-right and right-to-left text, combining marks, and includes diverse cultural, political, and religious characters and emojis. Unicode is awesomely human, and ultimately underappreciated.


Contents

Quick Unicode Background

What Characters Does the Unicode Standard Include?

The Unicode Standard defines codes for characters used in all the major languages written today. Scripts include the European alphabetic scripts, Middle Eastern right-to-left scripts, and many scripts of Asia.

The Unicode Standard further includes punctuation marks, diacritics, mathematical symbols, technical symbols, arrows, dingbats, emoji, etc. It provides codes for diacritics, which are modifying character marks such as the tilde (~), that are used in conjunction with base characters to represent accented letters (ñ, for example). In all, the Unicode Standard, Version 9.0 provides codes for 128,172 characters from the world's alphabets, ideograph sets, and symbol collections.

The majority of common-use characters fit into the first 64K code points, an area of the codespace that is called the basic multilingual plane, or BMP for short. There are sixteen other supplementary planes available for encoding other characters, with currently over 850,000 unused code points. More characters are under consideration for addition to future versions of the standard.

The Unicode Standard also reserves code points for private use. Vendors or end users can assign these internally for their own characters and symbols, or use them with specialized fonts. There are 6,400 private use code points on the BMP and another 131,068 supplementary private use code points, should 6,400 be insufficient for particular applications.

Unicode Character Encodings

Character encoding standards define not only the identity of each character and its numeric value, or code point, but also how this value is represented in bits.

The Unicode Standard defines three encoding forms that allow the same data to be transmitted in a byte, word or double word oriented format (i.e. in 8, 16 or 32-bits per code unit). All three encoding forms encode the same common character repertoire and can be efficiently transformed into one another without loss of data. The Unicode Consortium fully endorses the use of any of these encoding forms as a conformant way of implementing the Unicode Standard.

UTF-8 is popular for HTML and similar protocols. UTF-8 is a way of transforming all Unicode characters into a variable length encoding of bytes. It has the advantages that the Unicode characters corresponding to the familiar ASCII set have the same byte values as ASCII, and that Unicode characters transformed into UTF-8 can be used with much existing software without extensive software rewrites.

UTF-16 is popular in many environments that need to balance efficient access to characters with economical use of storage. It is reasonably compact and all the heavily used characters fit into a single 16-bit code unit, while all other characters are accessible via pairs of 16-bit code units.

UTF-32 is useful where memory space is no concern, but fixed width, single code unit access to characters is desired. Each Unicode character is encoded in a single 32-bit code unit when using UTF-32.

All three encoding forms need at most 4 bytes (or 32-bits) of data for each character.
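
As a rough illustration of that claim, the sketch below counts how many code units a single code point needs in each encoding form. The helper name `codeUnitCounts` is made up for this example, and it assumes an environment with a global `TextEncoder` (modern browsers and Node.js).

```javascript
// Hypothetical helper: count the code units one code point needs per encoding form.
function codeUnitCounts(codePoint) {
  const text = String.fromCodePoint(codePoint);
  return {
    utf8Bytes: new TextEncoder().encode(text).length, // 8-bit code units
    utf16Units: text.length,                          // 16-bit code units (JS strings are UTF-16)
    utf32Units: 1                                     // UTF-32 always uses one 32-bit code unit
  };
}

console.log(codeUnitCounts(0x41));    // "A":  1 byte,  1 UTF-16 unit
console.log(codeUnitCounts(0x20AC));  // "€":  3 bytes, 1 UTF-16 unit
console.log(codeUnitCounts(0x1F4A9)); // "💩": 4 bytes, 2 UTF-16 units
```

In every case the total stays at or below 4 bytes per character, whichever form is used.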

Let's Talk Numbers

The Unicode character set is divided into 17 core segments called "planes", which are further divided into blocks. Each plane has space for 65,536 (2¹⁶) code points, supporting a grand total of 1,114,112 code points. There are two "Private Use Area" planes (#16 and #17) that are allocated to be used however one wishes. These two Private Use planes account for 131,072 code points.

| # | Name | Range |
| --- | --- | --- |
| 1 | Basic Multilingual Plane | U+0000 to U+FFFF |
| 2 | Supplementary Multilingual Plane | U+10000 to U+1FFFF |
| 3 | Supplementary Ideographic Plane | U+20000 to U+2FFFF |
| 4 | Tertiary Ideographic Plane | U+30000 to U+3FFFF |
| 5 | Plane 5 (unassigned) | U+40000 to U+4FFFF |
| 6 | Plane 6 (unassigned) | U+50000 to U+5FFFF |
| 7 | Plane 7 (unassigned) | U+60000 to U+6FFFF |
| 8 | Plane 8 (unassigned) | U+70000 to U+7FFFF |
| 9 | Plane 9 (unassigned) | U+80000 to U+8FFFF |
| 10 | Plane 10 (unassigned) | U+90000 to U+9FFFF |
| 11 | Plane 11 (unassigned) | U+A0000 to U+AFFFF |
| 12 | Plane 12 (unassigned) | U+B0000 to U+BFFFF |
| 13 | Plane 13 (unassigned) | U+C0000 to U+CFFFF |
| 14 | Plane 14 (unassigned) | U+D0000 to U+DFFFF |
| 15 | Supplementary Special-purpose Plane | U+E0000 to U+EFFFF |
| 16 | Supplementary Private Use Area - A | U+F0000 to U+FFFFF |
| 17 | Supplementary Private Use Area - B | U+100000 to U+10FFFF |

The first plane is called the Basic Multilingual Plane or BMP. It contains the code points from U+0000 to U+FFFF, which are the most frequently used characters. The other sixteen planes (U+010000 → U+10FFFF) are called supplementary planes or astral planes.
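
Because every plane spans exactly 0x10000 code points, the plane of a code point can be read off its upper bits. The sketch below shows one way to do that; `planeOf` and `isSupplementary` are names invented for this example. It returns the conventional zero-based plane index (0 for the BMP), so add 1 to match the # column of the table above.

```javascript
// Hypothetical helper: zero-based plane index of a code point (0 = BMP, 1..16 = supplementary).
function planeOf(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0 || codePoint > 0x10FFFF) {
    throw new RangeError('Not a Unicode code point: ' + codePoint);
  }
  return codePoint >> 16; // same as Math.floor(codePoint / 0x10000)
}

// Hypothetical helper: true for code points in the supplementary (astral) planes.
function isSupplementary(codePoint) {
  return planeOf(codePoint) > 0;
}

console.log(planeOf(0x00E9));          // 0  (BMP)
console.log(planeOf(0x1F4A9));         // 1  (Supplementary Multilingual Plane)
console.log(planeOf(0x10FFFF));        // 16 (Supplementary Private Use Area - B)
console.log(isSupplementary(0x1D306)); // true
```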

UTF-16 Surrogate Pairs

Characters outside the BMP, e.g. U+1D306 tetragram for centre (𝌆), can only be encoded in UTF-16 using two 16-bit code units: 0xD834 0xDF06. This is called a surrogate pair. Note that a surrogate pair only represents a single character. The first code unit of a surrogate pair is always in the range from 0xD800 to 0xDBFF, and is called a high surrogate or a lead surrogate. The second code unit of a surrogate pair is always in the range from 0xDC00 to 0xDFFF, and is called a low surrogate or a trail surrogate.

-- Mathias Bynens

Surrogate pair: A representation for a single abstract character that consists of a sequence of two 16-bit code units, where the first value of the pair is a high-surrogate code unit and the second value is a low-surrogate code unit. Surrogate pairs are used only in UTF-16. (See Section 3.9, Unicode Encoding Forms.) -- Unicode 8.0.0 Chapter 3 - Surrogates

Calculating Surrogate Pairs

The Unicode character 💩 Pile of Poo (U+1F4A9) in UTF-16 must be encoded as a surrogate pair, i.e. two surrogates. To convert any code point outside the BMP to a surrogate pair, use the following algorithm (in JavaScript). Keep in mind that we're using hexadecimal notation.

 // Convert a supplementary code point (U+10000 to U+10FFFF) into its two surrogates
 var High_Surrogate = function(Code_Point){ return Math.floor((Code_Point - 0x10000) / 0x400) + 0xD800 };
 var Low_Surrogate  = function(Code_Point){ return (Code_Point - 0x10000) % 0x400 + 0xDC00 };

 // Reverses the conversion
 var CodePoint = function(HighSurrogate, LowSurrogate){
   return (HighSurrogate - 0xD800) * 0x400 + LowSurrogate - 0xDC00 + 0x10000;
 };

 > var codepoint = 0x1F4A9;   // 0x1F4A9 == 128169
 > High_Surrogate(codepoint).toString(16)
 "d83d"                       // 0xD83D == 55357
 > Low_Surrogate(codepoint).toString(16)
 "dca9"                       // 0xDCA9 == 56489

 > String.fromCharCode( High_Surrogate(codepoint), Low_Surrogate(codepoint) );
 "💩"
 > String.fromCodePoint(0x1F4A9)
 "💩"
 > '\ud83d\udca9'
 "💩"

Composing & Decomposing

Unicode includes a mechanism for modifying character shape that greatly extends the supported glyph repertoire. This covers the use of combining diacritical marks. They are inserted after the main character. Multiple combining diacritics may be stacked over the same character. Unicode also contains precomposed versions of most letter/diacritic combinations in normal use.

Certain sequences of characters can also be represented as a single character, called a precomposed character (or composite or decomposable character). For example, the character "ü" can be encoded as the single code point U+00FC "ü" or as the base character U+0075 "u" followed by the non-spacing character U+0308 "¨". The Unicode Standard encodes precomposed characters for compatibility with established standards such as Latin 1, which includes many precomposed characters such as "ü" and "ñ".

Precomposed characters may be decomposed for consistency or analysis. For example, in alphabetizing (collating) a list of names, the character "ü" may be decomposed into a "u" followed by the non-spacing character "¨". Once the character has been decomposed, it may be easier for the collation to work with the character because it can be processed as a "u" with modifications. This allows easier alphabetical sorting for languages where character modifiers do not affect alphabetical order. The Unicode Standard defines the decompositions for all precomposed characters. It also defines normalization forms to provide for unique representations of characters.
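
The composition and decomposition described above can be tried out with JavaScript's built-in `String.prototype.normalize`. The sketch below is only an illustration: `stripMarks` is a hypothetical helper, and it assumes an engine that supports Unicode property escapes (`\p{M}`, ES2018 or later).

```javascript
const precomposed = '\u00FC';  // "ü" as a single code point
const decomposed  = 'u\u0308'; // "u" followed by U+0308 COMBINING DIAERESIS

console.log(precomposed === decomposed);                  // false: different code point sequences
console.log(precomposed.normalize('NFD') === decomposed); // true:  NFD decomposes
console.log(decomposed.normalize('NFC') === precomposed); // true:  NFC composes

// Hypothetical helper: drop combining marks after decomposing, e.g. to build
// sort keys for languages where modifiers do not affect alphabetical order.
function stripMarks(text) {
  return text.normalize('NFD').replace(/\p{M}/gu, '');
}

console.log(stripMarks('M\u00FCller')); // "Muller"
```

Comparing strings only after normalizing both sides to the same form avoids treating the two spellings of "ü" as different text.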

Myths of Unicode

From Mark Davis's Unicode Myths slides.

  • Unicode is simply a 16-bit code - Some people are under the misconception that Unicode is simply a 16-bit code where each character takes 16 bits and therefore there are 65,536 possible characters. This is not, actually, correct. It is the single most common myth about Unicode, so if you thought that, don't feel bad.

  • You can use any unassigned codepoint for internal use - No. Eventually that hole will be filled with a different character. Instead use private use or noncharacters.

  • Every Unicode code point represents a character - No. There are lots of noncharacters (FFFE, FFFF, 1FFFE, …). There are also surrogate code points, private-use and unassigned code points, and control/format "characters" (RLM, ZWNJ, …).

  • Unicode will run out of space - If it were linear, we would run out in 2140 AD. But it isn't linear. See http://www.unicode.org/roadmaps/

  • Case mappings are 1-1 - No. They can also be:

    • One-to-many: (ß → SS )
    • Contextual: (…Σ ↔ …ς AND …ΣΤ… ↔ …στ… )
    • Locale-sensitive: ( I ↔ ı AND İ ↔ i )
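
To make the "every code point represents a character" myth concrete, the sketch below classifies two of the special kinds of code points mentioned in the list. The function names are made up for this example. The ranges used are the 32 noncharacters U+FDD0 to U+FDEF, the last two code points of each plane, and the surrogate range U+D800 to U+DFFF.

```javascript
// Hypothetical helper: true for the code points permanently reserved as noncharacters.
function isNoncharacter(codePoint) {
  return (codePoint >= 0xFDD0 && codePoint <= 0xFDEF) || // contiguous range in the BMP
         (codePoint & 0xFFFE) === 0xFFFE;                // xFFFE and xFFFF of every plane
}

// Hypothetical helper: true for surrogate code points, which only have meaning in UTF-16.
function isSurrogate(codePoint) {
  return codePoint >= 0xD800 && codePoint <= 0xDFFF;
}

// Count the noncharacters across the whole code space.
let noncharacterCount = 0;
for (let cp = 0; cp <= 0x10FFFF; cp++) {
  if (isNoncharacter(cp)) noncharacterCount++;
}

console.log(noncharacterCount);       // 66 (32 in the BMP range plus 2 per plane)
console.log(isNoncharacter(0x1FFFE)); // true
console.log(isSurrogate(0xD83D));     // true
```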

Applied Unicode Encodings

| Encoding Type | Raw Encoding (🖖, U+1F596) |
| --- | --- |
| HTML Entity (Decimal) | `&#128406;` |
| HTML Entity (Hexadecimal) | `&#x1f596;` |
| URL Escape Code | `%F0%9F%96%96` |
| UTF-8 (hex) | 0xF0 0x9F 0x96 0x96 (f09f9696) |
| UTF-8 (binary) | 11110000:10011111:10010110:10010110 |
| UTF-16/UTF-16BE (hex) | 0xD83D 0xDD96 (d83ddd96) |
| UTF-16LE (hex) | 0x3DD8 0x96DD (3dd896dd) |
| UTF-32/UTF-32BE (hex) | 0x0001F596 (0001f596) |
| UTF-32LE (hex) | 0x96F50100 (96f50100) |
| Octal Escape Sequence | `\360\237\226\226` |
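
Several rows of the table above can be reproduced programmatically. The sketch below is one possible way to do it, not a reference implementation: `describeEncodings` is a hypothetical helper, and it assumes a global `TextEncoder` is available.

```javascript
// Hypothetical helper: compute a few encodings of a single code point.
function describeEncodings(codePoint) {
  const ch = String.fromCodePoint(codePoint);
  const hex = (n, width) => n.toString(16).padStart(width, '0');
  return {
    htmlDecimal: '&#' + codePoint + ';',
    htmlHex: '&#x' + codePoint.toString(16) + ';',
    urlEscape: encodeURIComponent(ch),
    // UTF-8 bytes, as produced by TextEncoder
    utf8: Array.from(new TextEncoder().encode(ch), b => hex(b, 2)).join(' '),
    // JavaScript strings are sequences of UTF-16 code units
    utf16be: ch.split('').map(u => hex(u.charCodeAt(0), 4)).join(' '),
    utf32be: hex(codePoint, 8)
  };
}

console.log(describeEncodings(0x1F596));
// utf8: "f0 9f 96 96", utf16be: "d83d dd96", utf32be: "0001f596",
// urlEscape: "%F0%9F%96%96", htmlDecimal: "&#128406;", htmlHex: "&#x1f596;"
```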

Source Code

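| Encoding Type | Raw Encoding |
| --- | --- |
| JavaScript | `\u{1F596}` (ES6) or `\uD83D\uDD96` |
| JSON | `\uD83D\uDD96` |
| C | `\U0001F596` |
| C++ | `\U0001F596` |
| Java | `\uD83D\uDD96` |
| Python | `\U0001F596` |
| Perl | `\x{1F596}` |
| Ruby | `\u{1F596}` |
| CSS | `\01F596` |

Note that a plain `\u` escape takes exactly four hexadecimal digits in these languages, so `\u1F596` would be read as U+1F59 followed by the digit 6.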

Awesome Characters List

[![](http://imgs.xkcd.com/comics/rtl.png )](https://xkcd.com/1137/)

Special Characters

The Unicode Consortium published a general punctuation chart where you can find more details.

| Char | Name | Description |
| --- | --- | --- |
| '' | U+FEFF (Byte Order Mark - BOM) | Has the important property of unambiguity on byte reorder. It is also zero-width and invisible. In non-complying software (like the PHP interpreter) this leads to all sorts of fun behaviour. |
| '\uFFFE' | U+FFFE Reversed Byte Order Mark (BOM) | Does not equate to a legal character; it only shows up when a BOM at the beginning of text is read with the wrong byte order. |
| '​' | U+200B ZERO WIDTH SPACE | A character with no appearance that marks a possible line-break position. Not to be confused with U+200C ZERO WIDTH NON-JOINER, which has no effect other than preventing the formation of ligatures. |
| ' ' | U+00A0 NO-BREAK SPACE | Forces adjacent characters to stick together. Well known as `&nbsp;` in HTML. |
| '­' | U+00AD SOFT HYPHEN | (In HTML: `&shy;`) Like ZERO WIDTH SPACE, but shows a hyphen if (and only if) a break occurs. |
| '‍' | U+200D ZERO WIDTH JOINER | Forces adjacent characters to be joined together (e.g., Arabic characters or supported emoji). Can be used to compose sequentially combined emoji. |
| '⁠' | U+2060 WORD JOINER | The same as U+00A0, but completely invisible. Good for writing @font-face on Twitter. |
| ' ' | U+1680 OGHAM SPACE MARK | A space that looks like a dash. Great to bring programmers close to madness: 1 +  2 === 3. |
| ';' | U+037E GREEK QUESTION MARK | A look-alike to the semicolon. Also a fun way to annoy developers. |
| '‭' | U+202D LEFT-TO-RIGHT OVERRIDE | Changes the text direction to left-to-right. |
| '‮'‭ ‭ | U+202E RIGHT-TO-LEFT OVERRIDE | Changes the text direction to right-to-left. |
| 'ꓸ' | U+A4F8 LISU LETTER TONE MYA TI | A lookalike for the period character. |
| 'ꓹ' | U+A4F9 LISU LETTER TONE NA PO | A lookalike for the comma character. |
| 'ꓼ' | U+A4FC LISU LETTER TONE MYA NA | A lookalike for the semicolon character. |
| 'ꓽ' | U+A4FD LISU LETTER TONE MYA JEU | A lookalike for the colon character. |
| '︀' | Variation Selectors (U+FE00 to U+FE0F and U+E0100 to U+E01EF) | A set of 256 zero-width characters that possess the ID_Continue property, meaning they can be used in variable names (not as the first letter). What makes these special is the fact that mouse cursors pass over them as they are combining characters, unlike most other zero-width characters. |
| 'ᅟ' | U+115F HANGUL CHOSEONG FILLER | In general it produces a space. Rendered as zero width (invisible) if not explicitly supported in rendering. Designated ID_Start. |
| 'ᅠ' | U+1160 HANGUL JUNGSEONG FILLER | Perhaps it produces a space? Rendered as zero width (invisible) if not explicitly supported in rendering. Designated ID_Start. |
| 'ㅤ' | U+3164 HANGUL FILLER | In general it produces a space. Rendered as zero width (invisible) if not explicitly supported in rendering. Designated ID_Start. |


Wait a second... what did I just read?



Variable identifiers can effectively include whitespace!

The U+3164 HANGUL FILLER character displays as an advancing whitespace character. The character is rendered as completely invisible (and non-advancing, i.e. "zero width") if not explicitly supported in rendering. That means the ugly replacement character (�) should never be displayed.

I'm not yet sure why U+3164 was specified to behave this way. Interestingly, U+3164 was added to Unicode in version 1.1 (1993), so the consortium must have had a lot of time to think it through. Anyway, here are a few examples.

// In the first two examples the variable name is the single, invisible character U+3164 HANGUL FILLER
> var ㅤ = 'foo';
undefined
> ㅤ
'foo'

> var ㅤ = alert;
undefined
> var foo = 'bar'
undefined
> if ( foo === ㅤ`baz` ){} // alert
undefined

> var varㅤfooㅤ\u{A60C}ㅤπ = 'bar';
undefined
> varㅤfooㅤꘌㅤπ
'bar'


**NOTE:** I've tested U+3164 rendering on Ubuntu and OS X with the following: `node`, `php`, `ruby`, `python3.5`, `scala` ,`vim`, `cat`, `chrome`+`github gist`. Atom is the only system that fails by (incorrectly) displaying empty boxes. I have yet to test it out on Emacs and Sublime. From what I understand, the Unicode Consortium will not reassign or rename characters or codepoints, but may be convinced to change character properties like ID_Start/ID_Continue.

Modifiers

The zero-width joiner (ZWJ) is a non-printing character used in the computerized typesetting of some complex scripts such as the Arabic script or any Indic script. When placed between two characters that would otherwise not be connected, a ZWJ causes them to be printed in their connected forms.

The zero-width non-joiner (ZWNJ) is a non-printing character used in the computerization of writing systems that make use of ligatures. When placed between two characters that would otherwise be connected into a ligature, a ZWNJ causes them to be printed in their final and initial forms, respectively. This is also an effect of a space character, but a ZWNJ is used when it is desirable to keep the words closer together or to connect a word with its morpheme.

> 'a'
 "a"

> 'a\u{0308}'
 "ä"

> 'a\u{20DE}\u{0308}'
 "a⃞̈"

> 'a\u{20DE}\u{0308}\u{20DD}'
 "a⃞̈⃝"

// Modifying Invisible Characters
> '\u{200E}\u{200E}\u{200E}\u{200E}\u{200E}\u{200E}\u{200E}\u{200E}\u{200E}\u{200E}'
 "‎‎‎‎‎‎‎‎‎‎"

> '\u{200E}\u{200E}\u{200E}\u{200E}\u{200E}\u{200E}\u{200E}\u{200E}\u{200E}\u{200E}'.length
 10

💥 Uppercase Transformation Collisions

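| Char | Code Point | Output Char |
| --- | --- | --- |
| ß | 0x00DF | SS |
| ı | 0x0131 | I |
| ſ | 0x017F | S |
| ﬀ | 0xFB00 | FF |
| ﬁ | 0xFB01 | FI |
| ﬂ | 0xFB02 | FL |
| ﬃ | 0xFB03 | FFI |
| ﬄ | 0xFB04 | FFL |
| ﬅ | 0xFB05 | ST |
| ﬆ | 0xFB06 | ST |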

💥 Lowercase Transformation Collisions

| Char | Code Point | Output Char |
| --- | --- | --- |
| K | 0x212A | k |
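
The collisions in the two tables above are easy to observe with JavaScript's built-in, locale-independent case conversion. This is a small demonstration only; `roundTrips` is a hypothetical helper name.

```javascript
// Uppercasing can change the length of a string and merge distinct characters.
console.log('\u00DF'.toUpperCase()); // "SS"  (ß)
console.log('\uFB03'.toUpperCase()); // "FFI" (ﬃ ligature)
console.log('\u0131'.toUpperCase()); // "I"   (dotless ı)
console.log('\u017F'.toUpperCase()); // "S"   (long s ſ)

// Lowercasing U+212A KELVIN SIGN yields an ordinary Latin "k".
console.log('\u212A'.toLowerCase() === 'k'); // true

// Hypothetical helper: does a string survive an upper/lower round trip?
function roundTrips(text) {
  return text.toUpperCase().toLowerCase() === text;
}

console.log(roundTrips('stra\u00DFe')); // false: "straße" comes back as "strasse"
console.log(roundTrips('street'));      // true
```

Because of these mappings, comparing `a.toUpperCase() === b.toUpperCase()` can report two visually different strings as equal.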

Quirks and Troubleshooting

  • String length is typically determined by counting code units rather than characters. In UTF-16 based languages such as JavaScript, this means that a surrogate pair counts as two. Combining marks add to the length as well: multiple diacritics may be stacked over the same character, so a + ̈ == ä has a length of two while producing only a single visible character.

  • Similarly, reversing strings is often a non-trivial task. Again, surrogate pairs and diacritics must be reversed together. ES Reverser provides a pretty good solution.

  • Upper and lower case mappings are not always one-to-one. They can also be:

    • One-to-many: (ß → SS )
    • Contextual: (…Σ ↔ …ς AND …ΣΤ… ↔ …στ… )
    • Locale-sensitive: ( I ↔ ı AND İ ↔ i )
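
The length and reversal quirks above can be seen in a few lines of JavaScript. The three reverse functions are hypothetical helpers written for this sketch, not the ES Reverser API. `graphemeReverse` additionally assumes that `Intl.Segmenter` is available, which is not the case in older engines.

```javascript
const poo = '\u{1F4A9}';     // one code point outside the BMP
const accented = 'a\u0308';  // "a" followed by COMBINING DIAERESIS

console.log(poo.length);                       // 2 (UTF-16 code units)
console.log(Array.from(poo).length);           // 1 (code points)
console.log(accented.length);                  // 2 (two code points, one visible character)
console.log(accented.normalize('NFC').length); // 1 (precomposed "ä")

// Reverses code units: splits surrogate pairs apart.
function naiveReverse(text) {
  return text.split('').reverse().join('');
}

// Reverses code points: keeps surrogate pairs, but still detaches combining marks.
function codePointReverse(text) {
  return Array.from(text).reverse().join('');
}

// Reverses grapheme clusters: keeps base characters and their marks together.
function graphemeReverse(text) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  return Array.from(segmenter.segment(text), part => part.segment).reverse().join('');
}

console.log(naiveReverse('a' + poo + 'b') === 'b' + poo + 'a');     // false
console.log(codePointReverse('a' + poo + 'b') === 'b' + poo + 'a'); // true
```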

One-To-Many Case Mappings

Most of the characters below express their one-to-many case mappings when uppercased, while others do so when lowercased. This list should be split up.

Code PointCharacterNameMapped CharacterMapped Code Points
U+00DFßLATIN SMALL LETTER SHARP Ss, sU+0073, U+0073
U+0130İLATIN CAPITAL LETTER I WITH DOT ABOVEi, ̇U+0069, U+0307
U+0149ʼnLATIN SMALL LETTER N PRECEDED BY APOSTROPHEʼ, nU+02BC, U+006E
U+01F0ǰLATIN SMALL LETTER J WITH CARONj, ̌U+006A, U+030C
U+0390ΐGREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOSι, ̈, ́U+03B9, U+0308, U+0301
U+03B0ΰGREEK SMALL LETTER UPSILON WITH DIALYTIKA AND TONOSυ, ̈, ́U+03C5, U+0308, U+0301
U+0587ևARMENIAN SMALL LIGATURE ECH YIWNե, ւU+0565, U+0582
U+1E96LATIN SMALL LETTER H WITH LINE BELOWh, ̱U+0068, U+0331
U+1E97LATIN SMALL LETTER T WITH DIAERESISt, ̈U+0074, U+0308
U+1E98LATIN SMALL LETTER W WITH RING ABOVEw, ̊U+0077, U+030A
U+1E99LATIN SMALL LETTER Y WITH RING ABOVEy, ̊U+0079, U+030A
U+1E9ALATIN SMALL LETTER A WITH RIGHT HALF RINGa, ʾU+0061, U+02BE
U+1E9ELATIN CAPITAL LETTER SHARP Ss, sU+0073, U+0073
U+1F50GREEK SMALL LETTER UPSILON WITH PSILIυ, ̓U+03C5, U+0313
U+1F52GREEK SMALL LETTER UPSILON WITH PSILI AND VARIAυ, ̓, ̀U+03C5, U+0313, U+0300
U+1F54GREEK SMALL LETTER UPSILON WITH PSILI AND OXIAυ, ̓, ́U+03C5, U+0313, U+0301
U+1F56GREEK SMALL LETTER UPSILON WITH PSILI AND PERISPOMENIυ, ̓, ͂U+03C5, U+0313, U+0342
U+1F80GREEK SMALL LETTER ALPHA WITH PSILI AND YPOGEGRAMMENI, ιU+1F00, U+03B9
U+1F81GREEK SMALL LETTER ALPHA WITH DASIA AND YPOGEGRAMMENI, ιU+1F01, U+03B9
U+1F82GREEK SMALL LETTER ALPHA WITH PSILI AND VARIA AND YPOGEGRAMMENI, ιU+1F02, U+03B9
U+1F83GREEK SMALL LETTER ALPHA WITH DASIA AND VARIA AND YPOGEGRAMMENI, ιU+1F03, U+03B9
U+1F84GREEK SMALL LETTER ALPHA WITH PSILI AND OXIA AND YPOGEGRAMMENI, ιU+1F04, U+03B9
U+1F85GREEK SMALL LETTER ALPHA WITH DASIA AND OXIA AND YPOGEGRAMMENI, ιU+1F05, U+03B9
U+1F86GREEK SMALL LETTER ALPHA WITH PSILI AND PERISPOMENI AND YPOGEGRAMMENI, ιU+1F06, U+03B9
U+1F87GREEK SMALL LETTER ALPHA WITH DASIA AND PERISPOMENI AND YPOGEGRAMMENI, ιU+1F07, U+03B9
U+1F88GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI, ιU+1F00, U+03B9
U+1F89GREEK CAPITAL LETTER ALPHA WITH DASIA AND PROSGEGRAMMENI, ιU+1F01, U+03B9
U+1F8AGREEK CAPITAL LETTER ALPHA WITH PSILI AND VARIA AND PROSGEGRAMMENI, ιU+1F02, U+03B9
U+1F8BGREEK CAPITAL LETTER ALPHA WITH DASIA AND VARIA AND PROSGEGRAMMENI, ιU+1F03, U+03B9
U+1F8CGREEK CAPITAL LETTER ALPHA WITH PSILI AND OXIA AND PROSGEGRAMMENI, ιU+1F04, U+03B9
U+1F8DGREEK CAPITAL LETTER ALPHA WITH DASIA AND OXIA AND PROSGEGRAMMENI, ιU+1F05, U+03B9
U+1F8EGREEK CAPITAL LETTER ALPHA WITH PSILI AND PERISPOMENI AND PROSGEGRAMMENI, ιU+1F06, U+03B9
U+1F8FGREEK CAPITAL LETTER ALPHA WITH DASIA AND PERISPOMENI AND PROSGEGRAMMENI, ιU+1F07, U+03B9
U+1F90GREEK SMALL LETTER ETA WITH PSILI AND YPOGEGRAMMENI, ιU+1F20, U+03B9
U+1F91GREEK SMALL LETTER ETA WITH DASIA AND YPOGEGRAMMENI, ιU+1F21, U+03B9
U+1F92GREEK SMALL LETTER ETA WITH PSILI AND VARIA AND YPOGEGRAMMENI, ιU+1F22, U+03B9
U+1F93GREEK SMALL LETTER ETA WITH DASIA AND VARIA AND YPOGEGRAMMENI, ιU+1F23, U+03B9
U+1F94GREEK SMALL LETTER ETA WITH PSILI AND OXIA AND YPOGEGRAMMENI, ιU+1F24, U+03B9
U+1F95GREEK SMALL LETTER ETA WITH DASIA AND OXIA AND YPOGEGRAMMENI, ιU+1F25, U+03B9
U+1F96GREEK SMALL LETTER ETA WITH PSILI AND PERISPOMENI AND YPOGEGRAMMENI, ιU+1F26, U+03B9
U+1F97GREEK SMALL LETTER ETA WITH DASIA AND PERISPOMENI AND YPOGEGRAMMENI, ιU+1F27, U+03B9
U+1F98GREEK CAPITAL LETTER ETA WITH PSILI AND PROSGEGRAMMENI, ιU+1F20, U+03B9
U+1F99GREEK CAPITAL LETTER ETA WITH DASIA AND PROSGEGRAMMENI, ιU+1F21, U+03B9
U+1F9AGREEK CAPITAL LETTER ETA WITH PSILI AND VARIA AND PROSGEGRAMMENI, ιU+1F22, U+03B9
U+1F9BGREEK CAPITAL LETTER ETA WITH DASIA AND VARIA AND PROSGEGRAMMENI, ιU+1F23, U+03B9
U+1F9CGREEK CAPITAL LETTER ETA WITH PSILI AND OXIA AND PROSGEGRAMMENI, ιU+1F24, U+03B9
U+1F9DGREEK CAPITAL LETTER ETA WITH DASIA AND OXIA AND PROSGEGRAMMENI, ιU+1F25, U+03B9
U+1F9EGREEK CAPITAL LETTER ETA WITH PSILI AND PERISPOMENI AND PROSGEGRAMMENI, ιU+1F26, U+03B9
U+1F9FGREEK CAPITAL LETTER ETA WITH DASIA AND PERISPOMENI AND PROSGEGRAMMENI, ιU+1F27, U+03B9
U+1FA0GREEK SMALL LETTER OMEGA WITH PSILI AND YPOGEGRAMMENI, ιU+1F60, U+03B9
U+1FA1GREEK SMALL LETTER OMEGA WITH DASIA AND YPOGEGRAMMENI, ιU+1F61, U+03B9
U+1FA2GREEK SMALL LETTER OMEGA WITH PSILI AND VARIA AND YPOGEGRAMMENI, ιU+1F62, U+03B9
U+1FA3GREEK SMALL LETTER OMEGA WITH DASIA AND VARIA AND YPOGEGRAMMENI, ιU+1F63, U+03B9
U+1FA4GREEK SMALL LETTER OMEGA WITH PSILI AND OXIA AND YPOGEGRAMMENI, ιU+1F64, U+03B9
U+1FA5GREEK SMALL LETTER OMEGA WITH DASIA AND OXIA AND YPOGEGRAMMENI, ιU+1F65, U+03B9
U+1FA6GREEK SMALL LETTER OMEGA WITH PSILI AND PERISPOMENI AND YPOGEGRAMMENI, ιU+1F66, U+03B9
U+1FA7GREEK SMALL LETTER OMEGA WITH DASIA AND PERISPOMENI AND YPOGEGRAMMENI, ιU+1F67, U+03B9
U+1FA8GREEK CAPITAL LETTER OMEGA WITH PSILI AND PROSGEGRAMMENI, ιU+1F60, U+03B9
U+1FA9GREEK CAPITAL LETTER OMEGA WITH DASIA AND PROSGEGRAMMENI, ιU+1F61, U+03B9
U+1FAAGREEK CAPITAL LETTER OMEGA WITH PSILI AND VARIA AND PROSGEGRAMMENI, ιU+1F62, U+03B9
U+1FABGREEK CAPITAL LETTER OMEGA WITH DASIA AND VARIA AND PROSGEGRAMMENI, ιU+1F63, U+03B9
U+1FACGREEK CAPITAL LETTER OMEGA WITH PSILI AND OXIA AND PROSGEGRAMMENI, ιU+1F64, U+03B9
U+1FADGREEK CAPITAL LETTER OMEGA WITH DASIA AND OXIA AND PROSGEGRAMMENI, ιU+1F65, U+03B9
U+1FAEGREEK CAPITAL LETTER OMEGA WITH PSILI AND PERISPOMENI AND PROSGEGRAMMENI, ιU+1F66, U+03B9
U+1FAFGREEK CAPITAL LETTER OMEGA WITH DASIA AND PERISPOMENI AND PROSGEGRAMMENI, ιU+1F67, U+03B9
U+1FB2GREEK SMALL LETTER ALPHA WITH VARIA AND YPOGEGRAMMENI, ιU+1F70, U+03B9
U+1FB3GREEK SMALL LETTER ALPHA WITH YPOGEGRAMMENIα, ιU+03B1, U+03B9
U+1FB4GREEK SMALL LETTER ALPHA WITH OXIA AND YPOGEGRAMMENIά, ιU+03AC, U+03B9
U+1FB6GREEK SMALL LETTER ALPHA WITH PERISPOMENIα, ͂U+03B1, U+0342
U+1FB7GREEK SMALL LETTER ALPHA WITH PERISPOMENI AND YPOGEGRAMMENIα, ͂, ιU+03B1, U+0342, U+03B9
U+1FBCGREEK CAPITAL LETTER ALPHA WITH PROSGEGRAMMENIα, ιU+03B1, U+03B9
U+1FC2GREEK SMALL LETTER ETA WITH VARIA AND YPOGEGRAMMENI, ιU+1F74, U+03B9
U+1FC3GREEK SMALL LETTER ETA WITH YPOGEGRAMMENIη, ιU+03B7, U+03B9
U+1FC4GREEK SMALL LETTER ETA WITH OXIA AND YPOGEGRAMMENIή, ιU+03AE, U+03B9
U+1FC6GREEK SMALL LETTER ETA WITH PERISPOMENIη, ͂U+03B7, U+0342
U+1FC7GREEK SMALL LETTER ETA WITH PERISPOMENI AND YPOGEGRAMMENIη, ͂, ιU+03B7, U+0342, U+03B9
U+1FCCGREEK CAPITAL LETTER ETA WITH PROSGEGRAMMENIη, ιU+03B7, U+03B9
U+1FD2GREEK SMALL LETTER IOTA WITH DIALYTIKA AND VARIAι, ̈, ̀U+03B9, U+0308, U+0300
U+1FD3GREEK SMALL LETTER IOTA WITH DIALYTIKA AND OXIAι, ̈, ́U+03B9, U+0308, U+0301
U+1FD6GREEK SMALL LETTER IOTA WITH PERISPOMENIι, ͂U+03B9, U+0342
U+1FD7GREEK SMALL LETTER IOTA WITH DIALYTIKA AND PERISPOMENIι, ̈, ͂U+03B9, U+0308, U+0342
U+1FE2GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND VARIAυ, ̈, ̀U+03C5, U+0308, U+0300
U+1FE3GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND OXIAυ, ̈, ́U+03C5, U+0308, U+0301
U+1FE4GREEK SMALL LETTER RHO WITH PSILIρ, ̓U+03C1, U+0313
U+1FE6GREEK SMALL LETTER UPSILON WITH PERISPOMENIυ, ͂U+03C5, U+0342
U+1FE7GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND PERISPOMENIυ, ̈, ͂U+03C5, U+0308, U+0342
U+1FF2GREEK SMALL LETTER OMEGA WITH VARIA AND YPOGEGRAMMENI, ιU+1F7C, U+03B9
U+1FF3GREEK SMALL LETTER OMEGA WITH YPOGEGRAMMENIω, ιU+03C9, U+03B9
U+1FF4GREEK SMALL LETTER OMEGA WITH OXIA AND YPOGEGRAMMENIώ, ιU+03CE, U+03B9
U+1FF6GREEK SMALL LETTER OMEGA WITH PERISPOMENIω, ͂U+03C9, U+0342
U+1FF7GREEK SMALL LETTER OMEGA WITH PERISPOMENI AND YPOGEGRAMMENIω, ͂, ιU+03C9, U+0342, U+03B9
U+1FFCGREEK CAPITAL LETTER OMEGA WITH PROSGEGRAMMENIω, ιU+03C9, U+03B9
U+FB00LATIN SMALL LIGATURE FFf, fU+0066, U+0066
U+FB01LATIN SMALL LIGATURE FIf, iU+0066, U+0069
U+FB02LATIN SMALL LIGATURE FLf, lU+0066, U+006C
U+FB03LATIN SMALL LIGATURE FFIf, f, iU+0066, U+0066, U+0069
U+FB04LATIN SMALL LIGATURE FFLf, f, lU+0066, U+0066, U+006C
U+FB05LATIN SMALL LIGATURE LONG S Ts, tU+0073, U+0074
U+FB06LATIN SMALL LIGATURE STs, tU+0073, U+0074
U+FB13ARMENIAN SMALL LIGATURE MEN NOWմ, նU+0574, U+0576
U+FB14ARMENIAN SMALL LIGATURE MEN ECHմ, եU+0574, U+0565
U+FB15ARMENIAN SMALL LIGATURE MEN INIմ, իU+0574, U+056B
U+FB16ARMENIAN SMALL LIGATURE VEW NOWվ, նU+057E, U+0576
U+FB17ARMENIAN SMALL LIGATURE MEN XEHմ, խU+0574, U+056D

Awesome Packages & Libraries

  • PhantomScript - 👻 🔦 Invisible JavaScript code execution & social engineering
  • ESReverser - A Unicode-aware string reverser written in JavaScript.
  • mimic - [ab]using Unicode to create tragedy
  • python-ftfy - Given Unicode text, make its representation consistent and possibly less broken.
  • vim-troll-stopper - Stop Unicode trolls from messing with your code.

Emojis

Diversity

The Unicode Consortium has made a huge effort to better reflect and incorporate human diversity, including cultural practices. Here is the Consortium's diversity report.

Emojis of mixed gender situations are now available, such as same sex families, holding hands, and kissing. The real kicker is Emoji combined sequences. Basically:

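| Code Points | Recipe | Combined |
| --- | --- | --- |
| U+1F469 U+200D U+2764 U+FE0F U+200D U+1F469 | 👩 + ZWJ + ❤️ + ZWJ + 👩 | 👩‍❤️‍👩 (couple with heart: woman, woman) |
| U+1F468 U+200D U+1F468 U+200D U+1F467 U+200D U+1F466 | 👨 + ZWJ + 👨 + ZWJ + 👧 + ZWJ + 👦 | 👨‍👨‍👧‍👦 |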

Further, emojis now support skin color modifiers.

Five symbol modifier characters that provide for a range of skin tones for human emoji were released in Unicode Version 8.0 (mid-2015). These characters are based on the six tones of the Fitzpatrick scale, a recognized standard for dermatology (there are many examples of this scale online, such as FitzpatrickSkinType.pdf). The exact shades may vary between implementations. -- Unicode Consortium's Diversity report

| Code | Name | Samples |
| --- | --- | --- |
| U+1F3FB | EMOJI MODIFIER FITZPATRICK TYPE-1-2 | 🏻 |
| U+1F3FC | EMOJI MODIFIER FITZPATRICK TYPE-3 | 🏼 |
| U+1F3FD | EMOJI MODIFIER FITZPATRICK TYPE-4 | 🏽 |
| U+1F3FE | EMOJI MODIFIER FITZPATRICK TYPE-5 | 🏾 |
| U+1F3FF | EMOJI MODIFIER FITZPATRICK TYPE-6 | 🏿 |

Just follow the desired emoji with one of the skin color modifiers, e.g. \u{1F466}\u{1F3FE}.

👦 + 🏾 → 👦🏾
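
Both kinds of sequences from this section, ZWJ-joined emoji and emoji followed by a skin tone modifier, are ordinary string concatenations. The sketch below builds them and lists their code points; `toCodePoints` is a hypothetical helper, and whether the result renders as a single glyph depends on the font and platform.

```javascript
const ZWJ = '\u200D'; // ZERO WIDTH JOINER

// Hypothetical helper: list the code points of a string in U+XXXX notation.
function toCodePoints(text) {
  return Array.from(text, ch =>
    'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0')
  ).join(' ');
}

// Skin tone: base emoji immediately followed by a modifier.
const tonedBoy = '\u{1F466}' + '\u{1F3FE}';

// Family: man, man, girl, boy joined by ZWJ.
const family = ['\u{1F468}', '\u{1F468}', '\u{1F467}', '\u{1F466}'].join(ZWJ);

console.log(toCodePoints(tonedBoy)); // U+1F466 U+1F3FE
console.log(toCodePoints(family));   // U+1F468 U+200D U+1F468 U+200D U+1F467 U+200D U+1F466
console.log(family.length);          // 11 UTF-16 code units for what may display as one glyph
```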

Creatively Naming Variables and Methods

Examples are written in JavaScript (ES6)

In general, characters with the ID_Start property may be used at the beginning of a variable name. Characters with the ID_Continue property may be used after the first character of a variable name.
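
One way to check these properties yourself is with Unicode property escapes in a regular expression. The sketch below is an approximation under stated assumptions: `isIdentifierName` is a hypothetical helper, it needs an ES2018+ engine, it follows the JavaScript rule that `$`, `_`, ZWNJ and ZWJ are also allowed, and it ignores reserved words and `\u` escapes. Results for individual characters depend on the Unicode version the engine ships with.

```javascript
const ID_START = /^[$_\p{ID_Start}]$/u;
const ID_CONTINUE = /^[$\u200C\u200D\p{ID_Continue}]$/u;

// Hypothetical helper: rough test for "could this be a JavaScript identifier name?"
function isIdentifierName(name) {
  const chars = Array.from(name); // iterate by code point, not by UTF-16 code unit
  return chars.length > 0 &&
         ID_START.test(chars[0]) &&
         chars.slice(1).every(ch => ID_CONTINUE.test(ch));
}

console.log(isIdentifierName('\u03C0'));       // true:  π
console.log(isIdentifierName('\u0CA0_\u0CA0')); // true:  ಠ_ಠ
console.log(isIdentifierName('foo\u200Cbar'));  // true:  ZWNJ is allowed after the first character
console.log(isIdentifierName('1abc'));          // false: digits are ID_Continue but not ID_Start
console.log(isIdentifierName('a-b'));           // false: hyphen is neither
```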

function rand(μ,σ){ ... };

String.prototype.reverse= function(){..};

Number.prototype.isTrueɁ = function(){..};

var WhatDoesThisDoɁɁɁɁ = 42

Here are some really creative variable names from Mathias Bynens

// How convenient!
var π = Math.PI;

// Sometimes, you just have to use the Bad Parts of JavaScript:
var ಠ_ಠ = eval;

// Code, Y U NO WORK?!
var ಠ益ಠ = 42;

// How about a JavaScript library for functional programming?
var λ = function() {};

// Obfuscate boring variable names for great justice
var \u006C\u006F\u006C\u0077\u0061\u0074 = 'heh';

// …or just make up random ones
var Ꙭൽↈⴱ = 'huh';

// While perfectly valid, this doesn’t work in most browsers:
var foo\u200Cbar = 42;

// This is not a bitwise left shift (<<):
var 〱〱 = 2;
// This is, though:
〱〱 << 〱〱; // 8

// Give yourself a discount:
var price9̶89 = 'cheap';

// Fun with Roman numerals
var Ⅳ = 4;
var Ⅴ = 5;
Ⅳ + Ⅴ; // 9

// Cthulhu was here var Hͫ̆̒̐ͣ̊̄ͯ͗͏̵̗̻̰̠̬͝ͅE̴̷̬͎̱̘͇͍̾ͦ͊͒͊̓̓̐_̫̠̱̩̭̤͈̑̎̋ͮͩ̒͑̾͋͘Ç̳͕̯̭̱̲̣̠̜͋̍O̴̦̗̯̹̼ͭ̐ͨ̊̈͘͠M̶̝̠̭̭̤̻͓͑̓̊ͣͤ̎͟͠E̢̞̮̹͍̞̳̣ͣͪ͐̈T̡̯̳̭̜̠͕͌̈́̽̿ͤ̿̅̑Ḧ̱̱̺̰̳̹̘̰́̏ͪ̂̽͂̀͠ = 'Zalgo';

And here are some Unicode CSS classes from David Walsh

<!-- place this within the document head -->
<meta charset="UTF-8" />

<!-- error message -->
<div class="ಠ_ಠ">You do not have access to this page.</div>

<!-- success message -->
<div class="❤">Your changes have been saved successfully!</div>

.ಠ_ಠ {
	border: 1px solid #f00;
}

.❤ {
	background: lightgreen;
}

Recursive HTML Tag Renaming Script

If you want to rename all your HTML tags to what appears as nothing, the following script is just what you're looking for.

Do note, however, that HTML does not support all Unicode characters.

// U+1160 HANGUL JUNGSEONG FILLER (written as an escape here because the character is invisible)
transformAllTags('\u{1160}');

// An actual HTML element node designed to look like a comment node, using the U+01C3 LATIN LETTER RETROFLEX CLICK
// <ǃ-- name="viewport" content="width=device-width"></ǃ-->
transformAllTags('ǃ--');

// or even <ᅠ⃝
transformAllTags('\u{1160}\u{20dd}');

// and for a bonus, all existing tag names will have each character ensquared. h⃞t⃞m⃞l⃞
transformAllTags();

function transformAllTags (newName){
  // querySelectorAll doesn't actually return an array.
  Array.from(document.querySelectorAll('*'))
    .forEach(function(x){
      transformTag(x, newName);
    });
}

function wonky(str){
  return str.split('').join('\u{20de}') + '\u{20de}';
}

function transformTag(tagIdOrElem, tagType){
  var elem = (tagIdOrElem instanceof HTMLElement) ? tagIdOrElem : document.getElementById(tagIdOrElem);
  if(!elem || !(elem instanceof HTMLElement)) return;
  var children = elem.childNodes;
  var parent = elem.parentNode;
  var newNode = document.createElement(tagType||wonky(elem.tagName));
  // Copy every attribute over to the replacement element
  for(var a=0;a<elem.attributes.length;a++){
    newNode.setAttribute(elem.attributes[a].nodeName, elem.attributes[a].value);
  }
  for(var i= 0,clen=children.length;i<clen;i++){
    newNode.appendChild(children[0]); //0...always point to the first non-moved element
  }
  newNode.style.cssText = elem.style.cssText;
  parent.replaceChild(newNode,elem);
}

Here is what it does support:

function testBegin(str){
 try{
    eval(`document.createElement( '${str}' );`)
    return true;
 }
 catch(e){ return false; }
}

function testContinue(str){
 try{
    eval(`document.createElement( 'a${str}' );`)
    return true;
 }
 catch(e){ return false; }
}

And here are some basic results

// Test if dashes can start an HTML Tag
> testBegin('-')
< false

> testContinue('-')
< true

> testBegin('ᅠ-') // Prepend dash with U+1160 HANGUL JUNGSEONG FILLER
< true

Unicode Fonts

A single TrueType / OpenType font cannot cover all Unicode characters, as there is a hard limit of 65,535 glyphs in a font. Since the Unicode code space has room for over 1.1 million code points, you will need to use a font-family (a fallback list of several fonts) to cover them all.

More Reading

Exploring Deeper into Unicode Yourself

Overview Map

A map of the Basic Multilingual Plane

Each numbered box represents 256 code points.


The Chinese, Japanese and Korean (CJK) scripts share a common background, collectively known as CJK characters. In the process called Han unification, the common (shared) characters were identified and named "CJK Unified Ideographs".

Unicode Blocks

The Unicode standard arranges groups of characters together in blocks. This is the complete list of blocks across all 17 planes.

Name | From | To | # Codepoints
Basic Latin | U+0000 | U+007F | (128)
Latin-1 Supplement | U+0080 | U+00FF | (128)
Latin Extended-A | U+0100 | U+017F | (128)
Latin Extended-B | U+0180 | U+024F | (208)
IPA Extensions | U+0250 | U+02AF | (96)
Spacing Modifier Letters | U+02B0 | U+02FF | (80)
Combining Diacritical Marks | U+0300 | U+036F | (112)
Greek and Coptic | U+0370 | U+03FF | (135)
Cyrillic | U+0400 | U+04FF | (256)
Cyrillic Supplement | U+0500 | U+052F | (48)
Armenian | U+0530 | U+058F | (89)
Hebrew | U+0590 | U+05FF | (87)
Arabic | U+0600 | U+06FF | (255)
Syriac | U+0700 | U+074F | (77)
Arabic Supplement | U+0750 | U+077F | (48)
Thaana | U+0780 | U+07BF | (50)
NKo | U+07C0 | U+07FF | (59)
Samaritan | U+0800 | U+083F | (61)
Mandaic | U+0840 | U+085F | (29)
Arabic Extended-A | U+08A0 | U+08FF | (50)
Devanagari | U+0900 | U+097F | (128)
Bengali | U+0980 | U+09FF | (93)
Gurmukhi | U+0A00 | U+0A7F | (79)
Gujarati | U+0A80 | U+0AFF | (85)
Oriya | U+0B00 | U+0B7F | (90)
Tamil | U+0B80 | U+0BFF | (72)
Telugu | U+0C00 | U+0C7F | (96)
Kannada | U+0C80 | U+0CFF | (87)
Malayalam | U+0D00 | U+0D7F | (100)
Sinhala | U+0D80 | U+0DFF | (90)
Thai | U+0E00 | U+0E7F | (87)
Lao | U+0E80 | U+0EFF | (67)
Tibetan | U+0F00 | U+0FFF | (211)
Myanmar | U+1000 | U+109F | (160)
Georgian | U+10A0 | U+10FF | (88)
Hangul Jamo | U+1100 | U+11FF | (256)
Ethiopic | U+1200 | U+137F | (358)
Ethiopic Supplement | U+1380 | U+139F | (26)
Cherokee | U+13A0 | U+13FF | (92)
Unified Canadian Aboriginal Syllabics | U+1400 | U+167F | (640)
Ogham | U+1680 | U+169F | (29)
Runic | U+16A0 | U+16FF | (89)
Tagalog | U+1700 | U+171F | (20)
Hanunoo | U+1720 | U+173F | (23)
Buhid | U+1740 | U+175F | (20)
Tagbanwa | U+1760 | U+177F | (18)
Khmer | U+1780 | U+17FF | (114)
Mongolian | U+1800 | U+18AF | (156)
Unified Canadian Aboriginal Syllabics Extended | U+18B0 | U+18FF | (70)
Limbu | U+1900 | U+194F | (68)
Tai Le | U+1950 | U+197F | (35)
New Tai Lue | U+1980 | U+19DF | (83)
Khmer Symbols | U+19E0 | U+19FF | (32)
Buginese | U+1A00 | U+1A1F | (30)
Tai Tham | U+1A20 | U+1AAF | (127)
Combining Diacritical Marks Extended | U+1AB0 | U+1AFF | (15)
Balinese | U+1B00 | U+1B7F | (121)
Sundanese | U+1B80 | U+1BBF | (64)
Batak | U+1BC0 | U+1BFF | (56)
Lepcha | U+1C00 | U+1C4F | (74)
Ol Chiki | U+1C50 | U+1C7F | (48)
Sundanese Supplement | U+1CC0 | U+1CCF | (8)
Vedic Extensions | U+1CD0 | U+1CFF | (41)
Phonetic Extensions | U+1D00 | U+1D7F | (128)
Phonetic Extensions Supplement | U+1D80 | U+1DBF | (64)
Combining Diacritical Marks Supplement | U+1DC0 | U+1DFF | (58)
Latin Extended Additional | U+1E00 | U+1EFF | (256)
Greek Extended | U+1F00 | U+1FFF | (233)
General Punctuation | U+2000 | U+206F | (111)
Superscripts and Subscripts | U+2070 | U+209F | (42)
Currency Symbols | U+20A0 | U+20CF | (31)
Combining Diacritical Marks for Symbols | U+20D0 | U+20FF | (33)
Letterlike Symbols | U+2100 | U+214F | (80)
Number Forms | U+2150 | U+218F | (60)
Arrows | U+2190 | U+21FF | (112)
Mathematical Operators | U+2200 | U+22FF | (256)
Miscellaneous Technical | U+2300 | U+23FF | (251)
Control Pictures | U+2400 | U+243F | (39)
Optical Character Recognition | U+2440 | U+245F | (11)
Enclosed Alphanumerics | U+2460 | U+24FF | (160)
Box Drawing | U+2500 | U+257F | (128)
Block Elements | U+2580 | U+259F | (32)
Geometric Shapes | U+25A0 | U+25FF | (96)
Miscellaneous Symbols | U+2600 | U+26FF | (256)
Dingbats | U+2700 | U+27BF | (192)
Miscellaneous Mathematical Symbols-A | U+27C0 | U+27EF | (48)
Supplemental Arrows-A | U+27F0 | U+27FF | (16)
Braille Patterns | U+2800 | U+28FF | (256)
Supplemental Arrows-B | U+2900 | U+297F | (128)
Miscellaneous Mathematical Symbols-B | U+2980 | U+29FF | (128)
Supplemental Mathematical Operators | U+2A00 | U+2AFF | (256)
Miscellaneous Symbols and Arrows | U+2B00 | U+2BFF | (206)
Glagolitic | U+2C00 | U+2C5F | (94)
Latin Extended-C | U+2C60 | U+2C7F | (32)
Coptic | U+2C80 | U+2CFF | (123)
Georgian Supplement | U+2D00 | U+2D2F | (40)
Tifinagh | U+2D30 | U+2D7F | (59)
Ethiopic Extended | U+2D80 | U+2DDF | (79)
Cyrillic Extended-A | U+2DE0 | U+2DFF | (32)
Supplemental Punctuation | U+2E00 | U+2E7F | (67)
CJK Radicals Supplement | U+2E80 | U+2EFF | (115)
Kangxi Radicals | U+2F00 | U+2FDF | (214)
Ideographic Description Characters | U+2FF0 | U+2FFF | (12)
CJK Symbols and Punctuation | U+3000 | U+303F | (64)
Hiragana | U+3040 | U+309F | (93)
Katakana | U+30A0 | U+30FF | (96)
Bopomofo | U+3100 | U+312F | (41)
Hangul Compatibility Jamo | U+3130 | U+318F | (94)
Kanbun | U+3190 | U+319F | (16)
Bopomofo Extended | U+31A0 | U+31BF | (27)
CJK Strokes | U+31C0 | U+31EF | (36)
Katakana Phonetic Extensions | U+31F0 | U+31FF | (16)
Enclosed CJK Letters and Months | U+3200 | U+32FF | (254)
CJK Compatibility | U+3300 | U+33FF | (256)
CJK Unified Ideographs Extension A | U+3400 | U+4DBF | (6191)
Yijing Hexagram Symbols | U+4DC0 | U+4DFF | (64)
CJK Unified Ideographs | U+4E00 | U+9FFF | (20941)
Yi Syllables | U+A000 | U+A48F | (1165)
Yi Radicals | U+A490 | U+A4CF | (55)
Lisu | U+A4D0 | U+A4FF | (48)
Vai | U+A500 | U+A63F | (300)
Cyrillic Extended-B | U+A640 | U+A69F | (96)
Bamum | U+A6A0 | U+A6FF | (88)
Modifier Tone Letters | U+A700 | U+A71F | (32)
Latin Extended-D | U+A720 | U+A7FF | (159)
Syloti Nagri | U+A800 | U+A82F | (44)
Common Indic Number Forms | U+A830 | U+A83F | (10)
Phags-pa | U+A840 | U+A87F | (56)
Saurashtra | U+A880 | U+A8DF | (81)
Devanagari Extended | U+A8E0 | U+A8FF | (30)
Kayah Li | U+A900 | U+A92F | (48)
Rejang | U+A930 | U+A95F | (37)
Hangul Jamo Extended-A | U+A960 | U+A97F | (29)
Javanese | U+A980 | U+A9DF | (91)
Myanmar Extended-B | U+A9E0 | U+A9FF | (31)
Cham | U+AA00 | U+AA5F | (83)
Myanmar Extended-A | U+AA60 | U+AA7F | (32)
Tai Viet | U+AA80 | U+AADF | (72)
Meetei Mayek Extensions | U+AAE0 | U+AAFF | (23)
Ethiopic Extended-A | U+AB00 | U+AB2F | (32)
Latin Extended-E | U+AB30 | U+AB6F | (54)
Cherokee Supplement | U+AB70 | U+ABBF | (80)
Meetei Mayek | U+ABC0 | U+ABFF | (56)
Hangul Syllables | U+AC00 | U+D7AF | (2)
Hangul Jamo Extended-B | U+D7B0 | U+D7FF | (72)
High Surrogates | U+D800 | U+DB7F | (2)
High Private Use Surrogates | U+DB80 | U+DBFF | (2)
Low Surrogates | U+DC00 | U+DFFF | (2)
Private Use Area | U+E000 | U+F8FF | (2)
CJK Compatibility Ideographs | U+F900 | U+FAFF | (472)
Alphabetic Presentation Forms | U+FB00 | U+FB4F | (58)
Arabic Presentation Forms-A | U+FB50 | U+FDFF | (643)
Variation Selectors | U+FE00 | U+FE0F | (16)
Vertical Forms | U+FE10 | U+FE1F | (10)
Combining Half Marks | U+FE20 | U+FE2F | (16)
CJK Compatibility Forms | U+FE30 | U+FE4F | (32)
Small Form Variants | U+FE50 | U+FE6F | (26)
Arabic Presentation Forms-B | U+FE70 | U+FEFF | (141)
Halfwidth and Fullwidth Forms | U+FF00 | U+FFEF | (225)
Specials | U+FFF0 | U+FFFF | (7)
Linear B Syllabary | U+10000 | U+1007F | (88)
Linear B Ideograms | U+10080 | U+100FF | (123)
Aegean Numbers | U+10100 | U+1013F | (57)
Ancient Greek Numbers | U+10140 | U+1018F | (77)
Ancient Symbols | U+10190 | U+101CF | (13)
Phaistos Disc | U+101D0 | U+101FF | (46)
Lycian | U+10280 | U+1029F | (29)
Carian | U+102A0 | U+102DF | (49)
Coptic Epact Numbers | U+102E0 | U+102FF | (28)
Old Italic | U+10300 | U+1032F | (36)
Gothic | U+10330 | U+1034F | (27)
Old Permic | U+10350 | U+1037F | (43)
Ugaritic | U+10380 | U+1039F | (31)
Old Persian | U+103A0 | U+103DF | (50)
Deseret | U+10400 | U+1044F | (80)
Shavian | U+10450 | U+1047F | (48)
Osmanya | U+10480 | U+104AF | (40)
Elbasan | U+10500 | U+1052F | (40)
Caucasian Albanian | U+10530 | U+1056F | (53)
Linear A | U+10600 | U+1077F | (341)
Cypriot Syllabary | U+10800 | U+1083F | (55)
Imperial Aramaic | U+10840 | U+1085F | (31)
Palmyrene | U+10860 | U+1087F | (32)
Nabataean | U+10880 | U+108AF | (40)
Hatran | U+108E0 | U+108FF | (26)
Phoenician | U+10900 | U+1091F | (29)
Lydian | U+10920 | U+1093F | (27)
Meroitic Hieroglyphs | U+10980 | U+1099F | (32)
Meroitic Cursive | U+109A0 | U+109FF | (90)
Kharoshthi | U+10A00 | U+10A5F | (65)
Old South Arabian | U+10A60 | U+10A7F | (32)
Old North Arabian | U+10A80 | U+10A9F | (32)
Manichaean | U+10AC0 | U+10AFF | (51)
Avestan | U+10B00 | U+10B3F | (61)
Inscriptional Parthian | U+10B40 | U+10B5F | (30)
Inscriptional Pahlavi | U+10B60 | U+10B7F | (27)
Psalter Pahlavi | U+10B80 | U+10BAF | (29)
Old Turkic | U+10C00 | U+10C4F | (73)
Old Hungarian | U+10C80 | U+10CFF | (108)
Rumi Numeral Symbols | U+10E60 | U+10E7F | (31)
Brahmi | U+11000 | U+1107F | (109)
Kaithi | U+11080 | U+110CF | (66)
Sora Sompeng | U+110D0 | U+110FF | (35)
Chakma | U+11100 | U+1114F | (67)
Mahajani | U+11150 | U+1117F | (39)
Sharada | U+11180 | U+111DF | (94)
Sinhala Archaic Numbers | U+111E0 | U+111FF | (20)
Khojki | U+11200 | U+1124F | (61)
Multani | U+11280 | U+112AF | (38)
Khudawadi | U+112B0 | U+112FF | (69)
Grantha | U+11300 | U+1137F | (85)
Tirhuta | U+11480 | U+114DF | (82)
Siddham | U+11580 | U+115FF | (92)
Modi | U+11600 | U+1165F | (79)
Takri | U+11680 | U+116CF | (66)
Ahom | U+11700 | U+1173F | (57)
Warang Citi | U+118A0 | U+118FF | (84)
Pau Cin Hau | U+11AC0 | U+11AFF | (57)
Cuneiform | U+12000 | U+123FF | (922)
Cuneiform Numbers and Punctuation | U+12400 | U+1247F | (116)
Early Dynastic Cuneiform | U+12480 | U+1254F | (196)
Egyptian Hieroglyphs | U+13000 | U+1342F | (1071)
Anatolian Hieroglyphs | U+14400 | U+1467F | (583)
Bamum Supplement | U+16800 | U+16A3F | (569)
Mro | U+16A40 | U+16A6F | (43)
Bassa Vah | U+16AD0 | U+16AFF | (36)
Pahawh Hmong | U+16B00 | U+16B8F | (127)
Miao | U+16F00 | U+16F9F | (133)
Kana Supplement | U+1B000 | U+1B0FF | (2)
Duployan | U+1BC00 | U+1BC9F | (143)
Shorthand Format Controls | U+1BCA0 | U+1BCAF | (4)
Byzantine Musical Symbols | U+1D000 | U+1D0FF | (246)
Musical Symbols | U+1D100 | U+1D1FF | (231)
Ancient Greek Musical Notation | U+1D200 | U+1D24F | (70)
Tai Xuan Jing Symbols | U+1D300 | U+1D35F | (87)
Counting Rod Numerals | U+1D360 | U+1D37F | (18)
Mathematical Alphanumeric Symbols | U+1D400 | U+1D7FF | (996)
Sutton SignWriting | U+1D800 | U+1DAAF | (672)
Mende Kikakui | U+1E800 | U+1E8DF | (213)
Arabic Mathematical Alphabetic Symbols | U+1EE00 | U+1EEFF | (143)
Mahjong Tiles | U+1F000 | U+1F02F | (44)
Domino Tiles | U+1F030 | U+1F09F | (100)
Playing Cards | U+1F0A0 | U+1F0FF | (82)
Enclosed Alphanumeric Supplement | U+1F100 | U+1F1FF | (173)
Enclosed Ideographic Supplement | U+1F200 | U+1F2FF | (57)
Miscellaneous Symbols and Pictographs | U+1F300 | U+1F5FF | (766)
Emoticons | U+1F600 | U+1F64F | (80)
Ornamental Dingbats | U+1F650 | U+1F67F | (48)
Transport and Map Symbols | U+1F680 | U+1F6FF | (98)
Alchemical Symbols | U+1F700 | U+1F77F | (116)
Geometric Shapes Extended | U+1F780 | U+1F7FF | (85)
Supplemental Arrows-C | U+1F800 | U+1F8FF | (148)
Supplemental Symbols and Pictographs | U+1F900 | U+1F9FF | (15)
CJK Unified Ideographs Extension B | U+20000 | U+2A6DF | (42676)
CJK Unified Ideographs Extension C | U+2A700 | U+2B73F | (60)
CJK Unified Ideographs Extension D | U+2B740 | U+2B81F | (27)
CJK Unified Ideographs Extension E | U+2B820 | U+2CEAF | (2)
CJK Compatibility Ideographs Supplement | U+2F800 | U+2FA1F | (542)
Tags | U+E0000 | U+E007F | (97)
Variation Selectors Supplement | U+E0100 | U+E01EF | (240)
Supplementary Private Use Area-A | U+F0000 | U+FFFFF | (4)
Supplementary Private Use Area-B | U+100000 | U+10FFFF | (4)
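
As a rough illustration of how the block list can be used, the sketch below looks up the block that contains a character. It hard-codes only a handful of rows from the list above; the names BLOCKS, blockOf and rangeSize are hypothetical, and a real implementation would load the full list (and would probably use a binary search).

```javascript
// A small excerpt of the block list above: [name, first code point, last code point].
const BLOCKS = [
  ['Basic Latin',            0x0000,  0x007F],
  ['Greek and Coptic',       0x0370,  0x03FF],
  ['Hiragana',               0x3040,  0x309F],
  ['CJK Unified Ideographs', 0x4E00,  0x9FFF],
  ['Emoticons',              0x1F600, 0x1F64F],
];

// Returns the name of the block containing the first character of ch,
// or undefined when it is not covered by the excerpt.
function blockOf(ch) {
  const cp = ch.codePointAt(0);
  const row = BLOCKS.find(([, from, to]) => cp >= from && cp <= to);
  return row ? row[0] : undefined;
}

// Number of code points a block spans. This can exceed the count listed
// in parentheses above, because not every code point in a block is assigned.
function rangeSize(name) {
  const row = BLOCKS.find(([n]) => n === name);
  return row ? row[2] - row[1] + 1 : undefined;
}

console.log(blockOf('A'));                  // "Basic Latin"
console.log(blockOf('あ'));                 // "Hiragana"
console.log(blockOf('😂'));                 // "Emoticons"
console.log(rangeSize('Greek and Coptic')); // 144
```

Note that codePointAt is used rather than charCodeAt so that characters outside the Basic Multilingual Plane, such as emoji, resolve to a single code point.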

Principles of the Unicode Standard

The Unicode Standard sets forth the following fundamental principles:

  • Universal repertoire - Every writing system ever used shall be respected and represented in the standard
  • Logical order - In bidirectional text, characters are stored in logical order, not in the order in which they are displayed.
  • Efficiency - The documentation must be efficient and complete.
  • Unification - Where different cultures or languages use the same character, it shall be included only once.
  • Characters, not glyphs - Only characters, not glyphs, shall be encoded. In a nutshell, glyphs are the actual graphical representations of characters.
  • Dynamic composition - New characters can be composed of other, already standardized characters. For example, the character “Ä” can be composed of an “A” and a dieresis sign (“ ¨ ”).
  • Semantics - Included characters must be well defined and distinguished from others.
  • Stability - Once defined, characters shall never be removed or have their codepoints reassigned. In the case of an error, a codepoint shall be deprecated.
  • Plain Text - Characters in the standard are text and never mark-up or metacharacters.
  • Convertibility - Every other used encoding shall be representable in terms of a Unicode encoding.

Note: Principle descriptions are from codepoints.net
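
The dynamic composition principle can be observed directly in JavaScript. The sketch below builds “Ä” both ways, as the single precomposed code point U+00C4 and as “A” followed by U+0308 COMBINING DIAERESIS, and uses the built-in String.prototype.normalize to compare them. The helper name sameText is hypothetical.

```javascript
const precomposed = '\u00C4';  // Ä as one code point
const decomposed  = 'A\u0308'; // A + COMBINING DIAERESIS

// Compares two strings after bringing both to the composed form (NFC).
function sameText(a, b) {
  return a.normalize('NFC') === b.normalize('NFC');
}

console.log(precomposed === decomposed);                  // false: different code points
console.log(sameText(precomposed, decomposed));           // true: same text once normalized
console.log(precomposed.normalize('NFD') === decomposed); // true: NFD splits Ä apart again
console.log([...precomposed].length, [...decomposed].length); // 1 2
```

Normalizing before comparing or storing strings avoids treating the two spellings of the same character as different text.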

Unicode Versions



Contributing

See the Awesome Unicode contribution guide for details on how to contribute.

Code of Conduct

See the Code of Conduct for details. Basically it comes down to:

In the interest of fostering an open and welcoming environment, we as contributors and maintainers pledge to making participation in our project and our community a harassment-free experience for everyone, regardless of age, body size, disability, ethnicity, gender identity and expression, level of experience, nationality, personal appearance, race, religion, or sexual identity and orientation.

License

CC0

To the extent possible under law, the contributors have waived all copyright and related or neighboring rights to this work. See the license file for details.
