UTF-8
v1.3
The UTF-8 module, available through the utf8 global, contains helper functions for working with UTF-8 encoded strings.
Functions
Note: Positions and lengths are given in bytes, unless otherwise specified.
codepointToString
string = utf8.codepointToString( codepoint )
utf8.codepointToString( codepoint, outputArray )
Convert a single Unicode codepoint to a string, optionally adding the result to an array. Raises an error if the codepoint is outside the valid range (0..0x10FFFF).
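To illustrate what the conversion involves, here is a minimal pure-Lua sketch of UTF-8 encoding. The function name encodeCodepoint is hypothetical and this is not the module's own implementation. What happens to the return value when an output array is passed is an assumption, as the signature above shows none.

```lua
-- Hypothetical stand-in for utf8.codepointToString (not the module's code).
local function encodeCodepoint(codepoint, outputArray)
    if codepoint < 0 or codepoint > 0x10FFFF then
        error("codepoint out of range: " .. tostring(codepoint), 2)
    end

    local floor, char = math.floor, string.char
    local encoded

    if codepoint < 0x80 then
        -- 1 byte: 0xxxxxxx
        encoded = char(codepoint)
    elseif codepoint < 0x800 then
        -- 2 bytes: 110xxxxx 10xxxxxx
        encoded = char(0xC0 + floor(codepoint / 0x40),
                       0x80 + codepoint % 0x40)
    elseif codepoint < 0x10000 then
        -- 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        encoded = char(0xE0 + floor(codepoint / 0x1000),
                       0x80 + floor(codepoint / 0x40) % 0x40,
                       0x80 + codepoint % 0x40)
    else
        -- 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        encoded = char(0xF0 + floor(codepoint / 0x40000),
                       0x80 + floor(codepoint / 0x1000) % 0x40,
                       0x80 + floor(codepoint / 0x40) % 0x40,
                       0x80 + codepoint % 0x40)
    end

    if outputArray then
        -- Assumption: with an output array, append and return nothing.
        outputArray[#outputArray + 1] = encoded
        return
    end
    return encoded
end

print(#encodeCodepoint(0x41))   -- 1
print(#encodeCodepoint(0xDC))   -- 2 (Ü)
print(#encodeCodepoint(0x20AC)) -- 3 (€)
```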
getCharacterLength
length = utf8.getCharacterLength( string [, position=1 ] )
Get the number of bytes that the character at position occupies (between 1 and 4). Returns nil if the string is invalid at position. Examples:
local s = "aÜx"
print(utf8.getCharacterLength(s, 1)) -- 1 (a)
print(utf8.getCharacterLength(s, 2)) -- 2 (Ü)
print(utf8.getCharacterLength(s, 4)) -- 1 (x)
getCodepointAndLength
codepoint, length = utf8.getCodepointAndLength( string [, position=1 ] )
Get the codepoint of the character at position, together with the number of bytes it occupies. Returns nil if the string is invalid at position.
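As a rough illustration of the decoding step, the following pure-Lua sketch reads one character at a byte position. The name decodeAt is hypothetical and the code is not taken from the module. It checks only the lead byte and the continuation bytes, so it may accept some sequences (such as overlong three-byte forms) that a stricter decoder would reject.

```lua
-- Hypothetical stand-in for utf8.getCodepointAndLength (not the module's code).
local function decodeAt(s, position)
    position = position or 1
    local lead = s:byte(position)
    if not lead then return nil end -- position is outside the string

    local length, codepoint
    if lead < 0x80 then
        return lead, 1 -- ASCII
    elseif lead >= 0xC2 and lead <= 0xDF then
        length, codepoint = 2, lead - 0xC0
    elseif lead >= 0xE0 and lead <= 0xEF then
        length, codepoint = 3, lead - 0xE0
    elseif lead >= 0xF0 and lead <= 0xF4 then
        length, codepoint = 4, lead - 0xF0
    else
        return nil -- continuation byte or invalid lead byte
    end

    -- Each continuation byte (10xxxxxx) contributes six more bits.
    for i = position + 1, position + length - 1 do
        local byte = s:byte(i)
        if not byte or byte < 0x80 or byte > 0xBF then return nil end
        codepoint = codepoint * 0x40 + (byte - 0x80)
    end

    return codepoint, length
end

local s = "a\195\156x" -- "aÜx" written with byte escapes
print(decodeAt(s, 1)) -- 97   1
print(decodeAt(s, 2)) -- 220  2
print(decodeAt(s, 3)) -- nil (middle of Ü)
```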
getLength
length = utf8.getLength( string [, startPosition=1 ] )
Get the number of characters in a string, counting from startPosition. Returns nil and the position of the first error if the string is not valid UTF-8. Examples:
print(utf8.getLength("aÜx")) -- 3
print(utf8.getLength("a\255x")) -- nil, 2
getStartOfCharacter
startPosition = utf8.getStartOfCharacter( string, position )
Get the position where the character at position begins. Returns nil if the string is invalid at position. Example:
print(utf8.getStartOfCharacter("aÜx", 3)) -- 2
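One way to find the start of a character is to step backwards over continuation bytes (0x80 to 0xBF) until a lead byte is reached. The sketch below shows the idea in pure Lua. The name startOfCharacter is hypothetical, the code is not the module's implementation, and it does not verify that the lead byte it finds announces a sequence long enough to cover position.

```lua
-- Hypothetical stand-in for utf8.getStartOfCharacter (not the module's code).
local function startOfCharacter(s, position)
    if not s:byte(position) then return nil end -- outside the string

    local start = position
    -- A character has at most three continuation bytes after its lead byte.
    while start > 1 and position - start < 3 do
        local byte = s:byte(start)
        if byte < 0x80 or byte > 0xBF then break end -- found a non-continuation byte
        start = start - 1
    end

    local byte = s:byte(start)
    if byte >= 0x80 and byte <= 0xBF then
        return nil -- never reached a lead byte
    end
    return start
end

local s = "a\195\156x" -- "aÜx" written with byte escapes
print(startOfCharacter(s, 1)) -- 1
print(startOfCharacter(s, 3)) -- 2
print(startOfCharacter(s, 4)) -- 4
```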
Constants
CHARACTER_PATTERN
utf8.CHARACTER_PATTERN = "[%z\1-\127\194-\244][\128-\191]*"
A pattern that matches one UTF-8 encoded character. Example:
local s = "aÜx"
print(s:match(utf8.CHARACTER_PATTERN, 2)) -- Ü
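The pattern can also be used with string.gmatch to walk through a string one character at a time. The sketch below uses a local copy of the pattern value shown above so that it runs without the utf8 global. The pattern does not validate its input: bytes that cannot start a character are skipped by gmatch rather than reported.

```lua
-- Local copy of the documented pattern, so the snippet runs standalone.
local CHARACTER_PATTERN = "[%z\1-\127\194-\244][\128-\191]*"

local s = "a\195\156x" -- "aÜx" written with byte escapes

-- Collect each character as its own (possibly multi-byte) string.
local characters = {}
for character in s:gmatch(CHARACTER_PATTERN) do
    characters[#characters + 1] = character
end

print(#characters) -- 3
for i, character in ipairs(characters) do
    print(i, #character) -- byte lengths: 1, 2, 1
end
```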
Page updated: 2022-04-13