LuaWebGen

UTF-8

v1.3 The UTF-8 module, available through the utf8 global, provides helper functions for working with UTF-8 encoded strings.


Functions

Note: Positions and lengths are given in bytes, unless otherwise specified.

codepointToString

string = utf8.codepointToString( codepoint )
utf8.codepointToString( codepoint, outputArray )

Convert a single Unicode codepoint to a string, optionally adding the result to an array. Raises an error if the codepoint is outside the valid range (0..0x10FFFF).
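To illustrate what the conversion involves, here is a minimal sketch of UTF-8 encoding in plain Lua. The helper name encodeCodepoint is hypothetical and is not LuaWebGen's implementation; it only shows how a codepoint maps to 1 to 4 bytes, and how results can be collected in an array and joined once at the end.

```lua
-- Hypothetical helper (not part of LuaWebGen): encode one codepoint as UTF-8.
local function encodeCodepoint(cp)
    local floor = math.floor
    if cp < 0 or cp > 0x10FFFF then
        error("codepoint out of range: " .. tostring(cp))
    elseif cp < 0x80 then
        -- 1 byte: 0xxxxxxx
        return string.char(cp)
    elseif cp < 0x800 then
        -- 2 bytes: 110xxxxx 10xxxxxx
        return string.char(0xC0 + floor(cp / 0x40), 0x80 + cp % 0x40)
    elseif cp < 0x10000 then
        -- 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return string.char(
            0xE0 + floor(cp / 0x1000),
            0x80 + floor(cp / 0x40) % 0x40,
            0x80 + cp % 0x40)
    else
        -- 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return string.char(
            0xF0 + floor(cp / 0x40000),
            0x80 + floor(cp / 0x1000) % 0x40,
            0x80 + floor(cp / 0x40) % 0x40,
            0x80 + cp % 0x40)
    end
end

-- Collect several characters in an array and join them once at the end.
local parts = {}
for _, cp in ipairs({ 0x61, 0xDC, 0x20AC }) do -- a, Ü, €
    parts[#parts + 1] = encodeCodepoint(cp)
end
local text = table.concat(parts)
print(#text) -- 6 (1 + 2 + 3 bytes)
```

Inside a LuaWebGen script, utf8.codepointToString would take the place of the hypothetical helper.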

getCharacterLength

length = utf8.getCharacterLength( string [, position=1 ] )

Get the number of bytes the character at position takes up (between 1 and 4). Returns nil if the string is invalid at position. Examples:

local s = "aÜx"
print(utf8.getCharacterLength(s, 1)) -- 1 (a)
print(utf8.getCharacterLength(s, 2)) -- 2 (Ü)
print(utf8.getCharacterLength(s, 4)) -- 1 (x)

getCodepointAndLength

codepoint, length = utf8.getCodepointAndLength( string [, position=1 ] )

Get the codepoint for, and number of bytes taken up by, the character at position. Returns nil if the string is invalid at position.
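Returning the codepoint and the byte length together makes it easy to walk a string one character at a time. The sketch below shows the idea in plain Lua with a hypothetical stand-in, decodeAt, so that it runs outside LuaWebGen. The stand-in is simplified (it does not reject overlong 3- and 4-byte forms, surrogates, or values above 0x10FFFF) and is not the module's implementation.

```lua
-- Hypothetical stand-in for utf8.getCodepointAndLength (simplified validation).
local function decodeAt(s, pos)
    pos = pos or 1
    local b1 = s:byte(pos)
    if not b1 then return nil end

    local length, codepoint
    if b1 < 0x80 then
        return b1, 1 -- ASCII
    elseif b1 >= 0xC2 and b1 <= 0xDF then
        length, codepoint = 2, b1 - 0xC0
    elseif b1 >= 0xE0 and b1 <= 0xEF then
        length, codepoint = 3, b1 - 0xE0
    elseif b1 >= 0xF0 and b1 <= 0xF4 then
        length, codepoint = 4, b1 - 0xF0
    else
        return nil -- continuation byte or invalid lead byte
    end

    -- Each continuation byte (10xxxxxx) contributes 6 more bits.
    for i = pos + 1, pos + length - 1 do
        local b = s:byte(i)
        if not b or b < 0x80 or b > 0xBF then return nil end
        codepoint = codepoint * 0x40 + (b - 0x80)
    end
    return codepoint, length
end

-- Walk the string character by character, advancing by each character's byte length.
local s = "a\195\156x" -- "aÜx" written with explicit bytes
local pos = 1
while pos <= #s do
    local codepoint, length = decodeAt(s, pos)
    if not codepoint then
        print("invalid UTF-8 at byte " .. pos)
        break
    end
    print(pos, codepoint, length) -- 1 97 1, then 2 220 2, then 4 120 1
    pos = pos + length
end
```

In a LuaWebGen script the loop stays the same with utf8.getCodepointAndLength in place of the stand-in.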

getLength

length = utf8.getLength( string [, startPosition=1 ] )

Get the total length of a string in characters starting at startPosition. Returns nil and the first error position if the string isn't a valid UTF-8 string. Example:

print(utf8.getLength("aÜx"))    -- 3
print(utf8.getLength("a\255x")) -- nil, 2

getStartOfCharacter

startPosition = utf8.getStartOfCharacter( string, position )

Get the position where the character at position begins. Returns nil if the string is invalid at position. Example:

print(utf8.getStartOfCharacter("aÜx", 3)) -- 2
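One possible use for this is cutting a string at a byte limit without splitting a multi-byte character. The sketch below uses a hypothetical stand-in, startOfCharacter, that steps backward over continuation bytes so the snippet runs in plain Lua. Unlike the function documented here, the stand-in never returns nil for invalid input. The truncate helper is also hypothetical.

```lua
-- Hypothetical stand-in for utf8.getStartOfCharacter: step back over
-- continuation bytes (0x80..0xBF) until a lead byte is found.
local function startOfCharacter(s, pos)
    local b = s:byte(pos)
    while pos > 1 and b and b >= 0x80 and b <= 0xBF do
        pos = pos - 1
        b = s:byte(pos)
    end
    return pos
end

-- Hypothetical helper: keep at most maxBytes bytes without splitting a character.
local function truncate(s, maxBytes)
    if #s <= maxBytes then return s end
    -- Find where the first character that does not fully fit begins,
    -- then keep everything before it.
    local start = startOfCharacter(s, maxBytes + 1)
    return s:sub(1, start - 1)
end

local s = "a\195\156x" -- "aÜx" written with explicit bytes
print(#truncate(s, 2)) -- 1 (only "a" fits, because Ü needs bytes 2 and 3)
print(#truncate(s, 3)) -- 3 ("aÜ")
```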

Constants

CHARACTER_PATTERN

utf8.CHARACTER_PATTERN = "[%z\1-\127\194-\244][\128-\191]*"

A Lua pattern that matches one UTF-8 encoded character. Example:

local s = "aÜx"
print(s:match(utf8.CHARACTER_PATTERN, 2)) -- Ü
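The pattern can also be used with string.gmatch to iterate over every character in a string. The sketch below copies the pattern string into a local variable so that it runs in plain Lua; in a LuaWebGen script utf8.CHARACTER_PATTERN can be used directly.

```lua
-- Same pattern string as utf8.CHARACTER_PATTERN, copied so the snippet is self-contained.
local CHARACTER_PATTERN = "[%z\1-\127\194-\244][\128-\191]*"

local s = "a\195\156x" -- "aÜx" written with explicit bytes
local characters = {}
for character in s:gmatch(CHARACTER_PATTERN) do
    characters[#characters + 1] = character
end

print(#s)          -- 4 (bytes)
print(#characters) -- 3 (characters)
```

Note that the pattern accepts a lead byte followed by any number of continuation bytes, so it does not validate the input. It is best suited to strings already known to be valid UTF-8.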

Page updated: 2021-07-08