Hive SQL 字符处理函数

提示

Hive SQL 教程欢迎使用。提供建议、纠错、催更等加作者微信: gr99123（备注：sql ）和关注公众号「盖若」ID: gairuo。跟作者学习，请进入 Python学习课程。欢迎关注作者出版的书籍：《深入浅出Pandas》和《Python之光》。

Hive支持以下内置的 String 类型数据处理的函数：

字符处理函数

返回类型

用法

描述

int

ascii(string str)

返回 str 的第一个字符的 ascii 编码数值。

string

base64(binary bin)

将参数从二进制转换为 64 进制字符串。(始于 Hive 0.12.0).

int character_length(string str) 返回 str 中包含的 UTF-8 字符数 (始于 Hive 2.2.0). 可简写为 char_length。

string chr(bigint|double A) 返回 ASCII 编码为 A 的字符。(始于 1.3.0 和 2.1.0)。如果 A 大于 256 则相当于 chr(A % 256)，例如 select chr(88) 返回 "X"

string

concat(string|binary A, string|binary B...)

返回按顺序连接作为参数传入的字符串或字节所产生的字符串或字节。例如，concat('foo', 'bar') 结果为 'foobar'。此函数可以接受任意数量的输入字符串。

array<struct<string,double>>

context_ngrams(array<array<string>>, array<string>, int K, int pf)

给定一个上下文字符串，从一组标记化句子中返回 top-k 上下文 N-gram。有关更多信息，请参阅 StatisticsAndDataMining。

string

concat_ws(string SEP, string A, string B...)

与上面的 concat() 类似，但带有自定义分隔符 SEP。

string

concat_ws(string SEP, array<string>)

与上面的 concat_ws() 类似，但使用字符串数组。(始于 Hive 0.9.0)

string

decode(binary bin, string charset)

使用提供的字符集将第一个参数解码为字符串 (以下编码之一 'US-ASCII', 'ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-16LE', 'UTF-16')，如果任一参数为 null，则结果也将为 null. (始于 0.12.0.)

string

elt(N int,str1 string,str2 string,str3 string,...)

返回索引号处的字符串。例如，elt(2，'hello'，'world') 返回 'world'。如果 N 小于 1 或大于参数数，则返回 NULL。(参见)

binary

encode(string src, string charset)

使用提供的字符集将第一个参数编码为二进制 (以下编码之一 'US-ASCII', 'ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-16LE', 'UTF-16')，如果任一参数为 null，则结果也将为 null. (始于 0.12.0.)

int

field(val T,val1 T,val2 T,val3 T,...)

返回 val 在 val1,val2,val3,... 列的索引，不存在为 0，比如 field('world','say','hello','world') 返回 3。支持所有基元类型，使用 str.equals(x) 比较参数。如果 val 为 NULL，则返回值为 0。(参见)

int

find_in_set(string str, string strList)

返回 strList 中第一个出现的 str，其中 strList 是逗号分隔的字符串。如果任一参数为 null，则返回 null。如果第一个参数包含任何逗号，则返回 0。如 find_in_set('ab', 'abc,b,ab,c,def') 返回 3。

string

format_number(number x, int d)

将数字 X 格式化为形如 '#,###,###.##' ，舍入到 D 位小数，并以字符串形式返回结果。如果 D 为 0，则结果没有小数点或小数部分。(始于 0.10.0; Hive 0.14.0 修正了浮点类型的错误， Hive 0.14.0 添加了 decimal 类型支持)

string

get_json_object(string json_string, string path)

根据指定的 json 路径从 json 字符串中提取 json 对象，并返回提取的 json 对象的 json 字符串。如果输入的 json 字符串无效，它将返回 null。
NOTE: json 路径只能包含以下字符 [0-9a-z_], 即：没有大写或特殊字符，同样 key 不能以数字开头，这是由于对配置单元列名的限制。

boolean

in_file(string str, string filename)

如果字符串 str 在文件名中显示为整行，则返回 true。

int

instr(string str, string substr)

返回字符 substr 在 str 首次出现的位置。如果其中一个参数为 null，则返回 null；如果在str中找不到 substr，则返回 0。请注意，这不是从 0 开始的，从 1 开始。

int

length(string A)

返回字符串的长度。

int

locate(string substr, string str[, int pos])

返回字符在 substr 在 str 之后首次出现的位置。

string

lower(string A) lcase(string A)

返回将 A 的所有字符转换为小写所产生的字符串。如 lower('fOoBaR') 为 'foobar'.

string

lpad(string str, int len, string pad)

返回 str，左填充 pad，长度为 len。如果 str 长于len，则返回值将缩短为 len 个字符。如果填充字符串为空，则返回值为空。

string

ltrim(string A)

返回从 A 开头（左侧）修剪空格所产生的字符串。如 ltrim(' foobar ') 为 'foobar '.

array<struct<string,double>>

ngrams(array<array<string>>, int N, int K, int pf)

返回一组标记化句子中的 top-k N-gram，例如那些由 sentences() UDAF 返回的句子。见 StatisticsAndDataMining

int octet_length(string str) 返回以 UTF-8 编码保存字符串 str 所需的八位字节数 (始于 Hive 2.2.0). 请注意，octet_length(str) 可以大于 character_length(str)。

string

parse_url(string urlString, string partToExtract [, string keyToExtract])

从 URL 返回指定的部分。partToExtract 的有效值包括 HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, and USERINFO. 如 parse_url('http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1', 'HOST') 返回 'facebook.com'. 此外，还可以通过提供键作为第三个参数来提取查询中特定键的值，如 parse_url('http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1', 'QUERY', 'k1') 返回 'v1'.

string

printf(String format, Obj... args)

返回根据 printf 样式格式字符串格式化的输入 (始于 Hive 0.9.0).

string

quote(String text)

返回带引号的字符串 (包括任何单引号的转义字符，HIVE-4.0.0)

Input	Output
NULL	NULL
DONT	'DONT'
DON'T	'DON\'T'

string

regexp_extract(string subject, string pattern, int index)

返回使用模式提取的字符串。例如, regexp_extract('foothebar', 'foo(.*?)(bar)', 2) 返回 'bar.'，请注意，在使用预定义的字符类时需要谨慎：使用'\s'作为第二个参数将匹配字母s；'\\s'是匹配空格等所必需的。index参数是Java regex Matcher group() 方法索引。参见 docs/api/java/util/regex/Matcher.html 有关“index” 或 Java regex group() 方法的更多信息。

string

regexp_replace(string INITIAL_STRING, string PATTERN, string REPLACEMENT)

Returns the string resulting from replacing all substrings in INITIAL_STRING that match the java regular expression syntax defined in PATTERN with instances of REPLACEMENT. For example, regexp_replace("foobar", "oo|ar", "") returns 'fb.' Note that some care is necessary in using predefined character classes: using '\s' as the second argument will match the letter s; '\\s' is necessary to match whitespace, etc.

string

repeat(string str, int n)

Repeats str n times.

string replace(string A, string OLD, string NEW) Returns the string A with all non-overlapping occurrences of OLD replaced with NEW (as of Hive 1.3.0 and 2.1.0). Example: select replace("ababab", "abab", "Z"); returns "Zab".

string

reverse(string A)

Returns the reversed string.

string

rpad(string str, int len, string pad)

Returns str, right-padded with pad to a length of len. If str is longer than len, the return value is shortened to len characters. In case of empty pad string, the return value is null.

string

rtrim(string A)

Returns the string resulting from trimming spaces from the end(right hand side) of A. For example, rtrim(' foobar ') results in ' foobar'.

array<array<string>>

sentences(string str, string lang, string locale)

Tokenizes a string of natural language text into words and sentences, where each sentence is broken at the appropriate sentence boundary and returned as an array of words. The 'lang' and 'locale' are optional arguments. For example, sentences('Hello there! How are you?') returns ( ("Hello", "there"), ("How", "are", "you") ).

string

space(int n)

Returns a string of n spaces.

array

split(string str, string pat)

Splits str around pat (pat is a regular expression).

map<string,string>

str_to_map(text[, delimiter1, delimiter2])

Splits text into key-value pairs using two delimiters. Delimiter1 separates text into K-V pairs, and Delimiter2 splits each K-V pair. Default delimiters are ',' for delimiter1 and ':' for delimiter2.

string

substr(string|binary A, int start) substring(string|binary A, int start)

Returns the substring or slice of the byte array of A starting from start position till the end of string A. For example, substr('foobar', 4) results in 'bar' (see [http://dev.mysql.com/doc/refman/5.0/en/string-functions.html#function_substr]).

string

substr(string|binary A, int start, int len) substring(string|binary A, int start, int len)

Returns the substring or slice of the byte array of A starting from start position with length len. For example, substr('foobar', 4, 1) results in 'b' (see [http://dev.mysql.com/doc/refman/5.0/en/string-functions.html#function_substr]).

string substring_index(string A, string delim, int count) Returns the substring from string A before count occurrences of the delimiter delim (as of Hive 1.3.0). If count is positive, everything to the left of the final delimiter (counting from the left) is returned. If count is negative, everything to the right of the final delimiter (counting from the right) is returned. Substring_index performs a case-sensitive match when searching for delim. Example: substring_index('www.apache.org', '.', 2) = 'www.apache'.

string

Translates the input string by replacing the characters present in the from string with the corresponding characters in the to string. This is similar to the translate function in PostgreSQL. If any of the parameters to this UDF are NULL, the result is NULL as well. (Available as of Hive 0.10.0, for string types)

Char/varchar support added as of Hive 0.14.0.

string

trim(string A)

Returns the string resulting from trimming spaces from both ends of A. For example, trim(' foobar ') results in 'foobar'

binary

unbase64(string str)

Converts the argument from a base 64 string to BINARY. (As of Hive 0.12.0.)

string

upper(string A) ucase(string A)

Returns the string resulting from converting all characters of A to upper case. For example, upper('fOoBaR') results in 'FOOBAR'.

string initcap(string A) Returns string, with the first letter of each word in uppercase, all other letters in lowercase. Words are delimited by whitespace. (As of Hive 1.1.0.)

int levenshtein(string A, string B) Returns the Levenshtein distance between two strings (as of Hive 1.2.0). For example, levenshtein('kitten', 'sitting') results in 3.

string soundex(string A) Returns soundex code of the string (as of Hive 1.2.0). For example, soundex('Miller') results in M460.

数据脱敏屏蔽功能

Hive支持以下内置数据屏蔽功能：

Return Type	Name(Signature)	Description
string	mask(string str[, string upper[, string lower[, string number]]])	Returns a masked version of str (as of Hive 2.1.0). By default, upper case letters are converted to "X", lower case letters are converted to "x" and numbers are converted to "n". For example mask("abcd-EFGH-8765-4321") results in xxxx-XXXX-nnnn-nnnn. You can override the characters used in the mask by supplying additional arguments: the second argument controls the mask character for upper case letters, the third argument for lower case letters and the fourth argument for numbers. For example, mask("abcd-EFGH-8765-4321", "U", "l", "#") results in llll-UUUU-####-####.
string	mask_first_n(string str[, int n])	Returns a masked version of str with the first n values masked (as of Hive 2.1.0). Upper case letters are converted to "X", lower case letters are converted to "x" and numbers are converted to "n". For example, mask_first_n("1234-5678-8765-4321", 4) results in nnnn-5678-8765-4321.
string	mask_last_n(string str[, int n])	Returns a masked version of str with the last n values masked (as of Hive 2.1.0). Upper case letters are converted to "X", lower case letters are converted to "x" and numbers are converted to "n". For example, mask_last_n("1234-5678-8765-4321", 4) results in 1234-5678-8765-nnnn.
string	mask_show_first_n(string str[, int n])	Returns a masked version of str, showing the first n characters unmasked (as of Hive 2.1.0). Upper case letters are converted to "X", lower case letters are converted to "x" and numbers are converted to "n". For example, mask_show_first_n("1234-5678-8765-4321", 4) results in 1234-nnnn-nnnn-nnnn.
string	mask_show_last_n(string str[, int n])	Returns a masked version of str, showing the last n characters unmasked (as of Hive 2.1.0). Upper case letters are converted to "X", lower case letters are converted to "x" and numbers are converted to "n". For example, mask_show_last_n("1234-5678-8765-4321", 4) results in nnnn-nnnn-nnnn-4321.
string	mask_hash(string\|char\|varchar str)	Returns a hashed value based on str (as of Hive 2.1.0). The hash is consistent and can be used to join masked values together across tables. This function returns null for non-string types.

< if 条件 Hive SQL 教程表生成函数（UDTF） >

更新时间：2024-11-23 15:19:50 标签：hql 字符 sql