utf 8 - How to detect Chinese characters in MySQL?
I need to count the values in a list of columns that contain Chinese. For example, if "北京实业" occurs, that is 4 characters in Chinese, but I count it only once, since it occurs in the column.
Is there any specific code to figure this out?
SELECT COUNT(*) FROM tbl WHERE HEX(col) REGEXP '^(..)*(E[2-9F]|F0A)'
will count the number of records that have Chinese characters in column col.
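To see what the query tests without a MySQL server at hand, here is a minimal Python sketch that mimics HEX(col) REGEXP '^(..)*(E[2-9F]|F0A)' on a few sample strings. The function name and the sample rows are made up for illustration, and it assumes the column bytes are UTF-8.

```python
import re

# Same pattern as in the query; HEX() in MySQL yields uppercase hex digits.
PATTERN = re.compile(r'^(..)*(E[2-9F]|F0A)')

def hex_regexp_matches(value: str) -> bool:
    """Mimic HEX(col) REGEXP '^(..)*(E[2-9F]|F0A)' for a UTF-8 column."""
    hex_string = value.encode('utf-8').hex().upper()
    return PATTERN.search(hex_string) is not None

# Hypothetical rows standing in for the values of column col.
rows = ['北京实业', 'hello', 'abc草def', 'naïve']
matching = [row for row in rows if hex_regexp_matches(row)]
print(len(matching))  # prints 2
```

Note that 'naïve' hexes to 6E61C3AF7665, which contains "E6" at an odd offset (the second digit of 6E followed by the first digit of 61). The leading ^(..)* is what stops that from being mistaken for a lead byte.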
Problems:
- I am not sure which ranges of hex represent Chinese.
- The test may also include Korean and Japanese ("CJK").
- In MySQL, 4-byte Chinese characters need utf8mb4 instead of utf8.
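The utf8mb4 point can be checked per string: MySQL's utf8 (utf8mb3) stores at most 3 bytes per character, so any character whose UTF-8 encoding takes 4 bytes needs utf8mb4. A rough Python sketch, with a hypothetical helper name:

```python
def needs_utf8mb4(value: str) -> bool:
    """True if any character takes 4 bytes in UTF-8 (code point above U+FFFF)."""
    return any(len(ch.encode('utf-8')) == 4 for ch in value)

print(needs_utf8mb4('北京实业'))     # every character here is 3 bytes
print(needs_utf8mb4('\U0002070E'))  # U+2070E is a 4-byte character
```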
Elaboration
I am assuming the column in the table is CHARACTER SET utf8. In utf8 encoding, Chinese characters begin with a byte between hex E2 and E9, or EF, or F0. The ones starting with hex E are 3 bytes long, but I am not checking the length; the F0 ones are 4 bytes.
The regexp starts with ^(..)*, meaning "from the start of the string (^), locate 0 or more (*) 2-character (..) values". Stepping 2 hex digits at a time keeps the match aligned on byte boundaries. After that should come either E-something or F0A. After that, anything can occur. The E-something is, more specifically, E followed by any of 2, 3, 4, 5, 6, 7, 8, 9, or F.
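To make "E followed by 2-9 or F" concrete, this Python sketch lists the code point range that each admitted 3-byte lead byte can start. It only applies the standard UTF-8 bit layout (the low 4 bits of a 3-byte lead byte are the top 4 bits of the code point); the variable names are illustrative.

```python
# Lead bytes admitted by E[2-9F]: E2 through E9, plus EF.
lead_bytes = list(range(0xE2, 0xEA)) + [0xEF]

ranges = {}
for lead in lead_bytes:
    low = (lead & 0x0F) << 12   # all continuation bits zero
    high = low | 0x0FFF         # all continuation bits one
    ranges[lead] = (low, high)
    print(f'{lead:02X}: U+{low:04X}..U+{high:04X}')
```

The main CJK Unified Ideographs block, U+4E00..U+9FFF, sits under lead bytes E4 through E9; the other admitted lead bytes cover a good deal of material that is not Chinese.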
Picked at random, I see that 草 encodes as the 3 hex bytes E88D89, and 𠜎 encodes as the 4 hex bytes F0A09C8E.
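Those two encodings are easy to double-check outside MySQL. A small Python sketch (the helper name is made up) that prints the code point, the byte length, and the hex of each character:

```python
def utf8_hex(ch: str) -> str:
    """Uppercase hex of the UTF-8 encoding, like MySQL's HEX() on a utf8 value."""
    return ch.encode('utf-8').hex().upper()

for ch in ('\u8349', '\U0002070E'):   # 草 and 𠜎
    print(f'U+{ord(ch):04X}', len(ch.encode('utf-8')), utf8_hex(ch))
# U+8349 3 E88D89
# U+2070E 4 F0A09C8E
```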
I do not know of a better way to check a string for a specific language.
As you found, a REGEXP can be rather slow.
This regexp is over-kill, in that some non-Chinese characters may be captured.
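One way to see the over-kill, and a possible narrower test, is to compare the hex pattern against explicit Unicode block ranges. The sketch below is Python rather than SQL, and the chosen blocks (CJK Unified Ideographs, Extension A, Extension B, Compatibility Ideographs) are my assumption of what "Chinese" should cover. Han ideographs are shared with Japanese and Korean text, so this still cannot tell the languages apart.

```python
import re

HEX_PATTERN = re.compile(r'^(..)*(E[2-9F]|F0A)')

# Assumed set of Han ideograph blocks; extend it if later extensions matter.
HAN_PATTERN = re.compile(
    '[\u3400-\u4DBF'          # CJK Unified Ideographs Extension A
    '\u4E00-\u9FFF'           # CJK Unified Ideographs
    '\uF900-\uFAFF'           # CJK Compatibility Ideographs
    '\U00020000-\U0002A6DF]'  # CJK Unified Ideographs Extension B
)

def hex_test(value: str) -> bool:
    """The broad test from the query, applied to the UTF-8 hex of the value."""
    return HEX_PATTERN.search(value.encode('utf-8').hex().upper()) is not None

def han_test(value: str) -> bool:
    """The narrower test: at least one character in the assumed Han blocks."""
    return HAN_PATTERN.search(value) is not None

for sample in ('北京实业', '10 €', 'ひらがな', 'hello'):
    print(sample, hex_test(sample), han_test(sample))
```

The euro sign (E282AC) and hiragana (lead byte E3) pass the hex test but not the block-range test, which is the kind of false positive meant above.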