utf 8 - How to detect Chinese characters in MySQL?
I need to calculate the number of Chinese occurrences in a list of columns. For example, if "北京实业" occurs, that is 4 Chinese characters, but it counts only once since it occurs in the column.
Is there specific code to figure this out?
SELECT COUNT(*) FROM tbl WHERE HEX(col) REGEXP '^(..)*(E[2-9F]|F0A)' will count the number of records with Chinese characters in column col.
Problems:
- I am not sure which ranges of hex represent Chinese.
- The test may include Korean and Japanese ("CJK").
- In MySQL, 4-byte Chinese characters need utf8mb4 instead of utf8.
Elaboration
I am assuming the column in the table has character set utf8. In utf8 encoding, Chinese characters begin with a byte between hex E2 and E9, or EF, or F0. The ones starting with hex E are 3 bytes long, but I am not checking the length; the F0 ones are 4 bytes.
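The regexp starts with ^(..)*, meaning "from the start of the string (^), locate 0 or more (*) 2-character (..) values". Since each byte is 2 hex digits, this keeps the rest of the match aligned to a byte boundary. After that should come either E-something or F0A. After that, anything can occur. The E-something is, more specifically, E followed by any of 2, 3, 4, 5, 6, 7, 8, 9, or F.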
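The logic of the regexp can be tried outside MySQL. Below is a minimal Python sketch that emulates HEX(col) REGEXP on UTF-8 text; the function name looks_chinese and the sample rows are hypothetical, and Python's re module only approximates MySQL's regexp engine. It also shows why the ^(..)* alignment matters.

```python
import re

# Same pattern as the MySQL query. IGNORECASE is used because MySQL's HEX()
# returns uppercase digits while Python's bytes.hex() returns lowercase.
ALIGNED = re.compile(r"^(..)*(e[2-9f]|f0a)", re.IGNORECASE)

# The same alternatives without the byte alignment, for comparison only.
UNALIGNED = re.compile(r"e[2-9f]|f0a", re.IGNORECASE)


def looks_chinese(value: str) -> bool:
    """Emulate HEX(col) REGEXP '^(..)*(E[2-9F]|F0A)' for UTF-8 text."""
    hex_digits = value.encode("utf-8").hex()
    return ALIGNED.search(hex_digits) is not None


# Hypothetical column values.
rows = ["北京实业", "Beijing", "abc草", "naïve"]

# Equivalent of SELECT COUNT(*) ... WHERE HEX(col) REGEXP ...
count = sum(looks_chinese(v) for v in rows)
print(count)  # 2: only "北京实业" and "abc草" match

# "Beijing" is 4265696a696e67 in hex; the digits "e6" straddle the bytes
# 6e and 67, so the unaligned pattern reports a false positive.
beijing_hex = "Beijing".encode("utf-8").hex()
print(UNALIGNED.search(beijing_hex) is not None)  # True
print(ALIGNED.search(beijing_hex) is not None)    # False
```

A row counts once no matter how many matching characters it contains, which is the behavior asked for in the question.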
Picking at random, I see that 草 encodes as the 3 hex bytes E88D89, and 𠜎 encodes as the 4 hex bytes F0A09C8E.
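The two encodings can be checked by decoding the hex back into characters. This is a small Python sketch; the helper name describe is hypothetical. The 4-byte result is the kind of character that needs utf8mb4 in MySQL.

```python
def describe(hex_bytes: str) -> tuple[str, int, int]:
    """Decode UTF-8 hex digits; return (character, code point, byte length)."""
    raw = bytes.fromhex(hex_bytes)
    ch = raw.decode("utf-8")
    return ch, ord(ch), len(raw)


for hex_bytes in ("e88d89", "f0a09c8e"):
    ch, code_point, n_bytes = describe(hex_bytes)
    print(f"{ch} U+{code_point:04X} {n_bytes} bytes")
# 草 U+8349 3 bytes
# 𠜎 U+2070E 4 bytes
```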
I do not know of a better way to check a string for a specific language.
As you found, the regexp can be rather slow.
This regexp is overkill, in that some non-Chinese characters may be captured.
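To see which non-Chinese characters get captured, the byte-range test can be compared with a check on code points. This is a rough Python sketch, not something MySQL does: the names regexp_match and has_han are hypothetical, and treating exactly these three Unicode blocks as "Chinese" is an assumption (they also cover ideographs used in Japanese and Korean text).

```python
import re

# The pattern from the query, applied to lowercase hex digits.
PATTERN = re.compile(r"^(..)*(e[2-9f]|f0a)", re.IGNORECASE)

# Assumption: "Chinese" means a character in one of these Unicode blocks.
HAN_RANGES = (
    (0x3400, 0x4DBF),    # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),    # CJK Unified Ideographs
    (0x20000, 0x2A6DF),  # CJK Unified Ideographs Extension B
)


def regexp_match(value: str) -> bool:
    """Emulate the HEX(col) REGEXP test on UTF-8 text."""
    return PATTERN.search(value.encode("utf-8").hex()) is not None


def has_han(value: str) -> bool:
    """True if any character falls inside one of HAN_RANGES."""
    return any(lo <= ord(ch) <= hi for ch in value for lo, hi in HAN_RANGES)


for sample in ("草", "€5", "ありがとう"):
    print(sample, regexp_match(sample), has_han(sample))
# 草 True True
# €5 True False        (the euro sign is E2 82 AC)
# ありがとう True False  (hiragana starts with byte E3)
```

The euro sign and hiragana both start with a byte in the E2 to E9 range, so the regexp accepts them even though neither is a Chinese character.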