utf-8 - How to detect Chinese characters in MySQL?


I need to calculate the number of Chinese entries in a list of columns. For example, if "北京实业" occurs, that is 4 Chinese characters, but I count it only once since it occurs in one column value.

Is there any specific code to figure this out?

select count(*) from tbl where hex(col) regexp '^(..)*(e[2-9f]|f0a)'

will count the number of records that have Chinese characters in column col.

Problems:

  • I am not sure which ranges of hex represent Chinese.
  • The test may also include Korean and Japanese ("CJK").
  • In MySQL, 4-byte Chinese characters need utf8mb4 instead of utf8.

Elaboration

I am assuming the column in the table has character set utf8. In utf8 encoding, Chinese characters begin with a byte between hex E2 and E9, or EF, or F0. The ones starting with hex E are 3 bytes long, but I am not checking the length; the F0 ones are 4 bytes.

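The regexp starts with ^(..)*, meaning "from the start of the string (^), locate 0 or more (*) 2-character (..) values". Each 2-character value is one byte of the hex dump, so the match stays aligned on byte boundaries. After that should come either E-something or F0A. After that, anything can occur. The E-something is, more specifically, E followed by any of 2, 3, 4, 5, 6, 7, 8, 9, or F.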
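To see how the pattern behaves outside the database, here is a minimal Python sketch. It assumes that encoding a string as UTF-8 and hex-dumping it is a fair stand-in for HEX(col) on a utf8 column; hex_of and looks_chinese are hypothetical helper names, not MySQL functions.

```python
import re

# Same pattern as in the query. MySQL's REGEXP is case-insensitive for
# non-binary strings, so re.IGNORECASE is used here to mimic that.
PATTERN = re.compile(r"^(..)*(e[2-9f]|f0a)", re.IGNORECASE)


def hex_of(text: str) -> str:
    """Approximate HEX(col): upper-case hex dump of the UTF-8 bytes."""
    return text.encode("utf-8").hex().upper()


def looks_chinese(text: str) -> bool:
    """True if some byte-aligned position starts with E2-E9, EF, or F0 Ax."""
    return PATTERN.search(hex_of(text)) is not None


if __name__ == "__main__":
    for sample in ("hello", "abc北京实业"):
        print(sample, hex_of(sample), looks_chinese(sample))
```

Because ^(..)* only ever consumes whole pairs of hex digits, the E or F0A is always tested at the start of a byte, never in the middle of one.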

Picked at random, I see that 草 encodes as 3 hex bytes E88D89, and 𠜎 encodes as 4 hex bytes F0A09C8E.
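The two hex values above can be checked by decoding them back as UTF-8. This is only a sketch in Python, not something MySQL runs; describe_hex is a hypothetical helper name.

```python
def describe_hex(hex_bytes: str):
    """Decode a hex dump as UTF-8 and return (text, code points, byte count)."""
    raw = bytes.fromhex(hex_bytes)
    text = raw.decode("utf-8")
    return text, [f"U+{ord(ch):04X}" for ch in text], len(raw)


if __name__ == "__main__":
    for dump in ("E88D89", "F0A09C8E"):
        print(dump, describe_hex(dump))
```

The first value decodes to a single character in the Basic Multilingual Plane (3 bytes), the second to a single character above U+FFFF (4 bytes), which is the case that needs utf8mb4 in MySQL.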

I do not know of a better way to check a string for a specific language.

As you have found, the regexp can be rather slow.

This regexp is somewhat over-kill, in that some non-Chinese characters may be captured.
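A short Python sketch of that over-matching, under the same assumption that a UTF-8 hex dump approximates HEX(col); matches_pattern is a hypothetical helper name. The euro sign and a Japanese hiragana letter are used as examples of non-Chinese characters whose lead bytes fall in the E2-E9 range.

```python
import re

# Same pattern as in the query, matched case-insensitively.
PATTERN = re.compile(r"^(..)*(e[2-9f]|f0a)", re.IGNORECASE)


def matches_pattern(text: str) -> bool:
    """Apply the pattern to the upper-case hex dump of the UTF-8 bytes."""
    return PATTERN.search(text.encode("utf-8").hex().upper()) is not None


if __name__ == "__main__":
    # "€" is U+20AC (E2 82 AC); "あ" is U+3042 (E3 81 82).
    for sample in ("price: 5€", "あ", "plain ascii"):
        print(repr(sample), matches_pattern(sample))
```

Both non-Chinese samples are reported as matches, so the count from the query is best read as an upper bound on rows that contain Chinese text.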

