php - How do I speed up this BIT_COUNT query for hamming distance?
I have a PHP script that checks the hamming distance between two still photos taken by a security camera.
The MySQL table has 2.4M rows and consists of a key and four INT(10)s. The INT(10)s have been indexed individually, together, and with the key, but I don't have significant evidence that any combination is faster than the others. I can try again if you suggest so.
The hamming weights are calculated by transforming the image to 8x16 pixels, and each quarter of the bits is stored in its own column: phash0, phash1, etc.
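(For context, a minimal sketch of how the four columns might be produced on the PHP side, assuming the perceptual hash arrives as a 32-character hex string; the helper name splitPhash is hypothetical, not from the question:)

    <?php
    // Hypothetical helper: split a 128-bit perceptual hash (32 hex chars)
    // into four unsigned 32-bit integers matching phash0..phash3.
    function splitPhash(string $hex): array
    {
        $parts = str_split($hex, 8);        // four 8-hex-digit chunks
        return array_map('hexdec', $parts); // each chunk -> 32-bit int
    }

    $phash = splitPhash('89abcdef0123456789abcdef01234567');
    // $phash[0] .. $phash[3] can now be bound to the query placeholders.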
There are two ways I have written it. The first way uses nested derived tables. Theoretically, each derivation should have less data to check than its predecessor. The query is a prepared statement, and the ? fields are phash[0-3] of the file I'm checking against.
    SELECT `key`, BIT_COUNT(t3.phash3 ^ ?) + t3.bc2 AS bc3
    FROM (
        SELECT *, BIT_COUNT(t2.phash2 ^ ?) + t2.bc1 AS bc2
        FROM (
            SELECT *, BIT_COUNT(t1.phash1 ^ ?) + t1.bc0 AS bc1
            FROM (
                SELECT `key`, phash0, phash1, phash2, phash3,
                       BIT_COUNT(phash0 ^ ?) AS bc0
                FROM files
                WHERE phash0 IS NOT NULL AND BIT_COUNT(phash0 ^ ?) < 4
            ) AS t1
            WHERE BIT_COUNT(t1.phash1 ^ ?) + t1.bc0 < 4
        ) AS t2
        WHERE BIT_COUNT(t2.phash2 ^ ?) + t2.bc1 < 4
    ) AS t3
    WHERE BIT_COUNT(t3.phash3 ^ ?) + t3.bc2 < 4
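For reference, a minimal sketch of how this prepared statement might be bound from PHP with mysqli (connection details are assumptions, not from the question). Reading the SQL top-down, the eight placeholders take the chunks in the order phash3, phash2, phash1, phash0, then phash0 through phash3 again:

    <?php
    // Sketch: bind the eight placeholders of the nested query.
    // $phash holds the four 32-bit chunks (see the helper sketch above).
    $mysqli = new mysqli('localhost', 'user', 'pass', 'photos');
    $stmt   = $mysqli->prepare($sql); // $sql holds the nested query above
    $stmt->bind_param(
        'iiiiiiii',
        $phash[3], $phash[2], $phash[1], $phash[0], // the four SELECT expressions
        $phash[0], $phash[1], $phash[2], $phash[3]  // the four WHERE filters
    );
    $stmt->execute();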
The second approach is a bit more direct. It does all of the work at once.
    SELECT `key`
    FROM files
    WHERE phash0 IS NOT NULL
      AND BIT_COUNT(phash0 ^ ?) + BIT_COUNT(phash1 ^ ?)
        + BIT_COUNT(phash2 ^ ?) + BIT_COUNT(phash3 ^ ?) < 4
The first query is faster on large recordsets, while the second is faster on smaller recordsets, but neither exceeds about 1-1/3 seconds per compare on 2.4M records.
Do you see any way of tweaking the process to make it faster? Any suggestions can be tried, such as changing datatypes or indexes.
The setup is Win7 x64, MySQL 5.6.6 with InnoDB, nginx 1.99, and PHP-CGI 7.0.0 with Zend enabled. The script is called from a webpage and has buffering turned off for immediate feedback.
EDIT:
It might work better if I change the four 32-bit integers to one BINARY(16) and change the compares from four to one, but I'd have to convert the four parameters into one 128-bit string, which PHP won't do natively. If there is a fast way to combine them, I might be able to squeeze a bit more time off.
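(For what it's worth, PHP's pack() can produce a 16-byte binary string from four 32-bit integers; whether big-endian word order matches the BINARY(16) layout intended here is an assumption:)

    <?php
    // Sketch: combine four unsigned 32-bit ints into one 16-byte binary
    // string (big-endian word order is an assumption about the layout).
    $binary16 = pack('N4', $phash[0], $phash[1], $phash[2], $phash[3]);
    var_dump(strlen($binary16)); // int(16)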
EDIT: The accepted answer has increased the speed by ~500%. A quick synopsis of our hypothesis: the bitcount of phash "A" is within the bitcount of phash "B" +/- the hamming distance.
Special thanks to @duskwuff for the tenacity and patience. Cheers, @duskwuff!
EDIT: The most recent query:
    SELECT files.`key`,
           BIT_COUNT(? ^ phash0) + BIT_COUNT(? ^ phash1)
         + BIT_COUNT(? ^ phash2) + BIT_COUNT(? ^ phash3) AS bc
    FROM files FORCE INDEX (bitcount)
    WHERE bitcount BETWEEN ? AND ?
      AND BIT_COUNT(? ^ phash0) + BIT_COUNT(? ^ phash1)
        + BIT_COUNT(? ^ phash2) + BIT_COUNT(? ^ phash3) <= ?
    ORDER BY BIT_COUNT(? ^ phash0) + BIT_COUNT(? ^ phash1)
           + BIT_COUNT(? ^ phash2) + BIT_COUNT(? ^ phash3)
where the first four "?"s represent the four 32-bit hashes of the file being checked, the next two "?"s represent the pre-calculated bitcount of the file +/- the desired hamming distance, and the last "?" represents the hamming distance (the four hash placeholders are repeated in the WHERE and ORDER BY clauses). The ORDER BY clause is necessary to bring the closest matches to the top, and a LIMIT 1 clause returns only the best match. There is a B-tree index on the bitcount field.
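A sketch of the corresponding binding from PHP, with example values standing in for the real ones (variable names and values are assumptions, not from the question):

    <?php
    // Sketch: bind the 15 positional placeholders of the final query.
    // $h is the probe's four 32-bit hash chunks, $tb its precalculated
    // total bit count, $d the maximum hamming distance.
    $h  = [0x89ABCDEF, 0x01234567, 0xFEDCBA98, 0x76543210]; // example chunks
    $tb = 64;                                               // example bit count
    $d  = 3;                                                // max distance
    $lo = $tb - $d;
    $hi = $tb + $d;
    $stmt = $mysqli->prepare($sql);      // $sql holds the query above
    $stmt->bind_param(
        'iiiiiiiiiiiiiii',               // 15 integer parameters
        $h[0], $h[1], $h[2], $h[3],      // SELECT list
        $lo, $hi,                        // bitcount BETWEEN ? AND ?
        $h[0], $h[1], $h[2], $h[3], $d,  // WHERE sum <= ?
        $h[0], $h[1], $h[2], $h[3]       // ORDER BY sum
    );
    $stmt->execute();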
The dispersion of bitcounts over the 2.4 million files fell into a bell curve, with 3 or 4 files on the extremes and about 70,000 in the center. If a given file has a bitcount of 64 (which is the worst case), looking for files within a hamming distance of 3 means comparing against 20% of the files (490,000 in my case), whereas looking for a hamming distance of 0 compares against only 2.8% of the records (70,000, of course).
The accepted answer (from @duskwuff):

Observe that BIT_COUNT(a ^ b) is bounded below by the difference between BIT_COUNT(a) and BIT_COUNT(b). (That is, it is at least equal to the difference, and may be greater.) If you precalculate the total bit count for each row, you can use this to rule out rows whose total bit count is too far from that of the target. Better yet, you can create an index on that column, and the index will be used.
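A quick sanity check of that bound in PHP (illustrative values only, not from the answer):

    <?php
    // a and b each have two set bits, yet a ^ b has four:
    // BIT_COUNT(a ^ b) = 4 >= |BIT_COUNT(a) - BIT_COUNT(b)| = 0.
    $a = 0b1010;
    $b = 0b0101;
    printf("%d >= %d\n",
        substr_count(decbin($a ^ $b), '1'),
        abs(substr_count(decbin($a), '1') - substr_count(decbin($b), '1'))
    ); // prints "4 >= 0"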
What I'd have in mind is something along the lines of:
    ALTER TABLE files ADD COLUMN totalbits INTEGER;
    CREATE INDEX totalbits_index ON files (totalbits);

    UPDATE files
    SET totalbits = BIT_COUNT(phash0) + BIT_COUNT(phash1)
                  + BIT_COUNT(phash2) + BIT_COUNT(phash3);

    SELECT `key` FROM files
    WHERE (totalbits BETWEEN … AND …) AND …
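On the PHP side, the probe image's own total bit count, used to compute the BETWEEN bounds, can be derived the same way; a sketch with example values (a native popcount would do equally well):

    <?php
    // Sketch: total bit count of the probe's four 32-bit chunks,
    // mirroring what the totalbits column stores per row.
    $h  = [0x89ABCDEF, 0x01234567, 0xFEDCBA98, 0x76543210]; // example chunks
    $tb = 0;
    foreach ($h as $chunk) {
        $tb += substr_count(decbin($chunk), '1');
    }
    // Rows worth comparing satisfy: totalbits BETWEEN $tb - $d AND $tb + $d,
    // where $d is the maximum hamming distance of interest.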
Note that, with this in place, there's no need to split the hashes into four chunks. Combining them into a single column may make things easier.