php - How do I speed up this BIT_COUNT query for hamming distance?
I have a PHP script that checks the hamming distance between two still photos taken by a security camera.
The MySQL table has 2.4M rows and consists of a key and four INT(10)s. The INT(10)s have been indexed individually, together, and with the key, but I don't have significant evidence that any combination is faster than the others. I can try again if you suggest so.
The hamming weights are calculated by transforming the image to 8x16 pixels, and each quarter of the bits is stored in its own column: phash0, phash1, etc.
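(For context, a minimal sketch of how the four columns might be produced on the PHP side, assuming the perceptual hash arrives as a 32-character hex string; the helper name splitPhash is hypothetical, not from the question:)

    <?php
    // Hypothetical helper: split a 128-bit perceptual hash (32 hex chars)
    // into four unsigned 32-bit integers matching phash0..phash3.
    function splitPhash(string $hex): array
    {
        $parts = str_split($hex, 8);        // four 8-hex-digit chunks
        return array_map('hexdec', $parts); // each chunk -> 32-bit int
    }

    $phash = splitPhash('89abcdef0123456789abcdef01234567');
    // $phash[0] .. $phash[3] can now be bound to the query placeholders.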
There are two ways I have written it. The first way uses nested derived tables. Theoretically, each derivation should have less data to check than its predecessor. The query is a prepared statement, and the ? fields are phash[0-3] of the file I'm checking against.
    SELECT `key`, BIT_COUNT(t3.phash3 ^ ?) + t3.bc2 AS bc3
    FROM (
        SELECT *, BIT_COUNT(t2.phash2 ^ ?) + t2.bc1 AS bc2
        FROM (
            SELECT *, BIT_COUNT(t1.phash1 ^ ?) + t1.bc0 AS bc1
            FROM (
                SELECT `key`, phash0, phash1, phash2, phash3,
                       BIT_COUNT(phash0 ^ ?) AS bc0
                FROM files
                WHERE phash0 IS NOT NULL AND BIT_COUNT(phash0 ^ ?) < 4
            ) AS t1
            WHERE BIT_COUNT(t1.phash1 ^ ?) + t1.bc0 < 4
        ) AS t2
        WHERE BIT_COUNT(t2.phash2 ^ ?) + t2.bc1 < 4
    ) AS t3
    WHERE BIT_COUNT(t3.phash3 ^ ?) + t3.bc2 < 4
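For reference, a minimal sketch of how this prepared statement might be bound from PHP with mysqli (connection details are assumptions, not from the question). Reading the SQL top-down, the eight placeholders take the chunks in the order phash3, phash2, phash1, phash0, then phash0 through phash3 again:

    <?php
    // Sketch: bind the eight placeholders of the nested query.
    // $phash holds the four 32-bit chunks (see the helper sketch above).
    $mysqli = new mysqli('localhost', 'user', 'pass', 'photos');
    $stmt   = $mysqli->prepare($sql); // $sql holds the nested query above
    $stmt->bind_param(
        'iiiiiiii',
        $phash[3], $phash[2], $phash[1], $phash[0], // the four SELECT expressions
        $phash[0], $phash[1], $phash[2], $phash[3]  // the four WHERE filters
    );
    $stmt->execute();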
The second approach is a bit more direct. It does all of the work at once.
    SELECT `key`
    FROM files
    WHERE phash0 IS NOT NULL
      AND BIT_COUNT(phash0 ^ ?) + BIT_COUNT(phash1 ^ ?)
        + BIT_COUNT(phash2 ^ ?) + BIT_COUNT(phash3 ^ ?) < 4
The first query is faster on large recordsets, while the second is faster on smaller recordsets, but neither exceeds about 1-1/3 seconds per compare on 2.4M records.
Do you see any way of tweaking the process to make it faster? Any suggestions can be tried, such as changing datatypes or indexes.
The setup is Win7 x64, MySQL 5.6.6 with InnoDB, nginx 1.99, and PHP-CGI 7.0.0 with Zend enabled. The script is called from a webpage and has buffering turned off for immediate feedback.
EDIT:
It might work better if I change the four 32-bit integers to one BINARY(16) and change the compares from four to one, but I'd have to convert the four parameters into one 128-bit string, which PHP won't do natively. If there is a fast way to combine them, I might be able to squeeze a bit more time off.
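(For what it's worth, PHP's pack() can produce a 16-byte binary string from four 32-bit integers; whether big-endian word order matches the BINARY(16) layout intended here is an assumption:)

    <?php
    // Sketch: combine four unsigned 32-bit ints into one 16-byte binary
    // string (big-endian word order is an assumption about the layout).
    $binary16 = pack('N4', $phash[0], $phash[1], $phash[2], $phash[3]);
    var_dump(strlen($binary16)); // int(16)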
EDIT: The accepted answer has increased the speed by ~500%. A quick synopsis of our hypothesis: the bitcount of phash "A" is within the bitcount of phash "B" +/- the hamming distance.
Special thanks to @duskwuff for the tenacity and patience. Cheers, @duskwuff!
EDIT: The most recent query:
    SELECT files.`key`,
           BIT_COUNT(? ^ phash0) + BIT_COUNT(? ^ phash1)
         + BIT_COUNT(? ^ phash2) + BIT_COUNT(? ^ phash3) AS bc
    FROM files FORCE INDEX (bitcount)
    WHERE bitcount BETWEEN ? AND ?
      AND BIT_COUNT(? ^ phash0) + BIT_COUNT(? ^ phash1)
        + BIT_COUNT(? ^ phash2) + BIT_COUNT(? ^ phash3) <= ?
    ORDER BY BIT_COUNT(? ^ phash0) + BIT_COUNT(? ^ phash1)
           + BIT_COUNT(? ^ phash2) + BIT_COUNT(? ^ phash3)
where the first four "?"s represent the four 32-bit hashes of the file being checked, the next two "?"s represent the pre-calculated bitcount of the file +/- the desired hamming distance, and the last "?" represents the hamming distance (the four hash placeholders are repeated in the WHERE and ORDER BY clauses). The ORDER BY clause is necessary to bring the closest matches to the top, and a LIMIT 1 clause returns only the best match. There is a B-tree index on the bitcount field.
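A sketch of the corresponding binding from PHP, with example values standing in for the real ones (variable names and values are assumptions, not from the question):

    <?php
    // Sketch: bind the 15 positional placeholders of the final query.
    // $h is the probe's four 32-bit hash chunks, $tb its precalculated
    // total bit count, $d the maximum hamming distance.
    $h  = [0x89ABCDEF, 0x01234567, 0xFEDCBA98, 0x76543210]; // example chunks
    $tb = 64;                                               // example bit count
    $d  = 3;                                                // max distance
    $lo = $tb - $d;
    $hi = $tb + $d;
    $stmt = $mysqli->prepare($sql);      // $sql holds the query above
    $stmt->bind_param(
        'iiiiiiiiiiiiiii',               // 15 integer parameters
        $h[0], $h[1], $h[2], $h[3],      // SELECT list
        $lo, $hi,                        // bitcount BETWEEN ? AND ?
        $h[0], $h[1], $h[2], $h[3], $d,  // WHERE sum <= ?
        $h[0], $h[1], $h[2], $h[3]       // ORDER BY sum
    );
    $stmt->execute();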
The dispersion of bitcounts over the 2.4 million files fell into a bell curve, with 3 or 4 files on the extremes and about 70,000 in the center. If a given file has a bitcount of 64 (which is the worst case), looking for files within a hamming distance of 3 means comparing against 20% of the files (490,000 in my case), whereas looking for a hamming distance of 0 compares against only 2.8% of the records (70,000, of course).
The accepted answer (from @duskwuff):

Observe that BIT_COUNT(a ^ b) is bounded below by the difference between BIT_COUNT(a) and BIT_COUNT(b). (That is, it is at least equal to the difference, and may be greater.) If you precalculate the total bit count for each row, you can use this to rule out rows whose total bit count is too far from that of the target. Better yet, you can create an index on that column, and the index will be used.
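A quick sanity check of that bound in PHP (illustrative values only, not from the answer):

    <?php
    // a and b each have two set bits, yet a ^ b has four:
    // BIT_COUNT(a ^ b) = 4 >= |BIT_COUNT(a) - BIT_COUNT(b)| = 0.
    $a = 0b1010;
    $b = 0b0101;
    printf("%d >= %d\n",
        substr_count(decbin($a ^ $b), '1'),
        abs(substr_count(decbin($a), '1') - substr_count(decbin($b), '1'))
    ); // prints "4 >= 0"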
What I'd have in mind is something along the lines of:
    ALTER TABLE files ADD COLUMN totalbits INTEGER;
    CREATE INDEX totalbits_index ON files (totalbits);

    UPDATE files
    SET totalbits = BIT_COUNT(phash0) + BIT_COUNT(phash1)
                  + BIT_COUNT(phash2) + BIT_COUNT(phash3);

    SELECT `key` FROM files
    WHERE (totalbits BETWEEN … AND …) AND …
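On the PHP side, the probe image's own total bit count, used to compute the BETWEEN bounds, can be derived the same way; a sketch with example values (a native popcount would do equally well):

    <?php
    // Sketch: total bit count of the probe's four 32-bit chunks,
    // mirroring what the totalbits column stores per row.
    $h  = [0x89ABCDEF, 0x01234567, 0xFEDCBA98, 0x76543210]; // example chunks
    $tb = 0;
    foreach ($h as $chunk) {
        $tb += substr_count(decbin($chunk), '1');
    }
    // Rows worth comparing satisfy: totalbits BETWEEN $tb - $d AND $tb + $d,
    // where $d is the maximum hamming distance of interest.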
Note that, with this in place, there's no need to split the hashes into four chunks. Combining them into a single column may make things easier.