simhash - Online in the Cloud


NAME

simhash - file similarity hash tool

SYNOPSIS

simhash [ -s shingle-size ] [ -f feature-count ] [ file ]
simhash [ -s shingle-size ] [ -f feature-count ] -w file ...
simhash [ -s shingle-size ] [ -f feature-count ] -m file ...
simhash -c hashfile1 hashfile2

DESCRIPTION

This program is used to compute and compare similarity hashes of files. A similarity hash
is a chunk of data that has the property that some distance metric between files is
proportional to some distance metric between the hashes. Typically the similarity hash
will be much smaller than the file itself.

The algorithm used by simhash is Manasse's "shingleprinting" algorithm (see BIBLIOGRAPHY
below): take a hash of every contiguous m-byte substring ("shingle") of the file, and
retain the n of these hashes that are numerically smallest. Here m is the shingle size
(-s) and n is the feature count (-f). The size of the intersection of the hash sets of two
files gives a statistically good estimate of the similarity of the files as a whole.
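
The steps above can be sketched in Python. This is a minimal illustration, not the
simhash implementation: the function names shingleprint and shared_features are
hypothetical, and zlib.crc32 is only a stand-in, since the hash function simhash
actually uses is not stated here.

```python
import heapq
import zlib


def shingleprint(data, shingle_size=8, feature_count=128):
    """Return the smallest distinct shingle hashes of data, in ascending order."""
    # Hash every contiguous shingle_size-byte window of the input.
    # zlib.crc32 is an assumed stand-in for the real hash function.
    hashes = {
        zlib.crc32(data[i:i + shingle_size])
        for i in range(len(data) - shingle_size + 1)
    }
    # Retain at most feature_count of the numerically smallest hashes.
    return heapq.nsmallest(feature_count, hashes)


def shared_features(features_a, features_b):
    """Count the hashes that appear in both feature lists."""
    return len(set(features_a) & set(features_b))


a = shingleprint(b"the quick brown fox jumps over the lazy dog")
b = shingleprint(b"the quick brown fox jumps over the lazy cat")
print(shared_features(a, b), "of", max(len(a), len(b)), "features shared")
```

Inputs shorter than the shingle size produce an empty feature list in this sketch; how
simhash itself treats such inputs is not covered here.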

In its default mode, simhash will compute the similarity hash of its file argument (or
stdin) and write this hash to its standard output. When invoked with the -w argument (see
below), simhash will compute similarity hashes of all of its file arguments in "batch
mode". When invoked with the -m argument (see below), simhash will compare all the given
files using similarity hashes in "match mode". Finally, when invoked with the -c argument
(see below), simhash will report the degree of similarity between two hashes.

OPTIONS

-f feature-count
When computing a similarity hash, retain at most feature-count significant hashes
from the target file. The default is 128 features. Larger feature counts will
give higher resolution in differences between files, will increase the size of the
similarity hash proportionally to the feature count, and will increase similarity
hash computation time slightly.

-s shingle-size
When computing a similarity hash, use hashes of samples consisting of shingle-size
consecutive bytes drawn from the target file. The default is 8 bytes; the minimum
is 4 bytes. Larger shingle sizes will emphasize the differences between files more
and will slow the similarity hash computation proportionally to the shingle size.

-c hashfile1 hashfile2
Display the distance (normalized to the range 0..1) between the similarity hash
stored in hashfile1 and the similarity hash stored in hashfile2.
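
One plausible way to obtain a distance in the range 0..1 from two sets of retained
hashes is the Jaccard distance. The sketch below assumes that measure; the exact
formula simhash applies is not given in this page, and normalized_distance is a
hypothetical name.

```python
def normalized_distance(features_a, features_b):
    """Jaccard distance between two collections of retained hashes (0 = identical)."""
    a, b = set(features_a), set(features_b)
    union = a | b
    if not union:
        # Two empty hashes are treated as identical (an assumption).
        return 0.0
    return 1.0 - len(a & b) / len(union)


print(normalized_distance([1, 2, 3], [2, 3, 4]))
```

Under this measure, 0 means the two hashes retain exactly the same features and 1 means
they share none.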

-w file ...
Write the similarity hash of each file argument to a corresponding file named
file.sim.

-m file ...
Compute the similarity hash of each of the file arguments, and output a similarity
matrix for those files.
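
As a rough illustration of match mode, the sketch below builds a symmetric matrix of
pairwise scores from precomputed feature sets. The score used (shared features divided
by the union of features) and the name similarity_matrix are assumptions, and the
layout simhash prints is not specified in this page.

```python
from itertools import combinations


def similarity_matrix(features):
    """Map {name: set of hashes} to a symmetric matrix of pairwise scores."""
    names = list(features)
    n = len(names)
    # Start with 1.0 everywhere; the diagonal (a file against itself) stays 1.0.
    matrix = [[1.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        a, b = features[names[i]], features[names[j]]
        union = a | b
        # Assumed score: fraction of the combined features that both files share.
        score = len(a & b) / len(union) if union else 1.0
        matrix[i][j] = matrix[j][i] = score
    return matrix


example = {"a.txt": {1, 2, 3}, "b.txt": {2, 3, 4}, "c.txt": {9}}
for name, row in zip(example, similarity_matrix(example)):
    print(name, row)
```

Only the upper triangle is computed; each score is mirrored because the measure is
symmetric.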
