ImgArchive uniquely identifies all images in the archive as part of the import process. This is carried out for three main reasons. The first being to stop inadvertent duplication of images in the archive, the second being to uniquely identify an image, so for example the metadata text can be associated with its image using this identification, and the third reason is to identify a corrupted image file. This can be thought of as a form of finger printing.

The primary way for unique this identification is made is by using the SHA256 algorithm and a secondary identification is a Cyclic redundancy checksum which is no were near as strong as the SHA256 algorithm but easier to calculate.

This sections below give some background on the terms used in the section on the unique identifier.

Cyclic redundancy checksum (CRC)

CRC is an error-detecting code commonly used in digital networks and storage devices to detect accidental changes to raw data. Blocks of data entering these systems get a short check value (CRC) attached, based on the remainder of a polynomial division of their contents. On retrieval, the calculation is repeated and, in the event the check values do not match, corrective action can be taken against data corruption. The Cyclic redundancy checksum is strong checksum. The version used in ImgArchive is the same is the one used by the PNG image format. In the future ImgArchive will use a 64 bit CRC on the 64 bit versions of the software.

Cryptographic Hash Functions

Cryptographic hash functions have many information security applications, such as digital signatures, message authentication codes (MACs), and other forms of authentication.

However ImgArchive uses hashing functions to be used as ordinary hash functions for fingerprinting, to detect duplicate data and uniquely identify files. Plus, as a checksum to detect accidental data corruption.

Most cryptographic hash functions are designed to take a block of data in this case an image file of any length as input and produce a fixed-length hash value.

A cryptographic hash function must be able to withstand all known types of cryptanalytic attack. In theoretical cryptography.

File or data identifiers

A message hash can also serve as a means of reliably identifying a file; several source code management systems, including Git, Mercurial and Monotone, use the sha1sum of various types of content to uniquely identify them. Hashes are used to identify files on peer-to-peer file sharing networks

They tend to be used in contexts where it is necessary for users to protect themselves against the possibility of forgery by potentially malicious participants.

However this is not a requirement for the archive and it can still be used as a checksum to verify data integrity, but only against unintentional corruption.

Why use two checksums

A CRC checksum is quick to calculate and in this implementation 32 bits. That means 4,294,967,295 combinations of checksums can be made. However the likelihood of having the same checksum for two different images files is to frequent when having a CRC as the only means of guaranteeing a unique identification. Something like a hash is more appropriate. However, only if a two CRC match does the hash need to be used. ImgArchive contains a database of CRCs to be used as a lookup of all images in the archive. Any image files with duplicates CRCs are stored together but will have a different hash for their unique identification.

The 3 most used algorithms used for file hashing

MD5: The fastest and shortest generated hash (16 bytes). The probability of just two hashes accidentally colliding is approximately: 1.47*10^-29.

SHA1: Is generally 20% slower than md5, the generated hash is a bit longer than MD5 (20 bytes). The probability of just two hashes accidentally colliding is approximately: 1*10^-45

SHA256: The slowest, usually 60% slower than md5, and the longest generated hash (32 bytes). The probability of just two hashes accidentally colliding is approximately: 4.3*10^-60.

MD5

MD5 was designed by Ronald Rivest in 1991 to replace an earlier hash function, MD4, and was specified in 1992 as RFC 1321.

The MD5 algorithm is a widely used hash function producing a 128-bit hash value. Although MD5 was initially designed to be used as a cryptographic hash function, it has been found to suffer from extensive vulnerabilities. Collisions against MD5 can be calculated within seconds which makes the algorithm unsuitable for most use cases where a cryptographic hash is required. MD5 produces a digest of 128 bits (16 bytes).

The probability for the clash for the md5

The probability for the MD5 algorithm to generate the same number for different images is extremely low. You need to hash about 2^64 values to get a single collision among them, on average, if you don't try to deliberately create collisions. Hash collisions are very similar to the Birthday problem see below.

If you look at two arbitrary values, the collision probability is only 2^64 is 18,446,744,073,709,551,616. That's 17,179,869,184 gigabytes. If in seconds then 584,942,400,000 years from now.

The problem with MD5 is that it is possible to craft two different texts that hash to the same value. But this requires a deliberate attack, and doesn't happen accidentally. And even with a deliberate attack it's currently not feasible to get a plain text matching a given hash.

Resilient against intentional corruption ImgArchive is not designed to be resilient against intentional corruption. Like most hash functions, MD5 can be reversed by brute-force attack. However MD5 is used for digital certificates or identity certificates and to intentional attack a MD5 will take time and a knowledge of cryptographic cyphers.

SHA-1

SHA-1 was developed as part of the U.S. Government's Capstone project. It was withdrawn by the NSA shortly after publication and was superseded by the revised version, published in 1995. Collisions against the full SHA-1 algorithm can be produced using the shattered attack and the hash function should be considered broken. SHA-1 produces a hash digest of 160 bits (20 bytes).

SHA-256

SHA-256 is one of the successor hash functions to SHA-1 and is one of the strongest hash functions available. SHA-256 is one of a family of SHA-2 hash functions. SHA-256 is not much more complex to code than SHA-1, and has not yet been compromised in any way. It is defined in the NIST (National Institute of Standards and Technology) standard ‘FIPS 180-4’. NIST also provide a number of test vectors to verify correctness of implementation. SHA-256 and SHA-512 are novel hash functions computed with eight 32-bit and 64-bit words, respectively. They use different shift amounts and additive constants, but their structures are otherwise virtually identical, differing only in the number of rounds.

Is it safe to ignore the possibility of SHA collisions in practice?

The usual answer goes thus: what is the probability that a rogue asteroid crashes on Earth within the next second, obliterating civilization-as-we-know-it, and killing off a few billion people? It can be argued that any unlucky event with a probability lower than that is not actually very important.

For instance, if SHA-256 generates one billion messages then the probability is about 4.3*10^-60.

A rogue asteroid crash happens about once every 30 million years on average. This leads to a probability of such an event occurring in the next second to about 10^-15. That's 45 orders of magnitude more probable than the SHA-256 collision. Briefly stated, if you find SHA-256 collisions scary then your priorities are wrong.

In a security setup, where an attacker gets to choose the messages which will be hashed, then the attacker may use substantially more than a billion messages; however, you will find that the attacker's success probability will still be vanishingly small. The whole point of using a hash function with a 256-bit output: so that risks of collision can be neglected.

The birthday problem

In probability theory, the birthday problem or birthday paradox concerns the probability that, in a set of a number randomly chosen people, some pair of them will have the same birthday. By the pigeonhole principle, the probability reaches 100% when the number of people reaches 367 (since there are only 366 possible birthdays, including February 29). However, 99.9% probability is reached with just 70 people, and 50% probability with 23 people. These conclusions are based on the assumption that each day of the year (except February 29) is equally probable for a birthday.

ImgArchive Wiki : Unique Identification