Is there really no use for MD5 anymore?


I read an article about password schemes that makes two seemingly conflicting claims:




MD5 is broken; it’s too slow to use as a general purpose hash; etc



The problem is that MD5 is fast




I know that MD5 should not be used for password hashing, and that it also should not be used for integrity checking of documents. There are way too many sources citing MD5 preimage attacks and MD5's low computation time.



However, I was under the impression that MD5 can still be used as a non-cryptographic hash function:




  1. Identifying malicious files, such as when Linux Mint's download servers were compromised and an ISO file was replaced by a malicious one; in this case you want to be sure that your file doesn't match; collision attacks aren't a vector here.

  2. Finding duplicate files. By MD5-summing all files in a directory structure it's easy to find identical hashes. The seemingly identical files can then be compared in full to check whether they are really identical (see the sketch below). Using SHA512 would make the process slower, and since we compare files in full anyway there is no risk from a potential false positive from MD5. (In a way, this would be creating a rainbow table where all the files are the dictionary.)
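A minimal sketch of this approach (illustrative names and chunk size; the final byte-for-byte check is left to a separate step):

    import hashlib
    import os
    from collections import defaultdict

    def md5_of(path, chunk=1 << 20):
        # Stream the file so that large files never need to fit in memory.
        h = hashlib.md5()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(chunk), b""):
                h.update(block)
        return h.hexdigest()

    def candidate_duplicates(root):
        # Group paths by MD5 digest; every group with more than one entry is a
        # candidate set that still gets compared in full afterwards.
        groups = defaultdict(list)
        for dirpath, _, names in os.walk(root):
            for name in names:
                path = os.path.join(dirpath, name)
                groups[md5_of(path)].append(path)
        return [paths for paths in groups.values() if len(paths) > 1]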


There are checksums, of course, but in my experience the likelihood of finding two different files with the same MD5 hash is very low as long as we can rule out foul play.



When the password scheme article states that "MD5 is fast", it clearly refers to the problem that MD5 is too cheap to compute when hashing a large number of passwords to find the reverse of a hash. But what does it mean when it says that "[MD5 is] too slow to use as a general purpose hash"? Are there faster standardized hashes for comparing files that still have a reasonably low chance of collision?


















  • The collision attack is the problem, however, MD5 still has pre-image resistance. – kelalaka, 20 hours ago






  • "Using SHA512 would make the process slower ..." - On my system, openssl speed reports 747MB/s for MD5, and 738MB/s for SHA-512, so that's barely a difference ;) – marcelm, 7 hours ago










  • I believe MD5 can still be used as a PRF. – jww, 1 hour ago


















Tags: hash, md5






asked 20 hours ago by jornane








5 Answers


















Answer (16 votes) – Squeamish Ossifrage


I know that MD5 should not be used for password hashing, and that it also should not be used for integrity checking of documents. There are way too many sources citing MD5 preimage attacks and MD5's low computation time.




There is no published preimage attack on MD5 that is cheaper than a generic attack on any 128-bit hash function. But you shouldn't rely on that alone when making security decisions, because cryptography is tricky and adversaries are clever and resourceful and can find ways around it!





  1. Identifying malicious files, such as when Linux Mint's download servers were compromised and an ISO file was replaced by a malicious one; in this case you want to be sure that your file doesn't match; collision attacks aren't a vector here.




The question of whether to publish known-good vs. known-bad hashes after a compromise is addressed elsewhere—in brief, there's not much that publishing known-bad hashes accomplishes, and according to the citation, Linux Mint published known-good, not known-bad, hashes. So what security do you get from known-good MD5 hashes?



There are two issues here:





  1. If you got the MD5 hash from the same source as the ISO image, there's nothing that would prevent an adversary from replacing both the MD5 hash and the ISO image.



    To prevent this, you and the Linux Mint curators need two channels: one for the hashes which can't be compromised (but need only have very low bandwidth), and another for the ISO image (which needs high bandwidth) on which you can then use the MD5 hash in an attempt to detect compromise.



    There's another way to prevent this: Instead of using the uncompromised channel for the hash of every ISO image over and over again as time goes on—which means more and more opportunities for an attacker to subvert it—use it once initially for a public key, which is then used to sign the ISO images; then there's only one opportunity for an attacker to subvert the public key channel. (A small signing-and-verification sketch follows after this list.)




  2. Collision attacks may still be a vector in cases like this. Consider the following scenario:




    • I am an evil developer. I write two software packages, whose distributions collide under MD5. One of the packages is benign and will survive review and audit. The other one will surreptitiously replace your family photo album by erotic photographs of sushi.

    • The Linux Mint curators carefully scrutinize and audit everything they publish in their package repository and publish the MD5 hashes of what they have audited in a public place that I can't compromise.

    • The Linux Mint curators cavalierly administer the package distributions in their package repository, under the false impression that the published MD5 hashes will protect users.


    In this scenario, I can replace the benign package by the erotic sushi package, pass the MD5 verification with flying colors, and give you a nasty—and luscious—surprise when you try to look up photos of that old hiking trip you took your kids on.
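To make the sign-once idea from point 1 concrete, here is a minimal sketch using Ed25519 from the third-party Python package "cryptography" (an illustrative choice on my part; any signature scheme works), with a hypothetical file name:

    from cryptography.exceptions import InvalidSignature
    from cryptography.hazmat.primitives import serialization
    from cryptography.hazmat.primitives.asymmetric.ed25519 import (
        Ed25519PrivateKey,
        Ed25519PublicKey,
    )

    # Done once by the curators. Only the 32-byte public key has to travel over
    # the hard-to-compromise, low-bandwidth channel.
    signing_key = Ed25519PrivateKey.generate()
    public_key_bytes = signing_key.public_key().public_bytes(
        encoding=serialization.Encoding.Raw,
        format=serialization.PublicFormat.Raw,
    )

    # Done for every release: sign the ISO image (hypothetical file name).
    iso_bytes = open("linuxmint.iso", "rb").read()
    signature = signing_key.sign(iso_bytes)

    # Done by every downloader, using only the public key obtained earlier.
    verifier = Ed25519PublicKey.from_public_bytes(public_key_bytes)
    try:
        verifier.verify(signature, iso_bytes)
        print("signature OK")
    except InvalidSignature:
        print("ISO or signature was tampered with")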






  1. Finding duplicate files. By MD5-summing all files in a directory structure it's easy to find identical hashes. The seemingly identical files can then be compared in full to check if they are really identical. Using SHA512 would make the process slower, and since we compare files in full anyway there is no risk in a potential false positive from MD5.




When I put my benign software package and my erotic sushi package, which collide under MD5, in your directory, your duplicate-detection script will initially think they are duplicates. In this case, you absolutely must compare the files in full. But there are much better ways to do this!




  • If you use SHA-512, you can safely skip the comparison step. Same if you use BLAKE2b, which can be even faster than MD5.


  • You could even use MD5 safely for this if you use it as HMAC-MD5 under a uniform random key, and safely skip the comparison step. HMAC-MD5 does not seem to be broken as a pseudorandom function family—so it's probably fine for security, up to the birthday bound, but there are better, faster PRFs like keyed BLAKE2 that won't raise any auditors' eyebrows (see the sketch after this list).


  • Even better, you can choose a random key and hash the files with a universal hash under the key, like Poly1305. This is many times faster than MD5 or BLAKE2b, and the probability of a collision between any two files is less than $1/2^{100}$, so the probability of collision among $n$ files is less than $\binom{n}{2} 2^{-100}$ and thus you can still safely skip the comparison step until you have quadrillions of files.


  • You could also just use a cheap checksum like a CRC with a fixed polynomial. This will be the fastest of the options—far and away faster than MD5—but unlike the previous options you still absolutely must compare the files in full.
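A minimal sketch of the keyed-PRF idea from the second bullet, using keyed BLAKE2b from Python's standard library (the key size and chunking are illustrative assumptions):

    import hashlib
    import secrets

    # One uniform random key per deduplication run. An adversary who cannot see
    # the key cannot prepare files that collide under it.
    RUN_KEY = secrets.token_bytes(32)

    def keyed_digest(path, chunk=1 << 20):
        # Keyed BLAKE2b acts as a PRF; hmac.new(RUN_KEY, digestmod="md5") would
        # also work, as noted above, but keyed BLAKE2 is faster.
        h = hashlib.blake2b(key=RUN_KEY, digest_size=32)
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(chunk), b""):
                h.update(block)
        return h.digest()

Equal digests under the secret key can then be treated as true duplicates without a final byte-for-byte pass.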



So, is MD5 safe for finding candidate duplicates to verify, if you subsequently compare the files bit by bit in full? Yes. So is the constant zero function.




(In a way, this would be creating a rainbow table where all the files are the dictionary)




This is not a rainbow table. A rainbow table is a specific technique for precomputing a random walk over a space of, say, passwords, via, say, MD5 hashes, in a way that saves effort trying to find MD5 preimages for hashes that aren't necessarily in your table in the first place, or doing it in parallel to speed up a multi-target search. It is not simply a list of precomputed hashes on a dictionary of inputs.



(The blog post by tptacek that you cited, and the blog post by Jeff Atwood that it was a response to, are both confused about what rainbow tables are.)




When the password scheme article states that "MD5 is fast", it clearly refers to the problem that hashing MD5 is too cheap when it comes to hashing a large amount of passwords to find the reverse of a hash. But what does it mean when it says that "[MD5 is] too slow to use as a general purpose hash"? Are there faster standardized hashes to compare files, that still have a reasonably low chance of collision?




I don't know what tptacek meant—you could email and ask—but if I had to guess, I would guess this meant it's awfully slow for things like hash tables, where you would truncate MD5 to a few bits to determine an index into an array of buckets or an open-addressing array.
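For illustration, this is roughly what that truncation looks like next to what a hash table would normally do (the table size here is an arbitrary assumption):

    import hashlib

    NUM_BUCKETS = 2 ** 16  # arbitrary illustrative table size

    def md5_bucket(key: bytes) -> int:
        # Bucket index from a few bits of the MD5 digest: works, but needlessly slow.
        digest = hashlib.md5(key).digest()
        return int.from_bytes(digest[:8], "little") % NUM_BUCKETS

    def fast_bucket(key: bytes) -> int:
        # What hash tables typically use instead: a fast non-cryptographic hash.
        return hash(key) % NUM_BUCKETS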



















  • Why does selecting a different hash algorithm eliminate the risk of collisions (sushi bullet)? – chrylis, 11 hours ago












  • @chrylis Nobody has ever published any way to find SHA-512 or BLAKE2b collisions, nor even, say, SHA-256 collisions. – Squeamish Ossifrage, 9 hours ago












  • What exactly is the difference between a hash function and a cheap checksum? Are they just functions designed with different trade-offs in mind? – Alexander, 9 hours ago










  • @Alexander ‘Hash function’ means many things, and is usually some approximation to a uniform random choice of function in some context (random oracle model, pseudorandom function family, pseudorandom permutation family, etc.). A ‘checksum’ is used for some error-detecting capability; e.g., a well-designed 32-bit CRC is guaranteed to detect any 1-bit errors and usually guarantees detecting some larger number of bit errors in certain data word sizes, while a 32-bit truncation of SHA-256 might fail to detect some 1-bit errors. The terms are generally used quite loosely, however. – Squeamish Ossifrage, 9 hours ago






  • @SqueamishOssifrage "CRC is guaranteed to detect any 1-bit errors and usually guarantees detecting some larger number of bit errors in certain data word sizes, while a 32-bit truncation of SHA-256 might fail to detect some 1-bit errors." Woah. That's really powerful, and really cool. I didn't know that. I'll read more into it! – Alexander, 9 hours ago



















Answer (6 votes)


But what does it mean when it says that "[MD5 is] too slow to use as a general purpose hash"? Are there faster standardized hashes to compare files, that still have a reasonably low chance of collision?




BLAKE2 is faster than MD5 and, when truncated to the same output size as MD5, is currently known to provide 64-bit collision resistance (compared with roughly 30 bits for MD5).
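For example, with Python's hashlib (assuming a 16-byte digest to match MD5's output length):

    import hashlib

    data = b"example payload"

    md5_hex = hashlib.md5(data).hexdigest()                         # 128-bit MD5
    blake2_hex = hashlib.blake2b(data, digest_size=16).hexdigest()  # BLAKE2b truncated to 128 bits
    print(md5_hex, blake2_hex)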



























Answer (3 votes)

    There's not a compelling reason to use MD5; however, there are some embedded systems with an MD5 core that was used as a stream verifier. In those systems, MD5 is still used. They are moving to BLAKE2 because it's smaller in silicon, and it has the benefit of being faster than MD5 in general.



    The reason that MD5 started to fall out of favor with hardware people was that the word reordering of the MD5 message expansion seems simple, but it actually requires a lot of circuitry for demultiplexing and interconnect, and the hardware efficiency is greatly degraded compared to BLAKE. In contrast, the message expansion blocks for the BLAKE algorithms can be efficiently implemented as simple feedback shift registers.



    The BLAKE team did a nice job of making it work well on silicon and in instructions.



















    • I did not know about BLAKE, this seems interesting. I assumed from the post that there would be some hashing system that would not be cryptographically secure in order to be faster than MD5, but it seems BLAKE has managed to be the best of both worlds. I’m considering this answer as the accepted one, but I’ll wait a few days while there is activity around this question. – jornane, 7 hours ago










    • @jornane Squeamish Ossifrage has a better answer. I just wanted to mention the hardware side of things. – b degnan, 4 hours ago



















Answer (2 votes)

    MD5 is currently used throughout the world both at home and in the enterprise. It's the file change mechanism within *nix's rsync if you opt for something other than changed timestamp detection. It's used for backup, archiving and file transfer between in-house systems. Even between enterprises over VPNs.



    Your comment that it "should not be used for integrity checking of documents" is interesting, as that's kinda what is done when transferring files (aka documents). A hacked file/document is philosophically a changed file/document. Yet if the adversary takes advantage of MD5's low collision resistance, file/document propagation through systems can be stopped. Carefully made changes can therefore go unnoticed by rsync, and attacks can occur. I expect that this is somewhat of a niche attack but illustrates the concept vis-à-vis MD5 being at the heart of many computer systems.



    In rsync's case, swapping out MD5 for something faster would only produce a marginal overall speed improvement given storage and networking overheads. It would certainly be less than the simple ratio of hash rates suggests.



















    • I think librsync actually uses BLAKE2 now. – forest, 3 hours ago










    • @forest Actually, how many bits is the output of BLAKE2? My version logs 128 bit hashes... – Paul Uszak, 3 hours ago










    • BLAKE2b is 512, BLAKE2s is 256. It can be truncated though, of course. – forest, 2 hours ago










    • @forest Well you sound convincing, though man pages say MD5 and the hash is 32 hex characters. What would be the reason for truncation? – Paul Uszak, 2 hours ago










    • Truncation can be done to retain compatibility with the protocol. If the protocol is designed for a 128-bit hash, then it's simpler to truncate a larger hash than to change the protocol (possibly adding more overhead to something designed to minimize overhead). I'm not sure if it uses BLAKE2 the same way it used MD5, but I do know that it was "replacing MD5 with BLAKE2". The code has been added to librsync. – forest, 2 hours ago





















Answer (1 vote)

    A case where the use of the MD5 hash would still make sense (with low risk of deleting duplicated files):



    If you want to find duplicate files you can just use CRC32.



    As soon as two files return the same CRC32 hash, you recompute those files with an MD5 hash. If the MD5 hash is again identical for both files, then you know that the files are duplicates.





    In a case where deleting files carries high risk:

    You want the process to be fast: for the second hash of the files, use instead a hash function that is not vulnerable to collisions, e.g. SHA-2 or SHA-3. It's extremely unlikely that these hashes would return an identical hash for different files.

    Speed is no concern: compare the files byte by byte.
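    A minimal sketch of this two-pass scheme (illustrative chunk size; whether a final byte-by-byte check follows is up to the caller, per the trade-off above):

        import hashlib
        import zlib
        from collections import defaultdict

        def crc32_of(path, chunk=1 << 20):
            # Cheap first pass over every file.
            crc = 0
            with open(path, "rb") as f:
                for block in iter(lambda: f.read(chunk), b""):
                    crc = zlib.crc32(block, crc)
            return crc

        def md5_of(path, chunk=1 << 20):
            # Second pass, run only on files whose CRC32 already matched.
            h = hashlib.md5()
            with open(path, "rb") as f:
                for block in iter(lambda: f.read(chunk), b""):
                    h.update(block)
            return h.hexdigest()

        def duplicates(paths):
            # CRC32 groups first, then MD5 within each CRC32 group.
            by_crc = defaultdict(list)
            for p in paths:
                by_crc[crc32_of(p)].append(p)
            groups = []
            for crc_group in by_crc.values():
                if len(crc_group) < 2:
                    continue
                by_md5 = defaultdict(list)
                for p in crc_group:
                    by_md5[md5_of(p)].append(p)
                groups.extend(g for g in by_md5.values() if len(g) > 1)
            return groups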















    • Why use a second step after CRC32 at all? Compare the files byte-by-byte if you're going to read them again completely anyhow! – Ruben De Smet, 19 hours ago






    • @RubenDeSmet I think it's because to compare them byte-by-byte you'd have to buffer both files to a certain limit (because of memory constraints) and compare those. This will slow down sequential read speeds because you need to jump between the files. Whether this actually makes any real-world difference, provided a large enough buffer size, is beyond my knowledge. – JensV, 15 hours ago






    • @JensV I am pretty sure that the speed difference between a byte-by-byte comparison and a SHA3 comparison (with reasonable buffer sizes) will be trivial. It might even favour the byte-by-byte comparison. – Martin Bonner, 15 hours ago






    • Comparing the files byte-by-byte requires communication. Computing a hash can be done locally. If the connection is slow compared to the hard drive speed, computing another hash after CRC32 might still be a reasonable option before comparing byte-by-byte. – JiK, 14 hours ago






    • @RubenDeSmet: I would accept your assertion for CRC64 but not CRC32. I would have posted a similar answer as this one where we found CRC32 to be inadequate due to too high of a hash collision rate. – Joshua, 13 hours ago












    Your Answer








    StackExchange.ready(function() {
    var channelOptions = {
    tags: "".split(" "),
    id: "281"
    };
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function() {
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled) {
    StackExchange.using("snippets", function() {
    createEditor();
    });
    }
    else {
    createEditor();
    }
    });

    function createEditor() {
    StackExchange.prepareEditor({
    heartbeatType: 'answer',
    autoActivateHeartbeat: false,
    convertImagesToLinks: false,
    noModals: true,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: null,
    bindNavPrevention: true,
    postfix: "",
    imageUploader: {
    brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
    contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
    allowUrls: true
    },
    noCode: true, onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    });


    }
    });














    draft saved

    draft discarded


















    StackExchange.ready(
    function () {
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcrypto.stackexchange.com%2fquestions%2f70036%2fis-there-really-no-use-for-md5-anymore%23new-answer', 'question_page');
    }
    );

    Post as a guest















    Required, but never shown

























    5 Answers
    5






    active

    oldest

    votes








    5 Answers
    5






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes









    16












    $begingroup$


    I know that MD5 should not be used for password hashing, and that it also should not be used for integrity checking of documents. There are way too many sources citing MD5 preimaging attacks and MD5s low computation time.




    There is no published preimage attack on MD5 that is cheaper than a generic attack on any 128-bit hash function. But you shouldn't rely on that alone when making security decisions, because cryptography is tricky and adversaries are clever and resourceful and can find ways around it!





    1. Identifying malicious files, such as when Linux Mint's download servers were compromised and an ISO file was replaced by a malicious one; in this case you want to be sure that your file doesn't match; collision attacks aren't a vector here.




    The question of whether to publish known-good vs. known-bad hashes after a compromise is addressed elsewhere—in brief, there's not much that publishing known-bad hashes accomplishes, and according to the citation, Linux Mint published known-good, not known-bad, hashes. So what security do you get from known-good MD5 hashes?



    There are two issues here:





    1. If you got the MD5 hash from the same source as the ISO image, there's nothing that would prevent an adversary from replacing both the MD5 hash and the ISO image.



      To prevent this, you and the Linux Mint curators need two channels: one for the hashes which can't be compromised (but need only have very low bandwidth), and another for the ISO image (which needs high bandwidth) on which you can then use the MD5 hash in an attempt to detect compromise.



      There's another way to prevent this: Instead of using the uncompromised channel for the hash of every ISO image over and over again as time goes on—which means more and more opportunities for an attacker to subvert it—use it once initially for a public key, which is then used to sign the ISO images; then there's only one opportunity for an attacker to subvert the public key channel.




    2. Collision attacks may still be a vector in cases like this. Consider the following scenario:




      • I am an evil developer. I write two software packages, whose distributions collide under MD5. One of the packages is benign and will survive review and audit. The other one will surreptitiously replace your family photo album by erotic photographs of sushi.

      • The Linux Mint curators carefully scrutinize and audit everything they publish in their package repository and publish the MD5 hashes of what they have audited in a public place that I can't compromise.

      • The Linux Mint curators cavalierly administer the package distributions in their package repository, under the false impression that the published MD5 hashes will protect users.


      In this scenario, I can replace the benign package by the erotic sushi package, pass the MD5 verification with flying colors, and give you a nasty—and luscious—surprise when you try to look up photos of that old hiking trip you took your kids on.






    1. Finding duplicate files. By MD5-summing all files in a directory structure it's easy to find identical hashes. The seemingly identical files can then be compared in full to check if they are really identical. Using SHA512 would make the process slower, and since we compare files in full anyway there is no risk in a potential false positive from MD5.




    When I put my benign software package and my erotic sushi package, which collide under MD5, in your directory, your duplicate-detection script will initially think they are duplicates. In this case, you absolutely must compare the files in full. But there are much better ways to do this!




    • If you use SHA-512, you can safely skip the comparison step. Same if you use BLAKE2b, which can be even faster than MD5.


    • You could even use MD5 safely for this if you use it as HMAC-MD5 under a uniform random key, and safely skip the comparison step. HMAC-MD5 does not seem to be broken, as a pseudorandom function family—so it's probably fine for security, up to the birthday bound, but there are better faster PRFs like keyed BLAKE2 that won't raise any auditors' eyebrows.


    • Even better, you can choose a random key and hash the files with a universal hash under the key, like Poly1305. This is many times faster than MD5 or BLAKE2b, and the probability of a collision between any two files is less than $1/2^{100}$, so the probability of collision among $n$ files is less than $binom n 2 2^{-100}$ and thus you can still safely skip the comparison step until you have quadrillions of files.


    • You could also just use a cheap checksum like a CRC with a fixed polynomial. This will be the fastest of the options—far and away faster than MD5—but unlike the previous options you still absolutely must compare the files in full.



    So, is MD5 safe for finding candidate duplicates to verify, if you subsequently compare the files bit by bit in full? Yes. So is the constant zero function.




    (In a way, this would be creating a rainbow table where all the files are the dictionary)




    This is not a rainbow table. A rainbow table is a specific technique for precomputing a random walk over a space of, say, passwords, via, say, MD5 hashes, in a way that saves effort trying to find MD5 preimages for hashes that aren't necessarily in your table in the first place, or doing it in parallel to speed up a multi-target search. It is not simply a list of precomputed hashes on a dictionary of inputs.



    (The blog post by tptacek that you cited, and the blog post by Jeff Atwood that it was a response to, are both confused about what rainbow tables are.)




    When the password scheme article states that "MD5 is fast", it clearly refers to the problem that hashing MD5 is too cheap when it comes to hashing a large amount of passwords to find the reverse of a hash. But what does it mean when it says that "[MD5 is] too slow to use as a general purpose hash"? Are there faster standardized hashes to compare files, that still have a reasonably low chance of collision?




    I don't know what tptacek meant—you could email and ask—but if I had to guess, I would guess this meant it's awfully slow for things like hash tables, where you would truncate MD5 to a few bits to determine an index into an array of buckets or an open-addressing array.






    share|improve this answer











    $endgroup$













    • $begingroup$
      Why does selecting a different hash algorithm eliminate the risk of collisions (sushi bullet)?
      $endgroup$
      – chrylis
      11 hours ago












    • $begingroup$
      @chrylis Nobody has ever published any way to find SHA-512 or BLAKE2b collisions, nor even, say, SHA-256 collisions.
      $endgroup$
      – Squeamish Ossifrage
      9 hours ago












    • $begingroup$
      What exactly is the difference between a hash function and a cheap checksum? Are just functions design with different trade offs in mind?
      $endgroup$
      – Alexander
      9 hours ago










    • $begingroup$
      @Alexander ‘Hash function’ means many things, and is usually some approximation to a uniform random choice of function in some context (random oracle model, pseudorandom function family, pseudorandom permutation family, etc.). A ‘checksum’ is used for some error-detecting capability; e.g., a well-designed 32-bit CRC is guaranteed to detect any 1-bit errors and usually guarantees detecting some larger number of bit errors in certain data word sizes, while a 32-bit truncation of SHA-256 might fail to detect some 1-bit errors. The terms are generally used quite loosely, however.
      $endgroup$
      – Squeamish Ossifrage
      9 hours ago






    • 1




      $begingroup$
      @SqueamishOssifrage "CRC is guaranteed to detect any 1-bit errors and usually guarantees detecting some larger number of bit errors in certain data word sizes, while a 32-bit truncation of SHA-256 might fail to detect some 1-bit errors. " woah. That's really powerful, and really cool. I didn't know that. I'll read more into it!
      $endgroup$
      – Alexander
      9 hours ago
















    16












    $begingroup$


    I know that MD5 should not be used for password hashing, and that it also should not be used for integrity checking of documents. There are way too many sources citing MD5 preimaging attacks and MD5s low computation time.




    There is no published preimage attack on MD5 that is cheaper than a generic attack on any 128-bit hash function. But you shouldn't rely on that alone when making security decisions, because cryptography is tricky and adversaries are clever and resourceful and can find ways around it!





    1. Identifying malicious files, such as when Linux Mint's download servers were compromised and an ISO file was replaced by a malicious one; in this case you want to be sure that your file doesn't match; collision attacks aren't a vector here.




    The question of whether to publish known-good vs. known-bad hashes after a compromise is addressed elsewhere—in brief, there's not much that publishing known-bad hashes accomplishes, and according to the citation, Linux Mint published known-good, not known-bad, hashes. So what security do you get from known-good MD5 hashes?



    There are two issues here:





    1. If you got the MD5 hash from the same source as the ISO image, there's nothing that would prevent an adversary from replacing both the MD5 hash and the ISO image.



      To prevent this, you and the Linux Mint curators need two channels: one for the hashes which can't be compromised (but need only have very low bandwidth), and another for the ISO image (which needs high bandwidth) on which you can then use the MD5 hash in an attempt to detect compromise.



      There's another way to prevent this: Instead of using the uncompromised channel for the hash of every ISO image over and over again as time goes on—which means more and more opportunities for an attacker to subvert it—use it once initially for a public key, which is then used to sign the ISO images; then there's only one opportunity for an attacker to subvert the public key channel.




    2. Collision attacks may still be a vector in cases like this. Consider the following scenario:




      • I am an evil developer. I write two software packages, whose distributions collide under MD5. One of the packages is benign and will survive review and audit. The other one will surreptitiously replace your family photo album by erotic photographs of sushi.

      • The Linux Mint curators carefully scrutinize and audit everything they publish in their package repository and publish the MD5 hashes of what they have audited in a public place that I can't compromise.

      • The Linux Mint curators cavalierly administer the package distributions in their package repository, under the false impression that the published MD5 hashes will protect users.


      In this scenario, I can replace the benign package by the erotic sushi package, pass the MD5 verification with flying colors, and give you a nasty—and luscious—surprise when you try to look up photos of that old hiking trip you took your kids on.






    1. Finding duplicate files. By MD5-summing all files in a directory structure it's easy to find identical hashes. The seemingly identical files can then be compared in full to check if they are really identical. Using SHA512 would make the process slower, and since we compare files in full anyway there is no risk in a potential false positive from MD5.




    When I put my benign software package and my erotic sushi package, which collide under MD5, in your directory, your duplicate-detection script will initially think they are duplicates. In this case, you absolutely must compare the files in full. But there are much better ways to do this!




    • If you use SHA-512, you can safely skip the comparison step. Same if you use BLAKE2b, which can be even faster than MD5.


    • You could even use MD5 safely for this if you use it as HMAC-MD5 under a uniform random key, and safely skip the comparison step. HMAC-MD5 does not seem to be broken, as a pseudorandom function family—so it's probably fine for security, up to the birthday bound, but there are better faster PRFs like keyed BLAKE2 that won't raise any auditors' eyebrows.


    • Even better, you can choose a random key and hash the files with a universal hash under the key, like Poly1305. This is many times faster than MD5 or BLAKE2b, and the probability of a collision between any two files is less than $1/2^{100}$, so the probability of collision among $n$ files is less than $binom n 2 2^{-100}$ and thus you can still safely skip the comparison step until you have quadrillions of files.


    • You could also just use a cheap checksum like a CRC with a fixed polynomial. This will be the fastest of the options—far and away faster than MD5—but unlike the previous options you still absolutely must compare the files in full.



    So, is MD5 safe for finding candidate duplicates to verify, if you subsequently compare the files bit by bit in full? Yes. So is the constant zero function.




    (In a way, this would be creating a rainbow table where all the files are the dictionary)




    This is not a rainbow table. A rainbow table is a specific technique for precomputing a random walk over a space of, say, passwords, via, say, MD5 hashes, in a way that saves effort trying to find MD5 preimages for hashes that aren't necessarily in your table in the first place, or doing it in parallel to speed up a multi-target search. It is not simply a list of precomputed hashes on a dictionary of inputs.



    (The blog post by tptacek that you cited, and the blog post by Jeff Atwood that it was a response to, are both confused about what rainbow tables are.)




    When the password scheme article states that "MD5 is fast", it clearly refers to the problem that hashing MD5 is too cheap when it comes to hashing a large amount of passwords to find the reverse of a hash. But what does it mean when it says that "[MD5 is] too slow to use as a general purpose hash"? Are there faster standardized hashes to compare files, that still have a reasonably low chance of collision?




    I don't know what tptacek meant—you could email and ask—but if I had to guess, I would guess this meant it's awfully slow for things like hash tables, where you would truncate MD5 to a few bits to determine an index into an array of buckets or an open-addressing array.






    share|improve this answer











    $endgroup$













    • $begingroup$
      Why does selecting a different hash algorithm eliminate the risk of collisions (sushi bullet)?
      $endgroup$
      – chrylis
      11 hours ago












    • $begingroup$
      @chrylis Nobody has ever published any way to find SHA-512 or BLAKE2b collisions, nor even, say, SHA-256 collisions.
      $endgroup$
      – Squeamish Ossifrage
      9 hours ago












    • $begingroup$
      What exactly is the difference between a hash function and a cheap checksum? Are just functions design with different trade offs in mind?
      $endgroup$
      – Alexander
      9 hours ago










    • $begingroup$
      @Alexander ‘Hash function’ means many things, and is usually some approximation to a uniform random choice of function in some context (random oracle model, pseudorandom function family, pseudorandom permutation family, etc.). A ‘checksum’ is used for some error-detecting capability; e.g., a well-designed 32-bit CRC is guaranteed to detect any 1-bit errors and usually guarantees detecting some larger number of bit errors in certain data word sizes, while a 32-bit truncation of SHA-256 might fail to detect some 1-bit errors. The terms are generally used quite loosely, however.
      $endgroup$
      – Squeamish Ossifrage
      9 hours ago






    • 1




      $begingroup$
      @SqueamishOssifrage "CRC is guaranteed to detect any 1-bit errors and usually guarantees detecting some larger number of bit errors in certain data word sizes, while a 32-bit truncation of SHA-256 might fail to detect some 1-bit errors. " woah. That's really powerful, and really cool. I didn't know that. I'll read more into it!
      $endgroup$
      – Alexander
      9 hours ago














    16












    16








    16





    $begingroup$


    I know that MD5 should not be used for password hashing, and that it also should not be used for integrity checking of documents. There are way too many sources citing MD5 preimaging attacks and MD5s low computation time.




    There is no published preimage attack on MD5 that is cheaper than a generic attack on any 128-bit hash function. But you shouldn't rely on that alone when making security decisions, because cryptography is tricky and adversaries are clever and resourceful and can find ways around it!





    1. Identifying malicious files, such as when Linux Mint's download servers were compromised and an ISO file was replaced by a malicious one; in this case you want to be sure that your file doesn't match; collision attacks aren't a vector here.




    The question of whether to publish known-good vs. known-bad hashes after a compromise is addressed elsewhere—in brief, there's not much that publishing known-bad hashes accomplishes, and according to the citation, Linux Mint published known-good, not known-bad, hashes. So what security do you get from known-good MD5 hashes?



    There are two issues here:





    1. If you got the MD5 hash from the same source as the ISO image, there's nothing that would prevent an adversary from replacing both the MD5 hash and the ISO image.



      To prevent this, you and the Linux Mint curators need two channels: one for the hashes which can't be compromised (but need only have very low bandwidth), and another for the ISO image (which needs high bandwidth) on which you can then use the MD5 hash in an attempt to detect compromise.



      There's another way to prevent this: Instead of using the uncompromised channel for the hash of every ISO image over and over again as time goes on—which means more and more opportunities for an attacker to subvert it—use it once initially for a public key, which is then used to sign the ISO images; then there's only one opportunity for an attacker to subvert the public key channel.




    2. Collision attacks may still be a vector in cases like this. Consider the following scenario:




      • I am an evil developer. I write two software packages, whose distributions collide under MD5. One of the packages is benign and will survive review and audit. The other one will surreptitiously replace your family photo album by erotic photographs of sushi.

      • The Linux Mint curators carefully scrutinize and audit everything they publish in their package repository and publish the MD5 hashes of what they have audited in a public place that I can't compromise.

      • The Linux Mint curators cavalierly administer the package distributions in their package repository, under the false impression that the published MD5 hashes will protect users.


      In this scenario, I can replace the benign package by the erotic sushi package, pass the MD5 verification with flying colors, and give you a nasty—and luscious—surprise when you try to look up photos of that old hiking trip you took your kids on.






    1. Finding duplicate files. By MD5-summing all files in a directory structure it's easy to find identical hashes. The seemingly identical files can then be compared in full to check if they are really identical. Using SHA512 would make the process slower, and since we compare files in full anyway there is no risk in a potential false positive from MD5.




    When I put my benign software package and my erotic sushi package, which collide under MD5, in your directory, your duplicate-detection script will initially think they are duplicates. In this case, you absolutely must compare the files in full. But there are much better ways to do this!




    • If you use SHA-512, you can safely skip the comparison step. Same if you use BLAKE2b, which can be even faster than MD5.


    • You could even use MD5 safely for this if you use it as HMAC-MD5 under a uniform random key, and safely skip the comparison step. HMAC-MD5 does not seem to be broken, as a pseudorandom function family—so it's probably fine for security, up to the birthday bound, but there are better faster PRFs like keyed BLAKE2 that won't raise any auditors' eyebrows.


    • Even better, you can choose a random key and hash the files with a universal hash under the key, like Poly1305. This is many times faster than MD5 or BLAKE2b, and the probability of a collision between any two files is less than $1/2^{100}$, so the probability of collision among $n$ files is less than $binom n 2 2^{-100}$ and thus you can still safely skip the comparison step until you have quadrillions of files.


    • You could also just use a cheap checksum like a CRC with a fixed polynomial. This will be the fastest of the options—far and away faster than MD5—but unlike the previous options you still absolutely must compare the files in full.



    So, is MD5 safe for finding candidate duplicates to verify, if you subsequently compare the files bit by bit in full? Yes. So is the constant zero function.




    (In a way, this would be creating a rainbow table where all the files are the dictionary)




    This is not a rainbow table. A rainbow table is a specific technique for precomputing a random walk over a space of, say, passwords, via, say, MD5 hashes, in a way that saves effort trying to find MD5 preimages for hashes that aren't necessarily in your table in the first place, or doing it in parallel to speed up a multi-target search. It is not simply a list of precomputed hashes on a dictionary of inputs.



    (The blog post by tptacek that you cited, and the blog post by Jeff Atwood that it was a response to, are both confused about what rainbow tables are.)




    When the password scheme article states that "MD5 is fast", it clearly refers to the problem that hashing MD5 is too cheap when it comes to hashing a large amount of passwords to find the reverse of a hash. But what does it mean when it says that "[MD5 is] too slow to use as a general purpose hash"? Are there faster standardized hashes to compare files, that still have a reasonably low chance of collision?




    I don't know what tptacek meant—you could email and ask—but if I had to guess, I would guess this meant it's awfully slow for things like hash tables, where you would truncate MD5 to a few bits to determine an index into an array of buckets or an open-addressing array.






    share|improve this answer











    $endgroup$




    I know that MD5 should not be used for password hashing, and that it also should not be used for integrity checking of documents. There are way too many sources citing MD5 preimaging attacks and MD5s low computation time.




    There is no published preimage attack on MD5 that is cheaper than a generic attack on any 128-bit hash function. But you shouldn't rely on that alone when making security decisions, because cryptography is tricky and adversaries are clever and resourceful and can find ways around it!





    1. Identifying malicious files, such as when Linux Mint's download servers were compromised and an ISO file was replaced by a malicious one; in this case you want to be sure that your file doesn't match; collision attacks aren't a vector here.




    The question of whether to publish known-good vs. known-bad hashes after a compromise is addressed elsewhere—in brief, there's not much that publishing known-bad hashes accomplishes, and according to the citation, Linux Mint published known-good, not known-bad, hashes. So what security do you get from known-good MD5 hashes?



    There are two issues here:





    1. If you got the MD5 hash from the same source as the ISO image, there's nothing that would prevent an adversary from replacing both the MD5 hash and the ISO image.



      To prevent this, you and the Linux Mint curators need two channels: one for the hashes which can't be compromised (but need only have very low bandwidth), and another for the ISO image (which needs high bandwidth) on which you can then use the MD5 hash in an attempt to detect compromise.



      There's another way to prevent this: Instead of using the uncompromised channel for the hash of every ISO image over and over again as time goes on—which means more and more opportunities for an attacker to subvert it—use it once initially for a public key, which is then used to sign the ISO images; then there's only one opportunity for an attacker to subvert the public key channel.




    2. Collision attacks may still be a vector in cases like this. Consider the following scenario:




      • I am an evil developer. I write two software packages, whose distributions collide under MD5. One of the packages is benign and will survive review and audit. The other one will surreptitiously replace your family photo album by erotic photographs of sushi.

      • The Linux Mint curators carefully scrutinize and audit everything they publish in their package repository and publish the MD5 hashes of what they have audited in a public place that I can't compromise.

      • The Linux Mint curators cavalierly administer the package distributions in their package repository, under the false impression that the published MD5 hashes will protect users.


      In this scenario, I can replace the benign package by the erotic sushi package, pass the MD5 verification with flying colors, and give you a nasty—and luscious—surprise when you try to look up photos of that old hiking trip you took your kids on.






    1. Finding duplicate files. By MD5-summing all files in a directory structure it's easy to find identical hashes. The seemingly identical files can then be compared in full to check if they are really identical. Using SHA512 would make the process slower, and since we compare files in full anyway there is no risk in a potential false positive from MD5.




    When I put my benign software package and my erotic sushi package, which collide under MD5, in your directory, your duplicate-detection script will initially think they are duplicates. In this case, you absolutely must compare the files in full. But there are much better ways to do this!




    • If you use SHA-512, you can safely skip the comparison step. Same if you use BLAKE2b, which can be even faster than MD5.


    • You could even use MD5 safely for this if you use it as HMAC-MD5 under a uniform random key, and safely skip the comparison step. HMAC-MD5 does not seem to be broken as a pseudorandom function family, so it's probably fine for security up to the birthday bound, but there are better, faster PRFs like keyed BLAKE2 that won't raise any auditors' eyebrows (see the sketch after this list).


    • Even better, you can choose a random key and hash the files with a universal hash under the key, like Poly1305. This is many times faster than MD5 or BLAKE2b, and the probability of a collision between any two files is less than $1/2^{100}$, so the probability of collision among $n$ files is less than $\binom{n}{2} 2^{-100}$ and thus you can still safely skip the comparison step until you have quadrillions of files.


    • You could also just use a cheap checksum like a CRC with a fixed polynomial. This will be the fastest of the options—far and away faster than MD5—but unlike the previous options you still absolutely must compare the files in full.
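As a minimal sketch of the keyed-hash approach from the list above (my own illustration, assuming Python's hashlib; hashlib.blake2b accepts a key directly, so no separate HMAC construction is needed): because the key is freshly random for each run, matching digests can be treated as real duplicates, up to the birthday bound, even if an adversary controls the file contents.

    import hashlib
    import os
    from collections import defaultdict

    def find_duplicates(paths, chunk_size=1 << 20):
        key = os.urandom(32)               # fresh uniform random key for this run only
        groups = defaultdict(list)
        for path in paths:
            h = hashlib.blake2b(key=key)   # keyed BLAKE2b behaves as a PRF under the key
            with open(path, "rb") as f:
                for chunk in iter(lambda: f.read(chunk_size), b""):
                    h.update(chunk)
            groups[h.digest()].append(path)
        # Files sharing a digest can be treated as duplicates without a
        # byte-by-byte comparison.
        return [group for group in groups.values() if len(group) > 1]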



    So, is MD5 safe for finding candidate duplicates to verify, if you subsequently compare the files bit by bit in full? Yes. So is the constant zero function.




    (In a way, this would be creating a rainbow table where all the files are the dictionary)




    This is not a rainbow table. A rainbow table is a specific technique for precomputing a random walk over a space of, say, passwords, via, say, MD5 hashes, in a way that saves effort trying to find MD5 preimages for hashes that aren't necessarily in your table in the first place, or doing it in parallel to speed up a multi-target search. It is not simply a list of precomputed hashes on a dictionary of inputs.



    (The blog post by tptacek that you cited, and the blog post by Jeff Atwood that it was a response to, are both confused about what rainbow tables are.)




    When the password scheme article states that "MD5 is fast", it clearly refers to the problem that hashing MD5 is too cheap when it comes to hashing a large amount of passwords to find the reverse of a hash. But what does it mean when it says that "[MD5 is] too slow to use as a general purpose hash"? Are there faster standardized hashes to compare files, that still have a reasonably low chance of collision?




    I don't know what tptacek meant—you could email and ask—but if I had to guess, I would guess this meant it's awfully slow for things like hash tables, where you would truncate MD5 to a few bits to determine an index into an array of buckets or an open-addressing array.
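For illustration only (my sketch, not tptacek's), using a hash this way means truncating the digest down to a bucket index, at which point MD5's cost buys nothing; the table size below is an arbitrary assumption.

    import hashlib

    NUM_BUCKETS = 1024  # illustrative table size

    def bucket_index(key: bytes) -> int:
        # Truncate the 128-bit MD5 digest to 64 bits, then reduce modulo the
        # table size. A real hash table wants something far cheaper here.
        digest = hashlib.md5(key).digest()
        return int.from_bytes(digest[:8], "little") % NUM_BUCKETS

    # bucket_index(b"some key") -> an index in range(NUM_BUCKETS)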







answered 15 hours ago, edited 4 hours ago – Squeamish Ossifrage












• Why does selecting a different hash algorithm eliminate the risk of collisions (sushi bullet)? – chrylis, 11 hours ago

• @chrylis Nobody has ever published any way to find SHA-512 or BLAKE2b collisions, nor even, say, SHA-256 collisions. – Squeamish Ossifrage, 9 hours ago

• What exactly is the difference between a hash function and a cheap checksum? Are they just functions designed with different trade-offs in mind? – Alexander, 9 hours ago

• @Alexander ‘Hash function’ means many things, and is usually some approximation to a uniform random choice of function in some context (random oracle model, pseudorandom function family, pseudorandom permutation family, etc.). A ‘checksum’ is used for some error-detecting capability; e.g., a well-designed 32-bit CRC is guaranteed to detect any 1-bit errors and usually guarantees detecting some larger number of bit errors in certain data word sizes, while a 32-bit truncation of SHA-256 might fail to detect some 1-bit errors. The terms are generally used quite loosely, however. – Squeamish Ossifrage, 9 hours ago

• @SqueamishOssifrage "CRC is guaranteed to detect any 1-bit errors and usually guarantees detecting some larger number of bit errors in certain data word sizes, while a 32-bit truncation of SHA-256 might fail to detect some 1-bit errors." Woah. That's really powerful, and really cool. I didn't know that. I'll read more into it! – Alexander, 9 hours ago


















    But what does it mean when it says that "[MD5 is] too slow to use as a general purpose hash"? Are there faster standardized hashes to compare files, that still have a reasonably low chance of collision?




    BLAKE2 is faster than MD5 and is currently known to provide 64-bit collision resistance when truncated to the same output size as MD5 (compared with roughly 30 bits of collision resistance for MD5).
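As a small illustration (my own sketch, using Python's hashlib rather than anything from this answer), BLAKE2b can be asked for a 16-byte digest directly, giving an MD5-sized output without manual truncation:

    import hashlib

    data = b"example input"

    md5_digest = hashlib.md5(data).digest()                         # 16 bytes
    blake2_digest = hashlib.blake2b(data, digest_size=16).digest()  # also 16 bytes

    print(md5_digest.hex())
    print(blake2_digest.hex())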






answered 18 hours ago – DannyNiu
























    There's not a compelling reason to use MD5; however, there are some embedded systems with an MD5 core that was used as a stream verifier. In those systems, MD5 is still used. They are moving to BLAKE2 because it's smaller in silicon, and it has the benefit of being faster than MD5 in general.



    The reason that MD5 started to fall out of favor with hardware people is that the word reordering in the MD5 message expansion looks simple but actually requires a lot of circuitry for demultiplexing and interconnect, so the hardware efficiency is greatly degraded compared to BLAKE. In contrast, the message expansion blocks of the BLAKE algorithms can be implemented efficiently as simple feedback shift registers.



    The BLAKE team did a nice job of making it work well both in silicon and in software.






answered 16 hours ago – b degnan
• I did not know about BLAKE, this seems interesting. I had assumed from the post that any hashing system faster than MD5 would have to give up cryptographic security, but it seems BLAKE has managed to be the best of both worlds. I’m considering this answer as the accepted one, but I’ll wait a few days while there is activity around this question. – jornane, 7 hours ago

• @jornane Squeamish Ossifrage has a better answer. I just wanted to mention the hardware side of things. – b degnan, 4 hours ago




























            MD5 is currently used throughout the world both at home and in the enterprise. It's the file change mechanism within *nix's rsync if you opt for something other than changed timestamp detection. It's used for backup, archiving and file transfer between in-house systems. Even between enterprises over VPNs.



    Your comment that it "should not be used for integrity checking of documents" is interesting, as that's kinda what is done when transferring files (aka documents). A hacked file/document is philosophically a changed file/document. Yet if an adversary takes advantage of MD5's weak collision resistance, propagation of a file/document through those systems can be blocked: carefully made changes can go unnoticed by rsync, and attacks can occur. I expect that this is somewhat of a niche attack, but it illustrates the concept vis-à-vis MD5 being at the heart of many computer systems.



    In rsync's case, swapping out MD5 for something faster would only produce a marginal overall speed improvement, given storage and networking overheads. It would certainly be less than the simple ratio of hash rates suggests.
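As a toy illustration of digest-based change detection (my own sketch; this is not how rsync itself is implemented, and the state-file name is made up), a sync tool can remember the last digest it saw for each file and only re-transfer files whose digest changed. A crafted MD5 collision is exactly what would let a change slip past such a check.

    import hashlib
    import json

    STATE_FILE = "digests.json"   # hypothetical location of the remembered digests

    def md5_of_file(path, chunk_size=1 << 20):
        h = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    def changed_files(paths):
        # Load the digests recorded on the previous run (if any).
        try:
            with open(STATE_FILE) as f:
                old = json.load(f)
        except FileNotFoundError:
            old = {}
        new = {p: md5_of_file(p) for p in paths}
        with open(STATE_FILE, "w") as f:
            json.dump(new, f)
        # A file whose content changed but whose MD5 collides with the old
        # content would not show up here; that is the niche attack above.
        return [p for p in paths if old.get(p) != new[p]]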






answered 16 hours ago, edited 13 hours ago – Paul Uszak
• I think librsync actually uses BLAKE2 now. – forest, 3 hours ago

• @forest Actually, how many bits is the output of BLAKE2? My version logs 128 bit hashes... – Paul Uszak, 3 hours ago

• BLAKE2b is 512, BLAKE2s is 256. It can be truncated though, of course. – forest, 2 hours ago

• @forest Well you sound convincing, though man pages say MD5 and the hash is 32 hex characters. What would be the reason for truncation? – Paul Uszak, 2 hours ago

• Truncation can be done to retain compatibility with the protocol. If the protocol is designed for a 128-bit hash, then it's simpler to truncate a larger hash than to change the protocol (possibly adding more overhead to something designed to minimize overhead). I'm not sure if it uses BLAKE2 the same way it used MD5, but I do know that it was "replacing MD5 with BLAKE2". The code has been added to librsync. – forest, 2 hours ago




















    A case where the use of the MD5 hash would still make sense (with low risk of deleting duplicated files):



            If you want to find duplicate files you can just use CRC32.



    As soon as two files return the same CRC32 hash, you recompute both files with MD5. If the MD5 hash is again identical for both files, then you know that the files are duplicates.





    In a case where deleting files carries high risk:



    If you want the process to be fast: instead use a hash function that is not vulnerable to collisions for the second hash of the files, e.g. SHA-2 or SHA-3. It's extremely unlikely that these hashes would return an identical hash for different files.



    If speed is no concern: compare the files byte by byte.
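A rough sketch of this two-stage idea (my own illustration, using Python's zlib.crc32 and hashlib.md5; whether you trust the MD5 stage as final, or add a byte-by-byte check as the comments below suggest, depends on your threat model):

    import hashlib
    import zlib
    from collections import defaultdict

    def crc32_of_file(path, chunk_size=1 << 20):
        crc = 0
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                crc = zlib.crc32(chunk, crc)
        return crc

    def md5_of_file(path, chunk_size=1 << 20):
        h = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.digest()

    def find_duplicates_two_stage(paths):
        # Stage 1: cheap CRC32 pass to narrow down candidate groups.
        by_crc = defaultdict(list)
        for path in paths:
            by_crc[crc32_of_file(path)].append(path)

        # Stage 2: recompute with MD5 only where CRC32 values collided.
        duplicates = []
        for group in by_crc.values():
            if len(group) < 2:
                continue
            by_md5 = defaultdict(list)
            for path in group:
                by_md5[md5_of_file(path)].append(path)
            duplicates.extend(g for g in by_md5.values() if len(g) > 1)
        return duplicates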






answered 20 hours ago, edited 19 hours ago – AleksanderRas
• Why use a second step after CRC32 at all? Compare the files byte-by-byte if you're going to read them again completely anyhow! – Ruben De Smet, 19 hours ago

• @RubenDeSmet I think it's because to compare them byte-by-byte you'd have to buffer both files to a certain limit (because of memory constraints) and compare those. This will slow down sequential read speeds because you need to jump between the files. Whether this actually makes any real-world difference, provided a large enough buffer size, is beyond my knowledge. – JensV, 15 hours ago

• @JensV I am pretty sure that the speed difference between a byte-by-byte comparison and a SHA3 comparison (with reasonable buffer sizes) will be trivial. It might even favour the byte-by-byte comparison. – Martin Bonner, 15 hours ago

• Comparing the files byte-by-byte requires communication. Computing a hash can be done locally. If the connection is slow compared to the hard drive speed, computing another hash after CRC32 might still be a reasonable option before comparing byte-by-byte. – JiK, 14 hours ago

• @RubenDeSmet: I would accept your assertion for CRC64 but not CRC32. I would have posted a similar answer as this one where we found CRC32 to be inadequate due to too high a hash collision rate. – Joshua, 13 hours ago















