
Optimize MD5 checksum calculation #10278

Closed
cyberduck opened this issue Mar 15, 2018 · 6 comments
Labels: enhancement, fixed, s3 (AWS S3 Protocol Implementation)

4d59c66 created the issue

Two suggestions to optimize checksum calculation while uploading to S3.

I frequently upload very large files (75-100GB) to S3 and the checksum calculation adds a significant delay in a time sensitive workflow. I was just uploading a 75GB file, and the checksum calculation took 10min before the actual upload started. Actual upload time is 32min, so that adds a 33% time penalty in uploading, which is significant and very unfortunate.

  • Compute the checksum during the upload rather than in a separate pre-calculation pass (see the sketch after this list). Admittedly this weakens the check somewhat, because the file is only read once, but errors are more likely during the upload than during a local disk read.
  • The algorithm for reading the file for the checksum calculation seems slow. My primary storage (RAID5) supports read bandwidth in excess of 400MB/s, yet during the checksum calculation the read speed never exceeds 120MB/s, so the calculation is limited by the code, not by I/O bandwidth.
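
A single-pass approach would fold the digest into the upload read itself, e.g. with java.security.DigestInputStream, so the MD5 is ready the moment the last byte is sent. A minimal sketch, assuming a hypothetical streamToS3 transfer call and a 1 MiB read buffer (small read buffers are a common reason hashing falls below disk bandwidth):

```java
import java.io.BufferedInputStream;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.DigestInputStream;
import java.security.MessageDigest;

public class StreamingChecksum {
    // Computes the MD5 while the file is read for upload, avoiding a
    // separate pre-calculation pass over a 75-100GB file.
    static byte[] uploadWithDigest(Path file) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        try (InputStream in = new DigestInputStream(
                new BufferedInputStream(Files.newInputStream(file)), md5)) {
            byte[] buffer = new byte[1 << 20]; // 1 MiB reads
            int read;
            while ((read = in.read(buffer)) != -1) {
                // streamToS3(buffer, read); // hypothetical: hand the bytes to the transfer
            }
        }
        return md5.digest(); // available immediately after the last read
    }
}
```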
4e36ae0 commented

Agreed, it would be great if there were a way to disable the checksum, because it takes too long on >100GB files. I notice people complaining about it, e.g. https://community.rackspace.com/general/f/general-discussion-forum/1775/cyberduck-incredibly-slow

PS: (looking through some code changes I see you already know this) but although the ETag calculation is not officially defined by Amazon, the resulting ETag of a completed multipart upload is the MD5 of the concatenation of each part's MD5, followed by "-" and the number of parts. This could be verified if you're paranoid, though I guess it would have to be recomputed if Cyberduck is restarted in the middle of a multipart upload.
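
For reference, a minimal sketch of that (undocumented) ETag convention: MD5 over the concatenation of each part's raw 16-byte MD5 digest, hex-encoded, followed by "-" and the part count. partDigests is assumed to already hold the per-part digests:

```java
import java.security.MessageDigest;
import java.util.List;

public class MultipartEtag {
    // Reconstructs the S3 multipart ETag:
    // hex(MD5(md5(part1) || md5(part2) || ...)) + "-" + partCount
    static String multipartEtag(List<byte[]> partDigests) throws Exception {
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        for (byte[] digest : partDigests) {
            md5.update(digest); // concatenate the raw digests, not their hex strings
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md5.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex + "-" + partDigests.size();
    }
}
```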

@dkocher commented

Replying to [comment:2 jamshid]:

> Agreed, it would be great if there were a way to disable the checksum, because it takes too long on >100GB files. I notice people complaining about it, e.g. https://community.rackspace.com/general/f/general-discussion-forum/1775/cyberduck-incredibly-slow
>
> PS: (looking through some code changes I see you already know this) but although the ETag calculation is not officially defined by Amazon, the resulting ETag of a completed multipart upload is the MD5 of the concatenation of each part's MD5, followed by "-" and the number of parts. This could be verified if you're paranoid, though I guess it would have to be recomputed if Cyberduck is restarted in the middle of a multipart upload.

We already compute the concatenated MD5 hash returned for multipart uploads (see S3MultipartUploadService).

@dkocher commented

See also #10215.

@dkocher commented

We should possibly skip the checksum calculation for multipart uploads, as the parts are additionally checksummed and verified on the server side.
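
For illustration, per-part server-side verification looks roughly like the following with the AWS SDK for Java v1 (an assumption for this sketch; Cyberduck's own S3 client differs): sending the part's MD5 as Content-MD5 makes S3 reject any part whose body does not match, which is what makes a separate whole-file pre-pass redundant.

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.UploadPartRequest;
import com.amazonaws.services.s3.model.UploadPartResult;
import com.amazonaws.util.Base64;

import java.io.File;

public class VerifiedPartUpload {
    // Uploads one part with its MD5 attached; S3 verifies the body against the
    // digest and fails the request (InvalidDigest) on mismatch.
    static UploadPartResult uploadPart(AmazonS3 s3, String bucket, String key,
                                       String uploadId, int partNumber,
                                       File part, byte[] partMd5) {
        UploadPartRequest request = new UploadPartRequest()
                .withBucketName(bucket)
                .withKey(key)
                .withUploadId(uploadId)
                .withPartNumber(partNumber)
                .withFile(part)
                .withPartSize(part.length())
                .withMD5Digest(Base64.encodeAsString(partMd5)); // Content-MD5 is base64, not hex
        return s3.uploadPart(request);
    }
}
```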

@ylangisc commented

In 0b68abc.

@dkocher commented

Milestone renamed

@iterate-ch locked as resolved and limited conversation to collaborators on Nov 26, 2021