
#10278 closed enhancement (fixed)

Optimize MD5 checksum calculation

Reported by: allklier Owned by: dkocher
Priority: normal Milestone: 7.2
Component: s3 Version: 6.4.1
Severity: normal Keywords:
Cc: Architecture:
Platform: macOS 10.12

Description

Two suggestions to optimize checksum calculation while uploading to S3.

I frequently upload very large files (75-100 GB) to S3, and the checksum calculation adds a significant delay in a time-sensitive workflow. I was just uploading a 75 GB file, and the checksum calculation took 10 minutes before the actual upload started. The upload itself takes 32 minutes, so the checksum adds a time penalty of roughly 31%, which is significant and very unfortunate.

  • Compute the checksum during the upload rather than in a separate pre-calculation pass. Yes, that reduces the redundancy of the checksum because the file is read only once, but errors are more likely during the upload than during a local disk read.
  • The algorithm for reading the file for checksum calculation seems slow. My primary storage (RAID5) supports read bandwidth in excess of 400 MB/s, yet during the checksum calculation the read speed never exceeds 120 MB/s, so the calculation is limited by the code, not by I/O bandwidth.
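To illustrate the first suggestion, here is a minimal Python sketch of hashing the data in the same pass that sends it, so the file is read only once. It is not Cyberduck's implementation: the function name `stream_with_md5`, the file-like `source` and `sink` objects, and the 8 MiB default chunk size are all assumptions made for the example.

```python
import hashlib
import io


def stream_with_md5(source, sink, chunk_size=8 * 1024 * 1024):
    """Copy source to sink chunk by chunk, hashing each chunk on the way.

    source and sink are hypothetical file-like objects; in a real upload
    the sink would be the request body stream rather than a local buffer.
    Returns the hex MD5 digest and the number of bytes copied.
    """
    digest = hashlib.md5()
    total = 0
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        digest.update(chunk)  # hash the chunk that is about to be sent
        sink.write(chunk)     # send the same bytes, no second read
        total += len(chunk)
    return digest.hexdigest(), total


if __name__ == "__main__":
    src = io.BytesIO(b"example payload" * 1024)
    dst = io.BytesIO()
    md5_hex, copied = stream_with_md5(src, dst, chunk_size=4096)
    print(md5_hex, copied)
```

A larger chunk size reduces the number of read calls, which may help with the second point, though whether it closes the gap to the disk's read bandwidth would have to be measured.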

Change History (12)

comment:1 Changed on Mar 15, 2018 at 8:41:07 PM by dkocher

  • Component changed from core to s3
  • Owner set to dkocher

comment:2 follow-up: Changed on Aug 24, 2018 at 3:45:28 PM by jamshid

Agreed, it would be great if there was a way to disable the checksum because it takes too long on >100GB files. I notice people complain about it e.g. https://community.rackspace.com/general/f/general-discussion-forum/1775/cyberduck-incredibly-slow

PS: (looking through some code changes I see you already know this) but although the ETag calculation is not officially defined by Amazon, the resulting ETag of a completed multipart upload is the MD5 of the concatenated MD5 digests of each part, followed by "-" and the number of parts. This could be verified if you're paranoid, though I guess it would have to be recomputed if Cyberduck is restarted in the middle of a multipart upload.
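As a rough Python sketch of the ETag scheme described above: since Amazon does not officially define the format, this follows the commonly observed behaviour rather than a specification. The function name `multipart_etag` and the in-memory `data` argument are assumptions made for the example, and the result only matches when the same part size is used as in the actual upload.

```python
import hashlib


def multipart_etag(data: bytes, part_size: int) -> str:
    """Return the expected ETag for data uploaded in parts of part_size bytes.

    Hypothetical helper: assumes non-empty data and no server-side
    encryption that would change how the ETag is derived.
    """
    # Raw (binary) MD5 digest of each part, in upload order.
    part_digests = [
        hashlib.md5(data[offset:offset + part_size]).digest()
        for offset in range(0, len(data), part_size)
    ]
    # MD5 over the concatenated part digests, then "-" and the part count.
    combined = hashlib.md5(b"".join(part_digests)).hexdigest()
    return f"{combined}-{len(part_digests)}"


if __name__ == "__main__":
    print(multipart_etag(b"x" * 25, part_size=10))
```

If the per-part digests are kept alongside the upload state, an interrupted upload could be verified on resume without rereading parts that were already sent.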

Last edited on Aug 24, 2018 at 4:20:26 PM by jamshid

comment:3 in reply to: ↑ 2 Changed on Oct 31, 2018 at 9:47:03 AM by dkocher

Replying to jamshid:

Agreed, it would be great if there was a way to disable the checksum because it takes too long on >100GB files. I notice people complain about it e.g. https://community.rackspace.com/general/f/general-discussion-forum/1775/cyberduck-incredibly-slow

PS: (looking through some code changes I see you already know this) but although the ETag calculation is not officially defined by Amazon, the resulting ETag of a completed multipart upload is the MD5 of the concatenated MD5 digests of each part, followed by "-" and the number of parts. This could be verified if you're paranoid, though I guess it would have to be recomputed if Cyberduck is restarted in the middle of a multipart upload.

We already compute the concatenated MD5 hash returned for multipart uploads (see S3MultipartUploadService).

comment:4 Changed on Oct 31, 2018 at 9:47:11 AM by dkocher

  • Milestone set to 7.0
  • Status changed from new to assigned

comment:5 Changed on Oct 31, 2018 at 9:47:26 AM by dkocher

See also #10215.

comment:6 Changed on Oct 31, 2018 at 9:49:29 AM by dkocher

  • Summary changed from Optimize Checksum Calculation to Optimize MD5 checksum calculation

comment:7 Changed on Oct 31, 2018 at 9:51:27 AM by dkocher

We should possibly skip the full-file checksum calculation for multipart uploads, since we additionally checksum the individual parts, which are verified on the server side.

comment:8 Changed on Nov 22, 2018 at 10:20:02 AM by dkocher

  • Resolution set to duplicate
  • Status changed from assigned to closed

comment:9 Changed on Oct 18, 2019 at 1:52:31 PM by dkocher

  • Milestone changed from 7.0 to 8.0
  • Resolution duplicate deleted
  • Status changed from closed to reopened

comment:10 Changed on Oct 31, 2019 at 8:20:46 AM by yla

  • Resolution set to fixed
  • Status changed from reopened to closed

In r48025.

comment:11 Changed on Nov 15, 2019 at 11:35:13 AM by dkocher

  • Milestone changed from 8.0 to 7.1.3

comment:12 Changed on Nov 22, 2019 at 9:36:20 AM by dkocher

  • Milestone changed from 7.1.3 to 7.2

Milestone renamed
