//
// Copyright (c) 2023 Nathan Fiedler
//

//! This crate implements multiple versions of the FastCDC content defined
//! chunking algorithm in pure Rust. A critical aspect of the behavior of this
//! algorithm is that it returns exactly the same results for the same input.
//!
//! To learn more about content defined chunking and its applications, see
//! [FastCDC: a Fast and Efficient Content-Defined Chunking Approach for Data
//! Deduplication](https://www.usenix.org/system/files/conference/atc16/atc16-paper-xia.pdf)
//! from 2016, as well as the subsequent improvements described in the
//! [paper](https://ieeexplore.ieee.org/document/9055082) from 2020.
//!
//! ## Migration from pre-3.0
//!
//! If you were using a release of this crate from before the 3.0 release, you
//! will need to make a small adjustment to continue using the same
//! implementation as before.
//!
//! Before the 3.0 release:
//!
//! ```no_run
//! # use fastcdc::ronomon as fastcdc;
//! # use std::fs;
//! # let contents = fs::read("test/fixtures/SekienAkashita.jpg").unwrap();
//! let chunker = fastcdc::FastCDC::new(&contents, 8192, 16384, 32768);
//! ```
//!
//! After the 3.0 release:
//!
//! ```no_run
//! # use std::fs;
//! # let contents = fs::read("test/fixtures/SekienAkashita.jpg").unwrap();
//! let chunker = fastcdc::ronomon::FastCDC::new(&contents, 8192, 16384, 32768);
//! ```
//!
//! The cut points produced will be identical to previous releases as the
//! `ronomon` implementation was never changed in that manner. Note, however,
//! that the other implementations _will_ produce different results.
//!
//! ## Implementations
//!
//! This crate started as a translation of a variation of FastCDC implemented
//! in the [ronomon/deduplication](https://github.com/ronomon/deduplication)
//! repository, written by Joran Dirk Greef. That variation makes several
//! changes to the original algorithm, primarily to accommodate JavaScript. The
//! Rust version of this variation is found in the `ronomon` module in this
//! crate.
//!
//! For a canonical implementation of the algorithm as described in the 2016
//! paper, see the `v2016` module.
//!
//! For a canonical implementation of the algorithm as described in the 2020
//! paper, see the `v2020` module. This implementation produces cut points
//! identical to those of the 2016 version, but does so a bit faster.
//!
//! If you are using this crate for the first time, the `v2020` implementation
//! would be the most appropriate. It uses 64-bit hash values and tends to be
//! faster than both the `ronomon` and `v2016` versions.
//!
//! ## Examples
//!
//! A short example of using the `v2020` chunker is shown below:
//!
//! ```no_run
//! use std::fs;
//! use fastcdc::v2020;
//! let contents = fs::read("test/fixtures/SekienAkashita.jpg").unwrap();
//! let chunker = v2020::FastCDC::new(&contents, 4096, 16384, 65535);
//! for entry in chunker {
//!     println!("offset={} size={}", entry.offset, entry.length);
//! }
//! ```
//!
//! The example above uses normalized chunking level 1 as described in section
//! 3.5 of the 2020 paper. To use a different level of chunking normalization,
//! replace `new` with `with_level` as shown below:
//!
//! ```no_run
//! use std::fs;
//! use fastcdc::v2020::{FastCDC, Normalization};
//! let contents = fs::read("test/fixtures/SekienAkashita.jpg").unwrap();
//! let chunker = FastCDC::with_level(&contents, 8192, 16384, 32768, Normalization::Level3);
//! for entry in chunker {
//!     println!("offset={} size={}", entry.offset, entry.length);
//! }
//! ```
//!
//! Notice that the minimum and maximum chunk sizes were changed in the example
//! that uses the maximum normalization level. This is because normalized
//! chunking causes the generated chunks to cluster more closely around the
//! average chunk size. Changing the min/max values is not required, but it is
//! something to be aware of. With lower levels of normalized chunking, the
//! sizes of the generated chunks will vary more. See the documentation of the
//! `Normalization` enum for more detail, as well as the FastCDC paper.
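//!
//! For comparison, the sketch below (assuming the `Normalization::Level0`
//! variant, which turns normalized chunking off) produces chunks whose sizes
//! vary the most for a given min/avg/max configuration:
//!
//! ```no_run
//! use std::fs;
//! use fastcdc::v2020::{FastCDC, Normalization};
//! let contents = fs::read("test/fixtures/SekienAkashita.jpg").unwrap();
//! // Level0 disables normalization, so chunk sizes spread across the full
//! // min/max range rather than clustering around the average size.
//! let chunker = FastCDC::with_level(&contents, 8192, 16384, 32768, Normalization::Level0);
//! for entry in chunker {
//!     println!("offset={} size={}", entry.offset, entry.length);
//! }
//! ```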
//!
//! ## Minimum and Maximum
//!
//! The values you choose for the minimum and maximum chunk sizes will depend on
//! the input data to some extent, as well as the normalization level described
//! above. Depending on your application, you may want to have a wide range of
//! chunk sizes in order to improve the overall deduplication ratio.
//!
//! Note that changing the minimum chunk size will almost certainly result in
//! different cut points. It is best to pick a minimum chunk size for your
//! application that can remain relevant indefinitely, lest you produce
//! different sets of chunks for the same data.
//!
//! Similarly, setting the maximum chunk size too small may result in cut
//! points that are determined by the maximum size rather than by the data
//! itself. Ideally you want cut points that are determined by the input data.
//! However, this is application dependent and your situation may be different.
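//!
//! As a rough sketch (the specific values here are only illustrative), a wider
//! spread between the minimum and maximum sizes gives the content more room to
//! determine the cut points:
//!
//! ```no_run
//! use std::fs;
//! use fastcdc::v2020;
//! let contents = fs::read("test/fixtures/SekienAkashita.jpg").unwrap();
//! // A wide min/max spread around the 16 KB average; most cut points will be
//! // chosen by the content rather than forced by the size limits.
//! let chunker = v2020::FastCDC::new(&contents, 2048, 16384, 131072);
//! for entry in chunker {
//!     println!("offset={} size={}", entry.offset, entry.length);
//! }
//! ```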
//!
//! ## Large Data
//!
//! If processing very large files, the streaming versions of the chunkers in
//! the `v2016` and `v2020` modules may be a suitable approach. They both
//! allocate a byte vector equal to the maximum chunk size, draining and
//! resizing the vector as chunks are found. However, using a crate such as
//! `memmap2` can be significantly faster than the streaming chunkers. See the
//! examples in the `examples` directory for how to use the streaming versions
//! as-is, versus the non-streaming chunkers which read from a memory-mapped
//! file.
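//!
//! A minimal sketch of the streaming approach, assuming the `StreamCDC` type
//! in the `v2020` module and the `offset` and `length` fields of the chunk
//! data it yields:
//!
//! ```no_run
//! use std::fs::File;
//! use fastcdc::v2020::StreamCDC;
//! // The source only needs to implement `std::io::Read`.
//! let source = File::open("test/fixtures/SekienAkashita.jpg").unwrap();
//! let chunker = StreamCDC::new(source, 4096, 16384, 65535);
//! for result in chunker {
//!     // Each item is a `Result`, since reading from the source may fail.
//!     let chunk = result.unwrap();
//!     println!("offset={} size={}", chunk.offset, chunk.length);
//! }
//! ```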

pub mod ronomon;
pub mod v2016;
pub mod v2020;