Launchpad Entry: https://launchpad.net/products/bzr/+spec/bundle-v0.10
Created: 2006-08-15 by JohnMeinel
Contributors:
Summary
A new bundle serializer needs to be written, which can write out the proposed format, read it back in, validate and insert the contents into another Repository.
Rationale
The current bundle format was a prototype and does not perform well. It is inefficient in space consumption and in creation and parsing speed. The current bundle implementation is also very sensitive to whitespace munging, which is a problem because most email transports do not guarantee whitespace preservation when sending emails (either inline or as an attachment).
Further Details
Assumptions
Use Cases
The bundle format has 2 main use cases:
- As a human readable, email friendly patch-like object, which lets people review the changes, and apply them.
- As a transport mechanism for a smart server. By bundling the changes together, the number of round trips can be reduced.
Implementation
UI Changes
There should be minimal UI changes, since this is mostly an internal format change.
Bundle Design
This discusses the basic pattern of the new bundle format.
Sections
- Header
This is mostly just a simple # Bazaar bundle v0.10 header, which can be used to identify the format of the file. (It may be 0.10 or something, since a new bundle format will be needed for the unique-roots work.) We may also consider using the version of bzr that introduced this format, like we have started doing for branch format strings.
- Human Readable Patch (optional?) This section contains the information that a human would actually want to read. It might include things like the log messages, and it certainly should include a unified diff of the changes between the source revision and the target revision.
Perhaps a diffstat header could be relevant here too --MatthieuMoy
- Binary encoded history The last section should be the actual history information, compressed with bzip2. If the output is meant to be sent over email, it should also be base64 encoded so that it survives email transports.
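As a rough illustration of the binary section, the following sketch compresses a blob of history data with bzip2 and optionally base64 encodes it for email. The function names and the for_email/from_email flags are made up for this example and are not part of bzrlib; the spec does not yet define how the serialized history bytes are produced.

```python
import base64
import bz2


def encode_history(raw_history: bytes, for_email: bool = True) -> bytes:
    """Compress the history with bzip2; base64 encode it when bound for email."""
    compressed = bz2.compress(raw_history)
    if for_email:
        # encodebytes() wraps its output into short ASCII lines, which is
        # friendlier to mail transports than one very long line.
        return base64.encodebytes(compressed)
    return compressed


def decode_history(payload: bytes, from_email: bool = True) -> bytes:
    """Reverse encode_history()."""
    if from_email:
        payload = base64.decodebytes(payload)
    return bz2.decompress(payload)
```

A smart server would presumably call encode_history(data, for_email=False) and skip the base64 step entirely.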
Requirements
There are a few requirements about the data format which should be explicitly stated.
- Merging/pulling a bundle should be as close to merging/pulling a real branch as possible. This means including history, as well as revision signatures, etc.
- If a human readable portion is present, it needs to be verified before a bundle is accepted into the target repository. This is because people are likely to read over the diff, make sure it makes sense, and then want to merge the bundle without having to go through another review step.
- Whitespace becomes an issue because of the verification requirement above. It could be possible for the validation step to verify the contents, modulo whitespace changes. At the very least, it should be possible to apply a bundle that has invalid whitespace, possibly with a --ignore-whitespace type of parameter. It might also be possible to always ignore whitespace, but turn on more strict checking by using --strict.
- Testament checking: At present, when merging another branch, some sha1 sums are verified, but not everything is verified. Bundles currently support more verification than a plain merge. And while it does have a performance implication, sha1 sums should probably be checked at a minimum.
- It would probably be nice to have a Bundle appear as a Branch+Repository object to the rest of the codebase. Then all sorts of actions can just use the existing code to do the work. For example, you could do 'bzr log bundle' to get the list of changes a bundle includes. One difficulty, though, is that a bundle needs context to be valid. So for starters, it might be best to not have Branch.open() create a BundleBranch, but only allow BundleBranch to be created by the object returned by read_bundle_from_url.
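The verification and whitespace requirements above could be sketched roughly as follows. This is only an illustration: normalize_whitespace, patch_matches and verify_sha1 are hypothetical helpers, not bzrlib functions, and the "lenient" mode here only forgives trailing whitespace and line-ending changes, which is one possible reading of "modulo whitespace".

```python
import hashlib


def normalize_whitespace(text: str) -> str:
    """Drop trailing whitespace on each line and unify line endings."""
    return "\n".join(line.rstrip() for line in text.splitlines())


def patch_matches(displayed: str, regenerated: str, strict: bool = False) -> bool:
    """Compare the patch shown to the reviewer with one regenerated from
    the binary history. With strict=True every byte must agree."""
    if strict:
        return displayed == regenerated
    return normalize_whitespace(displayed) == normalize_whitespace(regenerated)


def verify_sha1(text: bytes, expected_sha1: str) -> bool:
    """Check a file text against the sha1 recorded for it."""
    return hashlib.sha1(text).hexdigest() == expected_sha1
```

The default here is lenient, matching the "always ignore whitespace, but allow --strict" option; the opposite default would be a one-line change.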
Code Changes
This is just a brain dump of different implementation pieces.
- It would be possible to organize the bundle in the same way that we organize a VersionedFile.join() operation. The steps are:
- Get a list of involved file ids
- Copy all of the file texts into the target repository
- Copy all of the inventories
- Copy the revision signatures
- Copy the revision entries
- It might be possible to have an in memory representation for the bundle, which just looks like another Repository, and have the code do a plain inter-repository join. This leverages the existing code base a little bit better, rather than re-implementing all of the work for inserting revisions into a repository.
- I am concerned about doing this because Repository is a very fat interface. It seems wasteful to implement most of it so that we can reuse a little bit of code. However, we might be able to abuse the InterRepository mechanism to achieve code reuse without turning bundles into repositories. --AaronBentley
- For writing out the bundle contents, it should be possible to make heavy use of VersionedFile.get_deltas(). This allows us to use the existing diff if present, rather than re-generating all of the diff contents. It also means that we do not need to extract the full text for every version of every file. These 2 operations make up 90% of the current bundle building time.
- The existing join() code doesn't do as much verification as it should. There are several stages where inconsistencies could be detected and the join() aborted. This is outside of the current spec, but a brief outline is that join() could be a 2 stage process: one stage to write the texts and verify their direct sha1 sums, and then a later stage to commit the indexes to be written. In this way, invalid texts would not be added to the repository until they had been checked at higher levels. This would be reasonably easy to implement by having _KnitIndex keep the offsets cached in memory, and not writing out the final values until everything was approved. Theoretically, nothing should be committed until a Testament is checked, and possibly the revision signature has been validated.
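The 2 stage join() idea from the last item could look something like the sketch below. StagedTextStore is a toy stand-in invented for this page, not _KnitIndex or any other bzrlib class: text bodies are appended immediately, but the index entries that make them reachable are held in memory until commit() is called.

```python
import hashlib


class StagedTextStore:
    """Toy model of a store with a data area and a separate index."""

    def __init__(self):
        self.data = []      # text bodies, appended as they arrive
        self.index = {}     # committed entries: key -> offset into self.data
        self._pending = {}  # staged entries, kept in memory only

    def stage(self, key, text, expected_sha1):
        """Stage 1: verify the direct sha1 sum, then write the text."""
        if hashlib.sha1(text).hexdigest() != expected_sha1:
            raise ValueError("sha1 mismatch for %r" % (key,))
        self.data.append(text)
        self._pending[key] = len(self.data) - 1

    def commit(self):
        """Stage 2: publish the index entries once higher-level checks
        (testament, signature) have passed."""
        self.index.update(self._pending)
        self._pending.clear()

    def abort(self):
        """Forget staged entries; the written texts stay unreachable."""
        self._pending.clear()

    def get(self, key):
        return self.data[self.index[key]]
```

Until commit() runs, get() cannot see a staged text, so an aborted join leaves only unreferenced data behind rather than an inconsistent index.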
Schema Changes
Data Migration
Discussion
Unresolved Issues
Questions and Answers
JamesBlackwell - One possibility is to eschew human readable bundles, relying instead upon setting up a new mime type for bundles. This would allow for richer bundles that would be similar to what smart servers would likely use, and would provide more flexibility in bundle content.
Pure binary versus Human readable
If using the bundle for a smart server, there is no real need for it to be human readable. It only adds extra processing overhead on both sides. It would seem reasonable for bundles to be created that only contain the binary history portion, whether this is an entirely different format version or simply a flag inside the file itself. A raw binary version would not need the history to be base64 encoded, and would not need to include the diff hunk at the top.
Note: the "diff hunk at the top" in v0.8 bundles is unique data, required to reconstruct the head revision.
We must make certain that it is impossible to create a "binary bundle" that looks like a human-readable bundle, to avoid attacks. In fact, a binary bundle would be most efficiently expressed as another format entirely, perhaps a bunch of knit deltas. --AaronBentley
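One way to keep the two kinds of bundle from being confused is to give them leading bytes that can never overlap. The sketch below is purely illustrative: TEXT_MAGIC is the header line quoted earlier on this page, while BINARY_MAGIC and detect_bundle_kind are hypothetical and not part of any agreed format.

```python
TEXT_MAGIC = b"# Bazaar bundle v0.10\n"   # header line quoted in this spec
BINARY_MAGIC = b"\x00BZR-BINARY-BUNDLE\n"  # hypothetical, illustration only


def detect_bundle_kind(data: bytes) -> str:
    """Classify a bundle by its leading bytes, rejecting anything else."""
    if data.startswith(TEXT_MAGIC):
        return "text"
    if data.startswith(BINARY_MAGIC):
        return "binary"
    raise ValueError("not a recognised bundle")
```

Starting the binary marker with a NUL byte means it cannot also begin with the '#' of the text header, so no input can satisfy both checks.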
