                       How cvs2svn.py Works
                      =====================

A cvs2svn run consists of five passes.  The first three passes save
their data to files on disk, so that a) we don't hold huge amounts of
state in memory, and b) the conversion process is resumable.

Pass 1:
=======

The goal of this pass is to get a summary of all the revisions for
each file written out to 'cvs2svn-data.revs'; at the end of this
stage, revisions will be grouped by RCS file, not by logical commits.

We walk over the repository, processing each RCS file with
rcsparse.parse(), using cvs2svn's CollectData class, which is a
subclass of rcsparse.Sink(), the parser's callback class.  For each
RCS file, the first thing the parser encounters is the administrative
header, including the head revision, the principal branch, symbolic
names, RCS comments, etc.  The main thing that happens here is that
CollectData.define_tag() is invoked on each symbolic name and its
attached revision, so all the tags and branches of this file get
collected.

Next, the parser hits the revision summary section.  That's the part
of the RCS file that looks like this:

   1.6
   date	2002.06.12.04.54.12;	author captnmark;	state Exp;
   branches
   	1.6.2.1;
   next	1.5;
   
   1.5
   date	2002.05.28.18.02.11;	author captnmark;	state Exp;
   branches;
   next	1.4;

   [...]

For each revision summary, CollectData.define_revision() is invoked,
recording that revision's metadata in the self.rev_data[] tree.

After finishing the revision summaries, the parser invokes
CollectData.tree_completed(), which loops over the revisions in
self.rev_data, determining if there are instances where a higher
revision was committed "before" a lower one (rare, but it can happen
when there was clock skew on the repository machine).  If there are
any, it "resyncs" the timestamp of the higher rev to be just after
that of the lower rev, but saves the original timestamp in
self.rev_data[blah][3], so we can later write out a record to the
resync file indicating that an adjustment was made (this makes it
possible to catch the other parts of this commit and resync them
similarly, more details below).

Next, the parser encounters the *real* revision data, which has the
log messages and file contents.  For each revision, it invokes
CollectData.set_revision_info(), which writes a new line to
cvs2svn-data.revs, like this:

   3dc32955 5afe9b4ba41843d8eb52ae7db47a43eaa9573254 C 1.2 N * 0 0 foo/bar,v

The fields are:

   1. a fixed-width timestamp
   2. a digest of the log message + author
   3. the type of change ("C"hange, or "D"elete)
   4. the revision number
   5. "N" if this revision has non-empty deltatext, else "E" for empty
   6. the branch on which this commit happened, or "*" if not on a branch
   7. the number of tags rooted at this revision (followed by their
        names, space-delimited)  
   8. the number of branches rooted at this revision (followed by
        their names, space-delimited) 
   9. the path of the RCS file in the repository

(Of course, in the above example, fields 7 and 8 are "0", so they have
no additional data.)
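
As an illustration of the line layout just described, here is a minimal
Python 3 sketch of a parser for one cvs2svn-data.revs line.  It is not
code from cvs2svn.py: the names RevLine and parse_revs_line are
hypothetical, and it assumes the timestamp is hexadecimal (as the
example line suggests) and that only the final field, the path, may
contain spaces.

```python
from collections import namedtuple

# Hypothetical record type; the field names are illustrative only.
RevLine = namedtuple(
    "RevLine",
    "timestamp digest op rev deltatext branch tags branches path")


def parse_revs_line(line):
    """Split one cvs2svn-data.revs line into the nine fields listed above."""
    parts = line.rstrip("\n").split(" ")
    timestamp = int(parts[0], 16)          # field 1 (assumed to be hex)
    digest, op, rev, deltatext = parts[1:5]
    branch = None if parts[5] == "*" else parts[5]

    # Fields 7 and 8 are each a count followed by that many names.
    i = 6
    ntags = int(parts[i])
    tags = parts[i + 1:i + 1 + ntags]
    i += 1 + ntags
    nbranches = int(parts[i])
    branches = parts[i + 1:i + 1 + nbranches]
    i += 1 + nbranches

    # Field 9: everything that remains is the RCS file path.
    path = " ".join(parts[i:])
    return RevLine(timestamp, digest, op, rev, deltatext,
                   branch, tags, branches, path)
```

Applied to the example line above, this yields revision "1.2", no
branch, empty tag and branch lists, and the path "foo/bar,v".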

Also, for resync'd revisions, a line like this is written out to
'cvs2svn-data.resync':

   3d6c1329 18a215a05abea1c6c155dcc7283b88ae7ce23502 3d6c1328

The fields are:

   NEW_TIMESTAMP   DIGEST   OLD_TIMESTAMP

(The resync file will be explained later.)

That's it -- the RCS file is done.

When every RCS file is done, Pass 1 is complete, and:

   - cvs2svn-data.revs contains a summary of every RCS file's
     revisions.  All the revisions for a given RCS file are grouped
     together, but note that the groups are in no particular order.
     In other words, you can't yet identify the commits from looking
     at these lines; a multi-file commit will be scattered all over
     the place.

   - cvs2svn-data.resync contains a small amount of resync data, in
     no particular order.

Pass 2:
=======

This is where the resync file is used.  The goal of this pass is to
convert cvs2svn-data.revs to a new file, 'cvs2svn-data.c-revs' (clean
revs).  It's the same as the original file, except for some resync'd
timestamps.

First, read the whole resync file into a hash table that maps each
author+log digest to a list of lists.  Each sublist represents one of
the timestamp adjustments from Pass 1, and looks like this:

   [old_time_lower, old_time_upper, new_time]

The reason to map each digest to a list of sublists, instead of to one
list, is that sometimes you'll get the same digest for unrelated
commits (for example, the same author commits many times using the
empty log message, or a log message that just says "Doc tweaks.").  So
each digest may need to "fan out" to cover multiple commits, but
without accidentally unifying those commits.
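
A minimal Python 3 sketch of building that hash table follows.  It
assumes the NEW_TIMESTAMP DIGEST OLD_TIMESTAMP field order given in
Pass 1 and hexadecimal timestamps.  The value of COMMIT_THRESHOLD and
the choice to centre the initial window on the old timestamp are
assumptions made for illustration, and read_resync is a hypothetical
name.

```python
from collections import defaultdict

COMMIT_THRESHOLD = 5 * 60  # seconds; assumed value, not taken from this text


def read_resync(lines):
    """Map each digest to a list of [old_lower, old_upper, new_time]."""
    resync = defaultdict(list)
    for line in lines:
        new_hex, digest, old_hex = line.split()
        new_time = int(new_hex, 16)
        old_time = int(old_hex, 16)
        # Assumption: the window starts out centred on the old timestamp.
        resync[digest].append([old_time - COMMIT_THRESHOLD // 2,
                               old_time + COMMIT_THRESHOLD // 2,
                               new_time])
    return dict(resync)
```

Two resync lines that share a digest produce two separate sublists
under the same key, which is what keeps unrelated commits with the
same log message apart.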

Now we loop over cvs2svn-data.revs, writing each line out to
'cvs2svn-data.c-revs'.  Most lines are written out unchanged, but
those whose digest matches some resync entry, and which appear to be part of
the same commit as one of the sublists in that entry, get tweaked.
The tweak is to adjust the commit time of the line to the new_time,
which is taken from the resync hash and results from the adjustment
described in Pass 1.

The way we figure out whether a given line needs to be tweaked is to
loop over all the sublists, seeing if this commit's original time
falls within the old<-->new time range for the current sublist.  If it
does, we tweak the line before writing it out, and then conditionally
adjust the sublist's range to account for the timestamp we just
adjusted (since it could be an outlier).  Note that this could, in
theory, result in separate commits being accidentally unified, since
we might gradually adjust the two sides of the range such that they are
eventually more than COMMIT_THRESHOLD seconds apart.  However, this is
really a case of CVS not recording enough information to disambiguate
the commits; we'd know we have a time range that exceeds the
COMMIT_THRESHOLD, but we wouldn't necessarily know where to divide it
up.  We could try some clever heuristic, but for now it's not
important -- after all, we're talking about commits that weren't
important enough to have a distinctive log message anyway, so does it
really matter if a couple of them accidentally get unified?  Probably
not.
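
The matching loop described above might look roughly like the
following Python 3 sketch.  It is a simplified illustration, not the
code in cvs2svn.py: resync_timestamp is a hypothetical name, the
COMMIT_THRESHOLD value is assumed, and widening the range by half the
threshold on each side of the matched time is one possible reading of
"conditionally adjust".

```python
COMMIT_THRESHOLD = 5 * 60  # seconds; assumed value, not taken from this text


def resync_timestamp(resync, digest, timestamp):
    """Return the (possibly adjusted) commit time for one revs line.

    'resync' maps digest -> list of [old_lower, old_upper, new_time].
    """
    for record in resync.get(digest, []):
        lower, upper, new_time = record
        if lower <= timestamp <= upper:
            # This line belongs to the resync'd commit.  Widen the range
            # in place, since this timestamp may be an outlier.
            record[0] = min(lower, timestamp - COMMIT_THRESHOLD // 2)
            record[1] = max(upper, timestamp + COMMIT_THRESHOLD // 2)
            return new_time
    return timestamp  # no sublist matched: the line is left unchanged
```

Because each match can widen the range, a long run of closely spaced
commits with the same digest can keep stretching it, which is the
accidental unification discussed above.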

Pass 3:
=======

This is where we deduce the changesets, that is, the grouping of file
changes into single commits.

It's very simple -- run 'sort' on cvs2svn-data.c-revs, converting it
to 'cvs2svn-data.s-revs'.  Because of the way the data is laid out,
this causes commits with the same digest (that is, the same author and
log message) to be grouped together.  Poof!  We now have the CVS
changes grouped by logical commit.

In some cases, the changes in a given commit may be interleaved with
other commits that went on at the same time, because the sort gives
precedence to date before log digest.  However, Pass 4 detects this by
seeing that the log digest is different, and reseparates the commits.
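
To make the reseparation concrete, here is a small Python 3 sketch of
grouping already-sorted lines into commits.  It is an illustration
under stated assumptions, not the Pass 4 code: group_commits is a
hypothetical name, the COMMIT_THRESHOLD value is assumed, and a commit
is considered finished once no line with its digest has been seen for
more than COMMIT_THRESHOLD seconds.

```python
COMMIT_THRESHOLD = 5 * 60  # seconds; assumed value, not taken from this text


def group_commits(sorted_lines):
    """Group sorted .s-revs lines into commits; returns [(digest, lines)]."""
    open_commits = {}  # digest -> [time of latest line, lines so far]
    finished = []
    for line in sorted_lines:
        stamp, digest = line.split(" ", 2)[:2]
        timestamp = int(stamp, 16)
        # Close every open commit that has been quiet for too long.
        stale = [d for d, (last, _) in open_commits.items()
                 if timestamp - last > COMMIT_THRESHOLD]
        for d in sorted(stale, key=lambda d: open_commits[d][0]):
            finished.append((d, open_commits.pop(d)[1]))
        # Interleaved lines are kept apart because they have different
        # digests and so land in different open commits.
        entry = open_commits.setdefault(digest, [timestamp, []])
        entry[0] = timestamp
        entry[1].append(line)
    # Flush whatever is still open at the end of the file.
    for d in sorted(open_commits, key=lambda d: open_commits[d][0]):
        finished.append((d, open_commits[d][1]))
    return finished
```

Since the timestamps are fixed-width, a plain lexicographic sort of
the lines (for example, sorted(lines)) orders them by time first and
digest second, matching the behaviour of 'sort' described above.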

Pass 4:
=======

In --dump-only mode, the result of this pass is a Subversion
repository dumpfile (suitable for input to 'svnadmin load').  The
dumpfile is the data's last static stage: last chance to check over
the data, run it through svndumpfilter, move the dumpfile to another
machine, etc.

However, when not in --dump-only mode, no full dumpfile is created for
subsequent load into a Subversion repository.  Instead, miniature
dumpfiles, each representing a single revision, are created, loaded
into the repository, and then removed.

In both modes, the dumpfile revisions are created by walking through
cvs2svn-data.s-revs.

                  ===============================
                      Branches and Tags Plan.
                  ===============================

This pass is also where tag and branch creation is done.  Since
Subversion does tags and branches by copying from existing revisions
(then maybe editing the copy, making subcopies underneath, etc), the
big question for cvs2svn is how to achieve the minimum number of
operations per creation.  For example, if it's possible to get the
right tag by just copying revision 53, then it's better to do that
than, say, copying revision 51 and then sub-copying in bits of
revision 52 and 53.

Also, since CVS does not version symbolic names, there is the
secondary question of *when* to create a particular tag or branch.
For example, a tag might have been made at any time after the youngest
commit included in it, or might even have been made piecemeal; and the
same is true for a branch, with the added constraint that for any
particular file, the branch must have been created before the first
commit on the branch.

Answering the second question first: cvs2svn creates tags and branches
as late as possible.  For branches, this is "just in time" creation --
the moment it sees the first commit on a branch, it snaps the entire
branch into existence (or as much of it as possible), and then outputs
the branch commit.

The reason we say "as much of it as possible" is that it's possible to
have a branch where some files have branch commits occurring earlier
than the other files even have the source revisions from which the
branch sprouts (this can happen if the branch was created piecemeal,
for example).  In this case, we create as much of the branch as we
can, that is, as much of it as there are source revisions available to
copy, and leave the rest for later.  "Later" might mean just until
other branch commits come in, or else during a cleanup stage that
happens at the end of this pass (about which more later).

All tags are created during the cleanup stage, after all regular
commits have been made.  That way there's no need to worry whether all
the required revisions for a particular tag have been committed yet,
and it's as correct as any other time, since no one can tell when a
tag was made anyway.

How just-in-time branch creation works:

In order to make the "best" set of copies/deletes when creating a
branch, cvs2svn keeps track of two sets of trees while it's making
commits:

   1. A skeleton mirror of the Subversion repository, that is, an
      array of revisions, with a tree hanging off each revision.  (The
      "array" is actually implemented as an anydbm database itself,
      mapping string representations of numbers to root keys.)

   2. A tree for each CVS symbolic name, and the svn file/directory
      revisions from which various parts of that tree could be copied.

Both tree sets live in anydbm databases, using the same basic schema:
unique keys map to marshal.dumps() representations of dictionaries,
which in turn map entry names to other unique keys:

   root_key  ==> { entryname1 : entrykey1, entryname2 : entrykey2, ... }
   entrykey1 ==> { entrynameX : entrykeyX, ... }
   entrykey2 ==> { entrynameY : entrykeyY, ... }
   entrykeyX ==> { etc, etc ...}
   entrykeyY ==> { etc, etc ...}

(The leaf nodes -- files -- are also dictionaries, for simplicity.)

Both file and directory dictionaries store metadata under special keys
whose names start with "/", so they can always be distinguished from
entries (for example, search for "/mutable", "/openings", or
"/closings" in cvs2svn.py).
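
The schema can be sketched in a few lines of Python 3.  This is an
illustration only: TreeDB and its methods are hypothetical names, a
plain dict stands in for the anydbm database so that the example is
self-contained, and keys are generated from a simple counter.

```python
import marshal


class TreeDB:
    """Toy version of the key ==> marshalled-dictionary schema."""

    def __init__(self):
        self.db = {}        # stand-in for an anydbm database
        self.next_key = 0
        self.root_key = self._new_node()

    def _new_node(self):
        key = str(self.next_key)
        self.next_key += 1
        self.db[key] = marshal.dumps({})  # files are dictionaries too
        return key

    def add_path(self, path):
        """Create any missing nodes along 'path'; return the leaf key."""
        key = self.root_key
        for name in path.strip("/").split("/"):
            entries = marshal.loads(self.db[key])
            if name not in entries:
                entries[name] = self._new_node()
                self.db[key] = marshal.dumps(entries)
            key = entries[name]
        return key

    def lookup(self, path):
        """Return the key for 'path', or None if it does not exist."""
        key = self.root_key
        for name in path.strip("/").split("/"):
            entries = marshal.loads(self.db[key])
            if name not in entries:
                return None
            key = entries[name]
        return key
```

For example, TreeDB().add_path("trunk/lib/driver.c") creates three
nodes below the root, each stored as its own marshalled dictionary.
Metadata could be added to a node under a name starting with "/"
without colliding with any entry name.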

The repository mirror allows cvs2svn to remember what paths exist in
what revisions.  For each file path in a revision, it records what
tags and branches can sprout from that revision; when the file
changes, these attributes do not propagate to the new revision, since
the symbolic name isn't based on that revision.

The symbolic name trees are all stored in one db file, as paths, where
the first element in each path is the symbolic name, and the rest is
the full Subversion path to the file in question.  For example, if the
Subversion revision 7 is the root of branch 'Rel_1', this fact would
be recorded under the path

   '/Rel_1/myproj/trunk/lib/driver.c'

(the exact layout is dependent on the make_path() function in
cvs2svn.py, which may change).

   root_key  ==> { 'Rel_1' : 'a', ... }
   'a'       ==> { 'myproj' : 'b', ... }
   'b'       ==> { 'trunk' : 'c', ... }
   'c'       ==> { 'lib' : 'd', ... }
   'd'       ==> { 'driver.c' : 'e', ... }
   'e'       ==> { }

The source revision is stored in the leaf node, and also in all the
parent nodes, in the manner described in the class documentation for
'SymbolicNameTracker'.  The special entries "/opening" and "/closing"
are not shown above, for brevity, but their values are where the
revision ranges are stored (that is, the ranges indicating when this
path could be copied from to produce the tag or branch in question).

When it's time to create a branch or tag, cvs2svn.py walks the
appropriate symbolic name tree, calculating the ideal source revision
for each subpath (see 'SymbolicNameTracker' for the exact algorithm)
and emitting the minimum number of copies to the dumpfile and to the
skeleton repository mirror.  As it goes, it marks each path as
emitted, so that we don't redo the same copies during the cleanup
phase later on.

At this point, the entire branch is done except for:

   1. Any source revisions that haven't yet been committed (this is
      a rare situation, but anyway such revisions will automatically
      be handled later by the same algorithm, invoked either due to
      another commit on the branch, or in the cleanup phase), and

   2. Files that were accidentally copied onto the branch as part of a
      subtree, but which don't actually belong on the branch, because
      the corresponding CVS file doesn't contain that tag.

We handle (2) by doing tree diffs between the newly copied tree in the
skeleton repository mirror, and the corresponding portion of the
symbolic name tree.  If the skeleton mirror has a file that's not in
the symbolic name tree, we emit a delete to the dumpfile and remove
that path from the skeleton mirror.
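
A minimal Python 3 sketch of such a tree diff is shown here.  It works
on in-memory nested dictionaries rather than on the anydbm-backed
trees, and stray_paths is a hypothetical name.  It assumes that when a
whole directory is missing from the symbolic name tree, a single
delete of that directory is enough.

```python
def stray_paths(mirror, symbolic, prefix=""):
    """List paths in the copied mirror subtree that the symbolic name
    tree does not contain, in sorted order."""
    stray = []
    for name in sorted(mirror):
        if name.startswith("/"):
            continue  # metadata key, not a directory entry
        path = prefix + "/" + name if prefix else name
        if name not in symbolic:
            # Not part of the tag or branch: one delete removes the
            # file, or the directory together with everything below it.
            stray.append(path)
        else:
            stray.extend(stray_paths(mirror[name], symbolic[name], path))
    return stray
```

Each returned path would correspond to one delete emitted to the
dumpfile and one removal from the skeleton mirror.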

The cleanup phase happens after all regular changes have been
processed.  Just loop over the "root directory" of the symbolic name
tree, running the same creation algorithm on each name (we'll have to
distinguish between branches and tags, probably through a special
entry on the directory object), skipping parts of the tree already
marked as copied.


Pass 5:
=======

Unless we're skipping cleanup, remove all our intermediate files.




-*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*- -*-

Some older notes and ideas about cvs2svn.  Not deleted, because they
may contain suggestions for future improvements in design.

-----------------------------------------------------------------------

An email from John Gardiner Myers <jgmyers@speakeasy.net> about some
considerations for the tool.

------
From: John Gardiner Myers <jgmyers@speakeasy.net>                     
Subject: Thoughts on CVS to SVN conversion
To: gstein@lyra.org                                  
Date: Sun, 15 Apr 2001 17:47:10 -0700

Some things you may want to consider for a CVS to SVN conversion utility:

If converting a CVS repository to SVN takes days, it would be good for     
the conversion utility to keep its progress state on disk.  If the
conversion fails halfway through due to a network outage or power
failure, that would allow the conversion to be resumed where it left off
instead of having to start over from an empty SVN repository.

It is a short step from there to allowing periodic updates of a
read-only SVN repository from a read/write CVS repository.  This allows
the more relaxed conversion procedure:

1) Create SVN repository writable only by the conversion tool.
2) Update SVN repository from CVS repository.
3) Announce the time of CVS to SVN cutover.
4) Repeat step (2) as needed.
5) Disable commits to CVS repository, making it read-only.
6) Repeat step (2).
7) Enable commits to SVN repository.
8) Wait for developers to move their workspaces to SVN.
9) Decommission the CVS repository.

You may forward this message or parts of it as you see fit.
------

-----------------------------------------------------------------------

Further design thoughts from Greg Stein <gstein@lyra.org>

* timestamp the beginning of the process. ignore any commits that
  occur after that timestamp; otherwise, you could miss portions of a
  commit (e.g. scan A; commit occurs to A and B; scan B; create SVN
  revision for items in B; we missed A)

* the above timestamp can also be used for John's "grab any updates
  that were missed in the previous pass."

* for each file processed, watch out for simultaneous commits. this
  may cause a problem during the reading/scanning/parsing of the file,
  or the parse succeeds but the results are garbage. this could be
  fixed with a CVS lock, but I'd prefer read-only access.

  algorithm: get the mtime before opening the file. if an error occurs
  during reading, and the mtime has changed, then restart the file. if
  the read is successful, but the mtime changed, then restart the
  file.

* use a separate log to track unique branches and non-branched forks
  of revision history (Q: is it possible to create, say, 1.4.1.3
  without a "real" branch?). this log can then be used to create a
  /branches/ directory in the SVN repository.

  Note: we want to determine some way to coalesce branches across
  files. It can't be based on name, though, since the same branch name
  could be used in multiple places, yet they are semantically
  different branches. Given files R, S, and T with branch B, we can
  tie those files' branch B into a "semantic group" whenever we see
  commit groups on a branch touching multiple files. Files that
  have a (named) branch but no commits on it are simply ignored. For
  each "semantic group" of a branch, we'd create a branch based on
  their common ancestor, then make the changes on the children as
  necessary. For single-file commits to a branch, we could use
  heuristics (pathname analysis) to add these to a group (and log what
  we did), or we could put them in a "reject" kind of file for a human
  to tell us what to do (the human would edit a config file of some
  kind to instruct the converter).

* if we have access to the CVSROOT/history, then we could process tags
  properly. otherwise, we can only use heuristics or configuration
  info to group up tags (branches can use commits; there are no
  commits associated with tags)

* ideally, we store every bit of data from the ,v files to enable a
  complete restoration of the CVS repository. this could be done by
  storing properties with CVS revision numbers and stuff (i.e. all
  metadata not already embodied by SVN would go into properties)

* how do we track the "states"? I presume "dead" is simply deleting
  the entry from SVN. what are the other legal states, and do we need
  to do anything with them?

* where do we put the "description"? how about locks, access list,
  keyword flags, etc.

* note that using something like the SourceForge repository will be an
  ideal test case. people *move* their repositories there, which means
  that all kinds of stuff can be found in those repositories, from
  wherever people used to run them, and under whatever development
  policies may have been used.

  For example: I found one of the projects with a "permissions 644;"
  line in the "gnuplot" repository. Most RCS releases issue warnings
  about that (although they properly handle/skip the lines).
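
The mtime-based restart idea from the notes above (get the mtime
before opening the file, and restart if it changed) could be sketched
in Python 3 as follows.  This is only an illustration: read_stable and
max_attempts are hypothetical names, the retry limit is an added
safeguard that the notes do not mention, and a real converter would
wrap the parse step, not just the raw read.

```python
import os


def read_stable(path, max_attempts=5):
    """Read 'path', retrying if the file changed while we were reading."""
    for _ in range(max_attempts):
        before = os.stat(path).st_mtime
        try:
            with open(path, "rb") as f:
                data = f.read()
        except OSError:
            if os.stat(path).st_mtime != before:
                continue  # error and the mtime changed: restart the file
            raise         # error unrelated to a concurrent commit
        if os.stat(path).st_mtime == before:
            return data   # read succeeded and the file was not touched
        # read succeeded but the mtime changed: restart the file
    raise RuntimeError("file kept changing: %s" % path)
```

This keeps access read-only, as preferred in the notes, at the cost of
re-reading a file whenever a commit lands during the scan.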
