Opened 10 years ago

Last modified 9 months ago

#16373 assigned enhancement

base should maintain a persistent working copy for all supported VCS fetches

Reported by: ryandesign (Ryan Schmidt)
Owned by: raimue (Rainer Müller)
Priority: Normal
Milestone: MacPorts 2.6.0
Component: base
Version: 1.7.0
Keywords: performance fetch
Cc: jeremyhu (Jeremy Huddleston Sequoia), nerdling (Jeremy L), nonstop.server@…, cooljeanius (Eric Gallager), mojca (Mojca Miklavec), jul_bsd@…, anddam (Andrea D'Amore), Schamschula (Marius Schamschula), xuchunyang (Chunyang Xu), raimue (Rainer Müller)
Port:

Description

"fetch.type svn" is inefficient in that it checks out a new working copy every time, directly to the work area. That would be like a normal port downloading the distfile every time. Instead, we should check out a working copy to that port's distpath, and then in the extract phase we should svn export it to the work area.

Some checks will be needed in the fetch phase to ensure that an existing working copy:

  • has no modifications: check svn status. Ideally we would try to clean up the working copy, for example by svn reverting modified or added or deleted files, and then in a second svn status run, delete any unversioned files. But it's already an improvement if we just discard the working copy if svn status --ignore-externals produces any output.
  • is from the right URL: check svn info to see whether the "URL" is the one we want. If not, check that the "Repository Root" is a prefix of the URL we want. If yes, try to svn switch to the URL and revision we want; if not, discard the working copy.

So the fetch phase would go something like...

if {working copy exists} {
	if {working copy has modifications} {
		delete working copy
	}
}
if {working copy exists} {
	if {working copy url is the one we want} {
		svn update to the desired revision
	} else {
		if {working copy repository root matches beginning of desired url} {
			try to svn switch to the desired url and revision
			if {an error occurred} {
				delete working copy
			}
		} else {
			delete working copy
		}
	}
}
if {working copy doesn't exist} {
	check out working copy
}
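
The outline above can be sketched as a pure decision function. This is only an illustration in Python, not code from base (which is written in Tcl): WorkingCopy, plan_fetch and the action strings are hypothetical names, and the caller is assumed to have already run svn status and svn info on the cached working copy.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class WorkingCopy:
    """State of a cached working copy (hypothetical helper type)."""
    url: str               # "URL" field of `svn info`
    repository_root: str   # "Repository Root" field of `svn info`
    status_output: str     # output of `svn status --ignore-externals`


def is_under_root(root: str, wanted_url: str) -> bool:
    """True if wanted_url lies inside the repository rooted at root."""
    root = root.rstrip("/")
    return wanted_url == root or wanted_url.startswith(root + "/")


def plan_fetch(wc: Optional[WorkingCopy], wanted_url: str, revision: str) -> List[str]:
    """Return the ordered actions for the fetch phase (illustrative strings)."""
    actions: List[str] = []
    # Any output from `svn status` means local changes: discard the copy.
    if wc is not None and wc.status_output.strip():
        actions.append("delete")
        wc = None
    if wc is not None:
        if wc.url == wanted_url:
            actions.append(f"update -r {revision}")
        elif is_under_root(wc.repository_root, wanted_url):
            # Same repository, different path: switch in place. If the switch
            # fails at run time, the caller would delete and check out again.
            actions.append(f"switch -r {revision} {wanted_url}")
        else:
            actions.append("delete")
            wc = None
    if wc is None:
        actions.append(f"checkout -r {revision} {wanted_url}")
    return actions
```

Comparing against the repository root with a trailing slash avoids treating a sibling repository whose name merely starts with the same characters as the same repository.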

And the extract phase is simply to svn export the working copy from the distpath to the worksrcpath. (There is one problem if the working copy has externals and the user is using Subversion earlier than 1.5, for example Subversion 1.4.whatever which is included with Leopard. But rather than spend time working around this in base, I think this is a case where the port should depend on MacPorts subversion.)

Attachments (1)

xchm.Portfile (2.4 KB) - added by mojca (Mojca Miklavec) 3 years ago.
Example of a portfile that creates a tarball from git on the fly


Change History (40)

comment:1 Changed 7 years ago by jeremyhu (Jeremy Huddleston Sequoia)

Keywords: performance fetch added
Summary: svn fetch type should maintain a persistent working copy → svn git and hg fetch type should maintain a persistent working copy

This should be done for Mercurial and git as well. It's quite annoying to have to redownload sources on every debug iteration even though they haven't changed.

comment:2 Changed 7 years ago by jeremyhu (Jeremy Huddleston Sequoia)

Cc: jeremyhu@… added

Cc Me!

comment:3 Changed 7 years ago by nerdling (Jeremy L)

Cc: snc@… added

Cc Me!

comment:4 Changed 7 years ago by nonstop.server@…

Cc: nonstop.server@… added

Cc Me!

comment:5 Changed 6 years ago by cooljeanius (Eric Gallager)

Cc: egall@… added

Cc Me!

comment:6 Changed 6 years ago by mojca (Mojca Miklavec)

Cc: mojca@… added

Cc Me!

comment:7 Changed 5 years ago by jul_bsd@…

Cc: jul_bsd@… added

Cc Me!

comment:8 Changed 5 years ago by mojca (Mojca Miklavec)

I'm looking at options for git.

The following commands result in stable checksums:

git archive {shasum_or_branch} > /path/to/name_version.tar
gzip < /path/to/name_version.tar > /path/to/name_version.tar.gz

git archive {shasum_or_branch} > /path/to/name_version.tar
gzip -n /path/to/name_version.tar

git archive {shasum_or_branch} | gzip -n > /path/to/name_version.tar.gz

git archive {shasum_or_branch} | xz > /path/to/name_version.tar.xz

The first option results in a different checksum than the other two. I didn't try to understand the difference in the approaches, but in either case that would allow users to store the resulting compressed file, verify the checksums and store the file on MacPorts' server.

(Optionally the resulting file could be touched to get the same timestamp as the contents, but that's not a strict requirement.)
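
One plausible explanation for the differing checksum (an assumption, not something verified in this ticket) is that the gzip header contains a modification-time field, which gzip -n leaves out. The following Python sketch only illustrates that effect: it uses the mtime argument of gzip.compress (Python 3.8 or later), and the payload is a stand-in for real git archive output.

```python
import gzip
import hashlib


def gz_sha256(data: bytes, mtime: int) -> str:
    """Compress data with a fixed header timestamp and hash the result."""
    return hashlib.sha256(gzip.compress(data, mtime=mtime)).hexdigest()


# Stand-in for the tar stream produced by `git archive`.
payload = b"stand-in for the output of git archive\n"

# A zero timestamp corresponds to what `gzip -n` writes into the header.
stable_a = gz_sha256(payload, mtime=0)
stable_b = gz_sha256(payload, mtime=0)

# A non-zero timestamp changes the header bytes and thus the checksum.
stamped = gz_sha256(payload, mtime=1394755200)

print(stable_a == stable_b)
print(stable_a == stamped)
```

With identical input and a fixed timestamp the compressed bytes are reproducible; any variation in the stored timestamp alone is enough to change the checksum.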

comment:9 Changed 5 years ago by ryandesign (Ryan Schmidt)

I'm not sure how this relates to this ticket. The solution I'm envisioning for this issue (in Subversion parlance, though I'm sure git and hg have equivalent concepts) is maintaining a persistent working copy which would be updated and switched as needed, or in extreme cases deleted and recreated, not creating any tarball, keeping any checksums, or uploading any file to a MacPorts server.

comment:10 Changed 5 years ago by nerdling (Jeremy L)

For git, we could store the downloaded repository in the distfiles directory. If the local repo doesn't exist, git clone; if it does exist, git reset --hard && git pull.

This repo can then be locally cloned or checked out to the working directory.

comment:11 Changed 5 years ago by mojca (Mojca Miklavec)

Sure, keeping the whole repository (and cleaning it in case it turns out to be "broken" or changed in unexpected ways) would be the optimal solution, but the solution I was talking about would probably be a lot faster to implement: it would be similar to what the GitHub PortGroup does, for example. It fetches a .tar.gz file from GitHub (even though it could clone the git repo) and calculates the checksums. If the checksum matches, all is well and a copy of that file gets mirrored on one of the MacPorts servers.

The solution I suggest would:

  • check if ${distfile} exists
  • if not, clone the git repository and create a ${name.version}.tar.gz/${name.version}.tar.xz of the desired branch/tag/version in ${distpath}, delete the temporary git clone
  • verify the checksums, extract the contents as usual ...

So something similar to what GitHub and BitBucket PortGroup already do (except that those fetch the distfiles from the server already).

I mentioned this because I believe it would be relatively easy to implement and it would allow us to keep a mirror of a particular version on the server.

comment:12 Changed 5 years ago by mojca (Mojca Miklavec)

The problem is that I'm now trying to push some projects into making GitHub clones just for the sake of being able to avoid constant re-fetching of the sources from a random git repository. I would be really really grateful if MacPorts would get the ability to store the old repository and/or to mirror snapshots in the form of .tar.[gz|bz2|xz] files.

I suspect that solution would need to be implemented for each system separately anyway (different commands for svn, git and hg). I wanted to push the issue to start with git which is probably most widely used.

I would like to add a new port and I'm trying to figure out whether I should:

  • make an unofficial mirror on GitHub (in my user account)
  • deal with the pain of re-fetching from the original repository
  • or make sure that the issue gets fixed in MacPorts

I would prefer the last one.

comment:13 Changed 5 years ago by neverpanic (Clemens Lang)

Keeping a (bare repo) clone of the whole thing would speed up fetching even after a port is updated, though. Packaging tarballs wouldn't. Also we can't easily avoid the git dependency because by the time the fetch phase is started we wouldn't know whether our mirrors already had a generated tarball or we'd have to fetch from git.

I guess getting this implemented using bare clones wouldn't be so hard after all. For git, you'd have to

  • generate a unique identifier from the repository URL (e.g. using a hash function)
  • test whether $cachedir/$identifier is a valid git repository
  • create a bare clone if it isn't, run git fetch if it is
  • export the version/revision/tag you need from $cachedir/$identifier into $worksrcdir.

I think that's actually easier to implement than getting the mirroring stuff you propose into the scripts that update our distfile mirrors.
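
The four steps listed above can be sketched roughly as follows. This is a Python illustration only (base itself is Tcl), and cache_dir_for, fetch_command and export_pipeline are hypothetical names. One detail is an assumption on top of the comment: the sketch clones with --mirror, which implies --bare and also configures a fetch refspec, because a plain --bare clone does not set one up and a later git fetch would then not update its branches.

```python
import hashlib
import os
import shlex
from typing import List


def cache_dir_for(cachedir: str, url: str) -> str:
    """Map a repository URL to a stable directory name under cachedir."""
    identifier = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return os.path.join(cachedir, identifier)


def fetch_command(cachedir: str, url: str, have_valid_repo: bool) -> List[str]:
    """Command to create or refresh the cached repository.

    have_valid_repo is assumed to come from a prior check such as
    `git --git-dir=<dir> rev-parse --is-bare-repository` succeeding.
    """
    repo = cache_dir_for(cachedir, url)
    if have_valid_repo:
        return ["git", f"--git-dir={repo}", "fetch", "--prune", "origin"]
    return ["git", "clone", "--mirror", url, repo]


def export_pipeline(cachedir: str, url: str, commit: str, worksrcdir: str) -> str:
    """Shell pipeline that exports one commit into worksrcdir without a checkout."""
    repo = cache_dir_for(cachedir, url)
    return (
        f"git --git-dir={shlex.quote(repo)} archive {shlex.quote(commit)}"
        f" | tar -xf - -C {shlex.quote(worksrcdir)}"
    )
```

Hashing the URL keeps the cache directory name free of characters that are awkward in paths, at the cost of the directory name no longer being human-readable.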

comment:14 in reply to:  13 ; Changed 5 years ago by mojca (Mojca Miklavec)

Replying to cal@…:

Keeping a (bare repo) clone of the whole thing would speed up fetching even after a port is updated, though. Packaging tarballs wouldn't.

Yes, that would be a huge benefit over tarballs.

Also we can't easily avoid the git dependency because by the time the fetch phase is started we wouldn't know whether our mirrors already had a generated tarball or we'd have to fetch from git.

I don't think that getting rid of the dependency on git would be of any substantial benefit.

I guess getting this implemented using bare clones wouldn't be so hard after all. For git, you'd have to

  • generate a unique identifier from the repository URL (e.g. using a hash function)
  • test whether $cachedir/$identifier is a valid git repository
  • create a bare clone if it isn't, run git fetch if it is
  • export the version/revision/tag you need from $cachedir/$identifier into $worksrcdir.

I would also suggest to add/check the SHA sum of the commit (even when dealing with tags) just to be on the safe side.

I think that's actually easier to implement than getting the mirroring stuff you propose into the scripts that update our distfile mirrors.

I'm too clumsy when it comes to tcl (I've learnt to handle the Portfiles, but changing anything in base is still too complex for me).

I would be thrilled if someone would be willing and able to implement this.

Once that gets implemented – how would you handle GitHub and BitBucket from that point on? And how would you handle situations when the servers go offline? Would you mirror the bare repository on one of MacPorts servers? (This is of course less important.)

comment:15 in reply to:  14 Changed 5 years ago by nerdling (Jeremy L)

Replying to mojca@…:

I would also suggest to add/check the SHA sum of the commit (even when dealing with tags) just to be on the safe side.

Using commitish over tags is helpful and uniform.

Once that gets implemented – how would you handle GitHub and BitBucket from that point on? And how would you handle situations when the servers go offline? Would you mirror the bare repository on one of MacPorts servers? (This is of course less important.)

There's no need to mirror their repositories. The authors can easily host it elsewhere and we simply update the portfile.

comment:16 Changed 5 years ago by mojca (Mojca Miklavec)

I believe that both the SHA sum and the tag should be present. Tag doesn't always represent the exact version number (sometimes the version needs to be set separately for a github project anyway, but is often clear and helpful, often even for livecheck).

I wasn't talking about moving the git repositories. I meant situations when the server is not accessible for several days. Or when the sources disappear completely (there are certain tar.gz files that are only present on MacPorts mirrors and can still be installed, but are otherwise long gone from web).

Last edited 5 years ago by mojca (Mojca Miklavec)

comment:17 in reply to:  16 Changed 5 years ago by nerdling (Jeremy L)

Replying to mojca@…:

I believe that both the SHA sum and the tag should be present. Tag doesn't always represent the exact version number (sometimes the version needs to be set separately for a github project anyway, but is often clear and helpful, often even for livecheck).

And sometimes tags are never used.

I wasn't talking about moving the git repositories. I meant situations when the server is not accessible for several days. Or when the sources disappear completely (there are certain tar.gz files that are only present on MacPorts mirrors and can still be installed, but are otherwise long gone from web).

So we have two issues here: it's not a distfile, and keeping the whole repo would mean we have to manage history rewrites on our servers.

Last edited 5 years ago by nerdling (Jeremy L)

comment:18 in reply to:  16 ; Changed 5 years ago by ryandesign (Ryan Schmidt)

Replying to mojca@…:

I wasn't talking about moving the git repositories. I meant situations when the server is not accessible for several days. Or when the sources disappear completely (there are certain tar.gz files that are only present on MacPorts mirrors and can still be installed, but are otherwise long gone from web).

I consider that scenario to be outside the scope of this ticket.

If I get around to working on this issue, I would begin with the Subversion portion, since that's the version control system I'm most familiar with.

Last edited 5 years ago by ryandesign (Ryan Schmidt)

comment:19 Changed 5 years ago by mojca (Mojca Miklavec)

OK, it could be a mandatory SHA sum and an optional tag (or maybe this needs a bit of rethinking). One thing that I would also like to see supported out of the box (but is otherwise completely independent and also outside of scope of this ticket) is creating a version string like 3.14-beta-20140314-{short_SHA}. I mean: provided a full SHA string, I would like to be able to extract both date (just for "sorting" the increasing version) and a shortened version of the SHA sum.

But keeping a copy on MacPorts mirrors is definitely a lower priority than getting this functionality to work in the first place.

comment:20 in reply to:  18 Changed 5 years ago by nerdling (Jeremy L)

Replying to ryandesign@…:

If I get around to working on this issue

Could you give further guidance on this so that others who aren't as familiar with base might try to help out?

comment:21 Changed 5 years ago by neverpanic (Clemens Lang)

I'm not sure we really need a mandatory SHA sum. We currently trust git (or any other version control system) to do the right thing automatically when specifying tags (and not using github or setting fetch.type git). I'm also not sure how to implement a SHA sum of a complete source tree.

As for the version string, try git describe, it might generate what you want.

This needs to be implemented in browser:trunk/base/src/port1.0/portfetch.tcl; there are a couple of procs named portfetch::${vcs}fetch where this would have to be implemented.

comment:22 in reply to:  21 Changed 5 years ago by ryandesign (Ryan Schmidt)

Replying to cal@…:

This needs to be implemented in browser:trunk/base/src/port1.0/portfetch.tcl; there are a couple of procs named portfetch::${vcs}fetch where this would have to be implemented.

Currently, when using a non-distfile fetch.type, they fetch directly into workpath, and the extract phase does nothing; the extract phase would also have to be updated to do something.

comment:23 Changed 5 years ago by mojca (Mojca Miklavec)

Replying to cal@…:

I'm also not sure how to implement a SHA sum of a complete source tree.

One option is to generate a .tar or a .tar.[gz|bz2|xz] and calculate the checksum of that. There are other options for sure.

As for the version string, try git describe, it might generate what you want.

I meant something that would easily be accessible in Tcl, so that I could specify something like

    git.branch ...sha...
    version "3.14-beta-${git.commitdate}-${git.shortsha}"

I would need to learn how to interface git and Tcl first to implement that.
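
As a rough illustration of what such options would compute, here is a Python sketch. The options git.commitdate and git.shortsha are only proposed in this comment and are not known to exist, and snapshot_version is likewise a hypothetical name. The sketch assumes the commit's Unix timestamp has already been obtained, for example with git show -s --format=%ct.

```python
from datetime import datetime, timezone


def snapshot_version(base: str, commit_sha: str, commit_time: int, sha_len: int = 7) -> str:
    """Build a version such as '3.14-beta-20140314-0123456'.

    commit_time is the committer date as a Unix timestamp; the date part
    is rendered in UTC so the result does not depend on the local zone.
    """
    date = datetime.fromtimestamp(commit_time, tz=timezone.utc).strftime("%Y%m%d")
    return f"{base}-{date}-{commit_sha[:sha_len]}"


full_sha = "0123456789abcdef0123456789abcdef01234567"  # placeholder, not a real commit
print(snapshot_version("3.14-beta", full_sha, 1394798400))
```

Truncating the SHA to a fixed length is a simplification; git's own abbreviation lengthens the prefix when needed to keep it unique within the repository.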

comment:24 Changed 4 years ago by larryv (Lawrence Velázquez)

Cc: larryv@… added

Cc Me!

comment:25 Changed 4 years ago by larryv (Lawrence Velázquez)

Cc: larryv@… removed
Owner: changed from macports-tickets@… to larryv@…

comment:26 Changed 4 years ago by larryv (Lawrence Velázquez)

Status: new → assigned

comment:27 Changed 4 years ago by dbevans (David B. Evans)

Note this is an issue with ports that use bzr fetches as well such as inkscape-devel.

comment:28 Changed 4 years ago by larryv (Lawrence Velázquez)

Summary: svn git and hg fetch type should maintain a persistent working copy → base should maintain a persistent working copy for all supported VCS fetches

comment:29 Changed 4 years ago by anddam (Andrea D'Amore)

Cc: and.damore@… added

Cc Me!

comment:30 Changed 4 years ago by Schamschula (Marius Schamschula)

Cc: mschamschula@… added

Cc Me!

comment:31 Changed 4 years ago by xuchunyang (Chunyang Xu)

Cc: xuchunyang56@… added

Cc Me!

comment:32 Changed 3 years ago by mojca (Mojca Miklavec)

Can you please take a look at the attached Portfile for xchm? (Never mind the fact that it leads to a build error later on.)

I copy-pasted some code from portutil.tcl and portfetch.tcl. This is the relevant part:

checksums           rmd160  ... \
                    sha256  ...

use_xz              yes

pre-fetch {
    if {![file exists ${distpath}/${distname}${extract.suffix}]} {
        set git_dir ${workpath}/git

        # clone the git repository
        set options "-q"
        set cmdstring "${git.cmd} clone $options ${git.url} ${git_dir} 2>&1"
        ui_debug "Executing: $cmdstring"
        if {[catch {system $cmdstring} result]} {
            return -code error [msgcat::mc "Git clone failed"]
        }

        # create a tarball
        set xz [findBinary xz ${portutil::autoconf::xz_path}]
        set cmdstring "${git.cmd} archive ${git.branch} --prefix=${distname}/ | ${xz} > ${distpath}/${distname}${extract.suffix}"
        ui_debug "Executing: $cmdstring"
        if {[catch {system -W ${git_dir} ${cmdstring}} result]} {
            return -code error [msgcat::mc "Git archive failed"]
        }
    }
}

It works as explained months ago:

  • In case the sources are missing, it will clone the repository and make a tarball out of it and store it to ${distpath}.
  • (In extract phase the sources are extracted from the tarball, they are not taken from the git repository.)
  • The next time the sources are needed, it will simply extract everything from the tarball, with no need for a new clone or for spending precious bandwidth.
  • I assume that the buildbots / other servers would then also automatically keep a mirror of these tarballs, so it would no longer be a problem if the git repository goes offline.

However, this code should go to a portgroup (or possibly to core, depending on where the other code resides) and I'm not too comfortable writing the code for that yet.

Can someone please provide some feedback about this approach and in case that the approach sounds reasonable, possibly help me rewrite the code?

Changed 3 years ago by mojca (Mojca Miklavec)

Attachment: xchm.Portfile added

Example of a portfile that creates a tarball from git on the fly

comment:33 Changed 3 years ago by ryandesign (Ryan Schmidt)

This assumes ${distname} is sufficiently unique, which it likely isn't. This particular port sets version to include the version number, ${git.branch} and a date, but the default for distname is ${name}-${version}, and it is common for projects that fetch from git to update their commit hash while the version and of course the name stay the same. The distname should probably be changed to be ${name}-${git.branch} for git, ${name}-r${svn.revision} for subversion, etc.
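
A minimal Python sketch of that naming rule, for illustration only: vcs_distname is a hypothetical helper, and only the two patterns spelled out above are covered.

```python
def vcs_distname(name: str, fetch_type: str, ref: str) -> str:
    """Return a distname that changes whenever the fetched revision changes.

    ref is the value of git.branch for git, or svn.revision for svn.
    """
    if fetch_type == "git":
        return f"{name}-{ref}"
    if fetch_type == "svn":
        return f"{name}-r{ref}"
    raise ValueError(f"no naming rule sketched for fetch type {fetch_type!r}")


print(vcs_distname("xchm", "git", "0123abc"))
print(vcs_distname("xchm", "svn", "1234"))
```

A real implementation would presumably also need to handle refs containing characters that are unsuitable in file names, such as the slash in some branch names.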

Last edited 3 years ago by ryandesign (Ryan Schmidt)

comment:34 Changed 3 years ago by mojca (Mojca Miklavec)

See also #50708.

comment:35 Changed 3 years ago by raimue (Rainer Müller)

Cc: raimue@… added
Last edited 3 years ago by raimue (Rainer Müller)

comment:36 Changed 3 years ago by raimue (Rainer Müller)

Implementation started in ^/branches/vcs-fetch/base/.

comment:37 Changed 10 months ago by neverpanic (Clemens Lang)

Milestone: MacPorts Future → MacPorts 2.5.0

We would like to see this in 2.5.0.

comment:38 Changed 10 months ago by neverpanic (Clemens Lang)

Owner: changed from larryv to raimue

comment:39 Changed 9 months ago by neverpanic (Clemens Lang)

Milestone: MacPorts 2.5.0 → MacPorts 2.6.0

Our plan is to merge this right after the 2.5.0 branch.
