Re: R packages

To: brook%biology.nmsu.edu@localhost (Brook Milligan)
Subject: Re: R packages
From: Greg Troxel <gdt%ir.bbn.com@localhost>
Date: Mon, 10 Oct 2011 08:36:30 -0400
brook%biology.nmsu.edu@localhost (Brook Milligan) writes:

> They are also almost entirely identical in structure.  Most of the
> remaining bits can be derived from the DESCRIPTION file provided by
> upstream.  That is why, in fact, this tool is even reasonable.  It is
> creating highly idiomatic packages based on the already highly
> effective factoring that has been done.  Indeed, in the case of
> existing R packages the tool generates essentially identical Makefiles
> (minus manual hand tuning like USE_LANGUAGES or buildlink3 inclusion,
> but things like those are maintained when existing packages are
> updated).

That sounds fine then.  To me the main point is that as much common
subexpression elimination as is reasonable has already been done.

>  > In this case, it would be like having a python category for all the
>  > py-foo scripts, and then perl, ruby, etc..  I'd say each R-foo package
>  > should go where it should go, in the existing categories.
>
> Yes, there is definitely an analogy with the python and perl packages.
> However, I think there is a compelling argument that makes that
> analogy less useful in this case.  First, to my knowledge there are no
> tools creating python and perl packages from upstream information;

That's an artifact of the current situation.  Regardless, there are 706
py- and 1939 p5- packages.  I expect that there will be fewer R packages
than that.

> perhaps there should be, but that is another issue.  Second, it is not
> clear how to discover the appropriate category to use, as there is
> generally no corresponding information in an individual package's
> DESCRIPTION file and even if there was there is no guarantee that it
> would make sense within the context of the pkgsrc categories.  Thus,
> the tool cannot easily divine what category to use.  Third, it is

I think it is a fundamental error to adjust the pkgsrc hierarchy to
accommodate a pkg-generating tool so that it can be used without thought.
One of the core pkgsrc ideas, not often discussed, is that packages are
curated.  pkgsrc maintainers choose what to package, choose the category
(this part is weak), choose appropriate dependencies and options,
regularize the behavior into the pkgsrc layout and startup scripts,
choose when to upgrade, and thus present to users a "now this works as
it should, and it's what you should run if you haven't understood the
details" version.  I find this aspect of pkgsrc very valuable.

The person making an R package should read the description and choose a
category.  (This is perhaps a reason not to use the --recurse option.)

> important to include dependency information in the generated
> Makefiles.  If R packages are scattered about in various directories,
> then it will be needlessly difficult to find them and generate
> appropriate DEPENDS clauses.  For these reasons I feel it is

That's only a few lines of code.  I can't believe that this is really a
big problem encountered only because R is so special -- that hasn't come
up in the first ~10K packages.
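
To illustrate how small that lookup can be, here is a minimal sketch in
Python. It assumes the usual pkgsrc layout of category/package/Makefile;
the function names and the example root path are hypothetical and are not
taken from the tool under discussion.

```python
from pathlib import Path


def find_package_dirs(pkgsrc_root, pkgname):
    """Return every <category>/<pkgname> directory that contains a Makefile."""
    root = Path(pkgsrc_root)
    return sorted(
        d for d in root.glob(f"*/{pkgname}")
        if (d / "Makefile").is_file()
    )


def depends_line(pkgsrc_root, pkgname, min_version):
    """Build a DEPENDS line, or return None when the package is
    missing or found in more than one category (a human must decide)."""
    matches = find_package_dirs(pkgsrc_root, pkgname)
    if len(matches) != 1:
        return None
    category = matches[0].parent.name
    return f"DEPENDS+=\t{pkgname}>={min_version}:../../{category}/{pkgname}"
```

For example, depends_line("/usr/pkgsrc", "R-zoo", "1.7") would search
every category for an R-zoo directory, so the generated line does not
depend on all R packages living in one place.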

> appropriate to keep the R packages together in a single category as
> they are now; given how many there may be, however, it seems that a
> distinct category may be appropriate.  However, if people are happy
> having a math category dominated by R packages I suppose that is fine;
> to me it seems to be a miscategorization rather than a help, though.

Trying to save a minute of the packager's time and imposing a category
that otherwise shouldn't exist seems like a very bad tradeoff.  Packager
cycles are arguably more valuable than user cycles, but clearly not
infinitely so.  (We've probably spent as long discussing this as it
takes to choose categories for the first 100 packages, if not 500.)

>  > If it is not entirely clear what the actual text of the license is from
>  > data in the upstream distribution, upstream is broken and you should
>  > file a bug report.   Someone who says "Permission granted to copy under
>  > the BSD license." is being unclear.
>
> I am aware of that uncertainty.  I am only trying to provide a means
> of identifying automatically when the upstream terminology is
> unclear.  Whoever is creating or updating a particular package should
> be looking at the generated Makefile and should see the commented
> LICENSE clause and can then notify upstream.

OK.  My point is that if upstream is confused/unclear, that's how it is,
and pkgsrc can't fix it.  Maybe pkgsrc should cope with this better, but
it's not about R specifically.

>  > --recurse: I can see the point, but it seems like the right thing is to
>  >   default to off, and to fail with a list of the prereqs that are not
>  >   installed.
>
> The current behavior is to only create/update the explicitly requested
> packages unless --recurse is used in which case the dependencies are
> also created/updated.  Do you think it is important to list all the
> dependencies that were skipped because they were not requested?  Can
> you see an important use case for _not_ using the --recurse option?
> That is, when would you not want the dependencies created/updated?

I've done something similar, by hand, for python packages, when I
packaged tahoe-lafs.  I ended up making about 10 packages.  For each, I
had to examine it with pkglint, test the install, etc.  So being told to
deal with the next level by running the script again wouldn't be
annoying.

Creating packages for dependencies seems ok.  But you need to say where
they go, which requires showing the description to a human and having
them choose a category.
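
The "default to off, and fail with a list" behavior could look roughly
like the following Python sketch. The function names and the message
wording are hypothetical; how the tool actually learns the dependency
names is not shown here.

```python
def missing_prerequisites(required, packaged):
    """Return the required package names that are not packaged yet, sorted."""
    return sorted(set(required) - set(packaged))


def prerequisite_report(pkgname, required, packaged):
    """Return None when everything is present, otherwise a message
    naming each prerequisite the packager still has to create by hand."""
    missing = missing_prerequisites(required, packaged)
    if not missing:
        return None
    return (
        f"{pkgname}: not created; package these first "
        f"(choose a category for each): {', '.join(missing)}"
    )
```

The packager then runs the tool again for each listed name, reading its
description and picking a category before anything is generated.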

Updating is another matter.  Presumably you mean creating package foo is
found to need bar>=D, and bar is only at C (<D).   That's an entirely
different matter, because you have to go read the bar NEWS and decide if
updating from C to D (or E>D) will break existing other packages.
Perhaps R has a culture of API compatibility and this isn't such a
problem, but in general it's an issue.
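
A tool can at least detect the "bar is at C but foo needs D" case and
stop for review instead of updating silently. The Python sketch below
assumes versions are plain integers separated by "." or "-", which is a
simplification and is not the comparison pkgsrc itself performs; the
function names are hypothetical.

```python
import re


def parse_version(text):
    """Split a version such as '1.10-2' into a tuple of integers."""
    return tuple(int(part) for part in re.split(r"[.-]", text))


def needs_review(current, required_minimum):
    """True when the packaged version is older than the required minimum,
    meaning a human should read the NEWS before any update happens."""
    return parse_version(current) < parse_version(required_minimum)
```

Comparing integer tuples rather than strings keeps 1.9 ordered before
1.10, which a plain string comparison would get wrong.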

So in general I favor making the packager think about everything that
needs thought.  If I could do one package every 5 minutes (from nothing
to committed) that would be blindingly fast, and I'm not sure I'd want
to impose packages on the community with less than that much thought
anyway.

>> Confused; there is no such thing as an "MIT" license.  MIT has used a
>> number of licenses, and usually when people say "MIT license" they mean
>> "X11 license".
>
> This is one I thought I had a correct match for.  Pkgsrc does define
> licenses/mit which I take to be the same as what one describes as the
> MIT license.  If this is a problem it is unclear how any mapping is

I see this notion (that "MIT license" refers to the text of the X11
license) pretty widely, so I'm probably off in insisting that "MIT
license" is a bad term (or rather that it's ambiguous).

But reading:

  http://en.wikipedia.org/wiki/MIT_License

I end up not being sure which license is which, and the FSF makes a good
case that the term is ambiguous:
  http://www.gnu.org/licenses/license-list.html

"MIT" refers to both the expat license and the x11 license:
  http://www.gnu.org/licenses/license-list.html
and
  http://www.xfree86.org/3.3.6/COPYRIGHT2.html#3

The one in /usr/pkgsrc/licenses/mit is the expat version.  Fortunately,
they only differ in that the X11 license adds a no-use-of-name clause,
and this is below the level at which the pkgsrc licensing framework is
intended to work.

Again - if you can go to upstream and say "Your tag says 'MIT'.  Please
show me the text." and then compare it to what's in pkgsrc, all is
well.  If you can't, then you don't know the terms of the license.
This is the real issue, not what the name is in either scheme.

> possible.  Are you suggesting that the tool do some sort of textual
> comparison between some distributed license file and the pkgsrc
> licenses?  Would it be better to comment out all the LICENSE clauses

No, I am saying that the point of the tool is to do repetitive work
which does not require human judgement, and that work that needs a human
should be left for the human.   I am also saying that the notion that
everything can and should be automated is an assertion, not a
supportable conclusion.

> regardless of whether there is a plausible match?  Would it be better
> to produce two commented out LICENSE clauses, one with the upstream
> descriptor and one with the best guess from pkgsrc, to aid the
> developer?  Presumably, people have to look at the output of this

If you can't figure out the text of the license from upstream, it isn't
possible to get this right.

> stuff and make some judgements as to whether the tool did the right
> thing in any particular instance.  I hope nobody would create a
> zillion R packages and commit them without some appropriate scrutiny.

Agreed.

>  > So there perhaps one needs to maybe add a license, or (manually) wdiff
>  > to find one that's textually equivalent.
>
> The point is to discover cases that can be handled automatically and
> to indicate via a #LICENSE clause the cases that cannot.  Those will
> have to be investigated manually by whoever is working with a package.
> Yes, the upstream terminology leaves much to be desired.  I see no way
> around this but at least can more or less flag a set of cases that
> need manual intervention.  The point of being able to print out the
> mapping table is that the manual intervention might actually be to add
> something to the table for a case that is well-defined but that I have
> not yet discovered.  Perhaps your point is that all cases need enough
> manual intervention that every package should have #LICENSE.

No, I think if upstream has a tagging scheme and it can be understood,
then automatic translation is possible and entirely reasonable.   But if
a human can't undertake to understand and say "upstream X means pkgsrc
Y, and if that isn't the case then either upstream or pkgsrc will view
it as a definite bug", then you can't translate automatically.
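
As a rough sketch of that rule in Python: translate only tags that a
human has already vouched for, and emit a commented-out LICENSE line for
everything else. The table entries below are illustrative assumptions,
not a verified mapping, and the function name is hypothetical.

```python
# Illustrative only: each entry should exist because a person compared
# the upstream license text with the file in pkgsrc/licenses.
LICENSE_MAP = {
    "GPL-2": "gnu-gpl-v2",
    "GPL-3": "gnu-gpl-v3",
}


def license_line(upstream_tag):
    """Return a Makefile LICENSE line for a vouched-for tag, or a
    commented-out line that sends the packager to the license text."""
    tag = upstream_tag.strip()
    pkgsrc_name = LICENSE_MAP.get(tag)
    if pkgsrc_name is not None:
        return f"LICENSE=\t{pkgsrc_name}"
    return f"#LICENSE=\t# TODO: upstream says {tag!r}; read the license text"
```

A tag such as "BSD" is deliberately absent from the table, so it falls
through to the commented-out form and is left for a human to resolve.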

Is the R license tag a clue for their packaging system, or how licenses
are defined?  If I download a source package, is there actual license
text?  If so, that's what counts, not some metadata put on by someone
else.