Background
I have several projects that do not use SPDX license identifiers yet. I want to add them and, if possible, convert the existing copyright and license headers. My initial thought was: "is there a tool to help convert the license to comply with SPDX?"
I looked it up.
There is one tool that closely matches my requirements, reuse, which has options to set the copyright, license identifier, and year.
$ reuse annotate \
    --recursive \
    --copyright "author <email>" \
    --year $YEAR \
    --license "license-name" \
    .
An example of the result looks like this,
@@ -1,7 +1,8 @@
+// SPDX-FileCopyrightText: 2018 M. Shulhan <ms@kilabit.info>
 // Copyright 2018 Shulhan <ms@kilabit.info>. All rights reserved.
-// Use of this source code is governed by a BSD-style
-// license that can be found in the LICENSE file.
+//
+// SPDX-License-Identifier: BSD-3-Clause
It does not remove the old copyright header; I think this is by design.
I can run sed on those files to remove the lines that start with
"// Copyright".
However, there are other problems.
If multiple files have different copyright years, I need to run another sed
command to correct the years.
And what if the file does not have a copyright year?
We need to figure out from the git history the year it was created,
$ git log --follow --format=%ad --date=format:%Y $FILE | tail -1
Some big projects, like the Linux kernel, only set the SPDX license identifier without changing the copyright. We can do that. Or, we can write another tool that helps convert the license headers.
Since this is a long holiday, let’s take the hard way: writing a tool to convert the old headers to SPDX format. This should be simple, right?
For each file in the directory:
(1) If there is a line prefixed with "// SPDX", skip the file and continue to
the next file.
(2) If there is a line prefixed with "// Copyright", capture the year,
author, and email using a regex, and replace the line with
"// SPDX-FileCopyrightText: …"
(3) If there is a line that matches "^//.*BSD-style", replace it with
"// SPDX-License-Identifier: BSD-3-Clause" and remove the line
that starts with "// license …"
(4) If there is no "// Copyright" line, get the year using the above "git log"
command and insert a new "// SPDX-FileCopyrightText: …" line using predefined
values.
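Steps (1) to (3) can be sketched in Go roughly as follows. This is only an illustration under my own assumptions: the names convertHeader and reCopyright are made up for this example, the regex only understands the "// Copyright YEAR Author <email>" shape shown earlier, and step (4), which needs the git history, is left out.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// reCopyright captures the year, author, and email from an old-style header,
// for example "// Copyright 2018 Shulhan <ms@kilabit.info>. All rights reserved."
// The accepted shape is an assumption made for this sketch.
var reCopyright = regexp.MustCompile(`^// Copyright (\d{4}),? ([^<]+) <([^>]+)>`)

// convertHeader is a hypothetical helper that applies steps (1) to (3) to
// the lines of a single file and returns the converted lines.
func convertHeader(lines []string) []string {
	// Step (1): a file that already has an SPDX line is left untouched.
	for _, line := range lines {
		if strings.HasPrefix(line, "// SPDX") {
			return lines
		}
	}
	out := make([]string, 0, len(lines))
	afterLicense := false // True right after the "BSD-style" line was replaced.
	for _, line := range lines {
		if afterLicense {
			afterLicense = false
			if strings.HasPrefix(line, "// license ") {
				continue // Step (3): drop the second line of the old notice.
			}
		}
		if m := reCopyright.FindStringSubmatch(line); m != nil {
			// Step (2): rewrite the copyright line.
			line = fmt.Sprintf("// SPDX-FileCopyrightText: %s %s <%s>",
				m[1], strings.TrimSpace(m[2]), m[3])
		} else if strings.HasPrefix(line, "//") && strings.Contains(line, "BSD-style") {
			// Step (3): replace the license notice with the identifier.
			line = "// SPDX-License-Identifier: BSD-3-Clause"
			afterLicense = true
		}
		out = append(out, line)
	}
	return out
}

func main() {
	in := []string{
		"// Copyright 2018 Shulhan <ms@kilabit.info>. All rights reserved.",
		"// Use of this source code is governed by a BSD-style",
		"// license that can be found in the LICENSE file.",
		"",
		"package main",
	}
	for _, line := range convertHeader(in) {
		fmt.Println(line)
	}
}
```

The "// license" line is only dropped when it directly follows the replaced "BSD-style" line, so an unrelated comment further down the file that happens to start with "// license" is kept.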
It turns out there is another problem.
A file can be excluded from REUSE compliance if it is ignored by git through the ".gitignore" file. That is why we wrote a parser and checker for gitignore in Go.
Specification
We use the gitignore(5) manual as the specification for the implementation.
In short, the rules are as follows:
- Each line is a pattern that is matched against a file name or path.
- An empty line is ignored.
- A line that starts with '#' is a comment, unless the '#' is escaped with a backslash '\'.
- Spaces before and after the line are ignored, unless escaped with a backslash '\'.
- The character '/' is the directory separator.
- The special character '?' in a pattern matches one character except '/'.
- The special character '*' in a pattern matches zero or more characters except '/'.
- A pattern that ends with '/' only matches a directory with the same name.
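The line-level rules (empty lines, comments, and surrounding spaces) can be sketched as a small pre-processing function. This is a minimal sketch under my own assumptions, not the code in lib/git: cleanLine is a hypothetical name, and the only escapes it understands are a leading "\#" and a trailing "\ ".

```go
package main

import (
	"fmt"
	"strings"
)

// cleanLine is a hypothetical helper that prepares one line of a gitignore
// file. It returns the pattern and true, or "" and false if the line is
// empty or a comment.
func cleanLine(line string) (pattern string, ok bool) {
	line = strings.TrimLeft(line, " \t")
	// Drop trailing spaces, but stop at a space escaped with a backslash.
	for strings.HasSuffix(line, " ") && !strings.HasSuffix(line, `\ `) {
		line = strings.TrimSuffix(line, " ")
	}
	if line == "" || strings.HasPrefix(line, "#") {
		return "", false
	}
	// An escaped '#' is a literal '#', not a comment.
	if strings.HasPrefix(line, `\#`) {
		line = line[1:]
	}
	// An escaped trailing space is kept as a literal space.
	if strings.HasSuffix(line, `\ `) {
		line = line[:len(line)-2] + " "
	}
	return line, true
}

func main() {
	for _, line := range []string{"", "# comment", `\#notes`, "build/   ", `name\ `} {
		pattern, ok := cleanLine(line)
		fmt.Printf("%q => %q %v\n", line, pattern, ok)
	}
}
```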
When reading the above rules, my first thought was that this is similar to filepath.Match.
I was wrong.
According to the example given in the manual, the pattern "foo/" matches a directory
"foo" or "a/foo"; but the result of filepath.Match is different,
fmt.Println(filepath.Match("foo/", "foo"))
fmt.Println(filepath.Match("foo/", "a/foo"))
// Output:
// false <nil>
// false <nil>
Even if we remove the trailing slash, leaving the pattern "foo", the output is still not as expected,
fmt.Println(filepath.Match("foo", "foo"))
fmt.Println(filepath.Match("foo", "a/foo"))
// Output:
// true <nil>
// false <nil>
Continuing with the rules, there are other special characters that are not in line with [filepath.Match].
- The special character '!' at the beginning of a pattern means negation. A file or directory that is excluded by a previous pattern is included again if it matches this pattern.
- The pattern "**/foo" matches any file or directory named "foo" with zero or more directories before it.
- The pattern "foo/**" matches any file or directory inside the directory "foo", but not the directory "foo" itself.
- The pattern "foo/**/bar" matches a file or directory named "bar" inside the directory "foo", with zero or more directories in between.
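The negation rule means the order of the patterns matters: in gitignore(5), the last matching pattern decides the outcome. Here is a minimal sketch of that evaluation, assuming the two lines "*.log" and "!keep.log". The rule type, isIgnored, and the hand-written regexes are stand-ins invented for this example.

```go
package main

import (
	"fmt"
	"regexp"
)

// rule is a hypothetical pair of a compiled pattern and its negation flag.
type rule struct {
	re     *regexp.Regexp
	negate bool // True if the gitignore line started with '!'.
}

// isIgnored walks the rules in file order; the last rule that matches the
// path decides whether it is ignored.
func isIgnored(rules []rule, path string) bool {
	ignored := false
	for _, r := range rules {
		if r.re.MatchString(path) {
			ignored = !r.negate
		}
	}
	return ignored
}

// sampleRules stands in for the two lines "*.log" and "!keep.log".
// The regexes are written by hand for illustration.
var sampleRules = []rule{
	{re: regexp.MustCompile(`^(.*/|/)?[^/]*\.log/?$`)},
	{re: regexp.MustCompile(`^(.*/|/)?keep\.log/?$`), negate: true},
}

func main() {
	for _, p := range []string{"debug.log", "a/debug.log", "keep.log", "main.go"} {
		fmt.Println(p, isIgnored(sampleRules, p))
	}
}
```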
Implementation
Based on the above specification, it seems that a simple [filepath.Match] or [path.Match] is not sufficient to handle the patterns.
We need to convert those patterns into regexes that comply with the above rules:
- If the pattern ends with '/', mark it as a directory and remove the trailing '/'.
- Trim the "**/" at the beginning of the pattern, since it means anything before. The pattern "**/foo" or "**/**/foo" is equal to "foo".
- Ignore the pattern if what remains is an empty string or only '*'.
- Now, we need to detect whether the pattern contains the directory separator '/'. Let's find the index and store it as $SEP_IDX for later.
- Escape the regex meta-characters '.', '+', '|', '(', and ')' with a backslash '\'.
- Replace a single character '*' with the regex "[^/]*" (accept zero or more characters except "/").
- Replace a single character '?' with the regex "[^/]" (accept one character except "/").
- Replace the string "/**/" with the regex "(/.*)?/" (accept zero or more directories in between).
- Replace the string "/**" with the regex "/(.*)" (accept everything inside a directory).
- Replace the string "**" with the regex "[^/]*" (second pass for '*').
- Back to $SEP_IDX,
  - If no directory separator is found, prepend the pattern with the regex "(/.*)?/" (accept zero or more directories before).
  - If the directory separator is at the beginning or in the middle of the pattern, prepend the pattern with the regex "^/?" (do not accept any directory before).
- If the pattern is a directory (ends with '/'), as we marked before, append the '/' back followed by '$'; otherwise append the regex "/?$" (accept a file or directory).
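The steps above can be sketched as a single pass over the pattern. This is my own reading of the steps, not the code in lib/git: toRegex is a hypothetical name, the replacements are done in one scan instead of several passes so that the output of one replacement is never rewritten by the next, and for a pattern without '/' it uses the prefix "^(.*/|/)?" so that a bare "foo" also matches at the top level. A pattern with '/' at the beginning or in the middle gets the anchored prefix "^/?". Negation and backslash escapes are not handled.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// toRegex is a hypothetical helper that converts one gitignore pattern
// into a regex string. It returns false if the pattern should be ignored.
func toRegex(pattern string) (re string, ok bool) {
	// Mark directory patterns and remove the trailing '/'.
	isDir := strings.HasSuffix(pattern, "/")
	pattern = strings.TrimSuffix(pattern, "/")
	// "**/foo" and "**/**/foo" are equal to "foo".
	for strings.HasPrefix(pattern, "**/") {
		pattern = strings.TrimPrefix(pattern, "**/")
	}
	// Nothing left, or only '*': ignore the pattern.
	if strings.Trim(pattern, "*") == "" {
		return "", false
	}
	// A separator at the beginning or in the middle anchors the pattern.
	anchored := strings.Contains(pattern, "/")
	pattern = strings.TrimPrefix(pattern, "/")

	var sb strings.Builder
	if anchored {
		sb.WriteString(`^/?`)
	} else {
		sb.WriteString(`^(.*/|/)?`)
	}
	for i := 0; i < len(pattern); {
		rest := pattern[i:]
		switch {
		case strings.HasPrefix(rest, "/**/"):
			sb.WriteString(`(/.*)?/`) // Zero or more directories in between.
			i += 4
		case strings.HasPrefix(rest, "/**"):
			sb.WriteString(`/(.*)`) // Everything inside a directory.
			i += 3
		case strings.HasPrefix(rest, "**"), rest[0] == '*':
			sb.WriteString(`[^/]*`) // Zero or more characters except '/'.
			i += len(rest) - len(strings.TrimPrefix(strings.TrimPrefix(rest, "*"), "*"))
		case rest[0] == '?':
			sb.WriteString(`[^/]`) // One character except '/'.
			i++
		case strings.IndexByte(".+|()", rest[0]) >= 0:
			sb.WriteByte('\\') // Escape regex meta-characters.
			sb.WriteByte(rest[0])
			i++
		default:
			sb.WriteByte(rest[0])
			i++
		}
	}
	if isDir {
		sb.WriteString(`/$`)
	} else {
		sb.WriteString(`/?$`)
	}
	return sb.String(), true
}

func main() {
	for _, p := range []string{"foo", "foo/", "/foo", "foo*", "foo/**/bar", "*"} {
		re, ok := toRegex(p)
		if !ok {
			fmt.Printf("%-12s ignored\n", p)
			continue
		}
		_ = regexp.MustCompile(re) // Panics if the result is not a valid regex.
		fmt.Printf("%-12s => %s\n", p, re)
	}
}
```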
For example, here is a list of patterns and their conversion to regex,
- foo or **/foo ⇒ ^(.*/|/)?foo/?$
- foo* ⇒ ^(.*/|/)?foo[^/]*/?$
- foo? ⇒ ^(.*/|/)?foo[^/]/?$
- foo/ or **/foo/ ⇒ ^(.*/|/)?foo/$
- foo/** ⇒ ^(.*/|/)?foo/(.*)/?$
- /foo ⇒ ^/?foo/?$
- /foo/ ⇒ ^/?foo/$
- foo/bar ⇒ ^(.*/|/)?foo/bar/?$
- foo/bar/ ⇒ ^(.*/|/)?foo/bar/$
- /foo/bar ⇒ ^/?foo/bar/?$
- foo/**/bar ⇒ ^/?foo(/.*)?/bar/?$
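To see how these regexes behave, a few of them can be compiled with Go's regexp package and tried against sample paths. The regexes below are copied from the list; the sample paths, the check type, and the failures function are my own picks for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// check pairs a regex from the list with paths that should and should
// not match it. The paths are chosen for illustration only.
type check struct {
	regex   string
	match   []string
	noMatch []string
}

var checks = []check{
	{`^(.*/|/)?foo/?$`, []string{"foo", "a/foo", "foo/"}, []string{"afoo", "foo/bar"}},
	{`^(.*/|/)?foo[^/]*/?$`, []string{"foobar", "a/foo.txt"}, []string{"foo/bar"}},
	{`^(.*/|/)?foo/$`, []string{"foo/", "a/foo/"}, []string{"foo"}},
	{`^/?foo/?$`, []string{"foo", "/foo"}, []string{"a/foo"}},
	{`^/?foo(/.*)?/bar/?$`, []string{"foo/bar", "foo/a/b/bar"}, []string{"foo/abar", "a/foo/bar"}},
}

// failures returns one message for every path whose match result differs
// from the expectation in checks.
func failures() []string {
	var out []string
	for _, c := range checks {
		re := regexp.MustCompile(c.regex)
		for _, p := range c.match {
			if !re.MatchString(p) {
				out = append(out, c.regex+" should match "+p)
			}
		}
		for _, p := range c.noMatch {
			if re.MatchString(p) {
				out = append(out, c.regex+" should not match "+p)
			}
		}
	}
	return out
}

func main() {
	if f := failures(); len(f) > 0 {
		fmt.Println(f)
		return
	}
	fmt.Println("all sample paths behave as expected")
}
```

Note that a regex ending with "/$" only matches a path that is written with a trailing "/", so in this sketch the caller has to mark directories that way.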
The result of the implementation can be viewed here: lib/git.
The APIs are quite simple.
First, load the ".gitignore" file from a directory using LoadGitignore(), and then check whether a path is excluded using IsIgnored().
func LoadGitignore(dir string) (ign *Gitignore)
LoadGitignore load the gitignore file inside directory dir. Any invalid
pattern will be ignored.
func (ign *Gitignore) IsIgnored(path string) bool
IsIgnored return true if the path is ignored by this Gitignore content.
The path is relative to Gitignore directory.
There is also a type IgnorePattern that one can import and use in other implementations, for example to handle the path value in the REUSE.toml annotations table.