spdxconv

spdxconv is a program to convert existing licenses and copyrights into SPDX identifiers or insert new ones.

This program works in tandem with REUSE software.

Features:

Background

Converting the license and copyright in a project to become compliant with SPDX headers is very tedious work, especially if you have many files with different years, copyrights, and licenses.

This program helps to do that by using pattern matching, search, replace, and deletion.

Prerequisites

Building and installing the tool requires the Go toolchain, since the installation step uses "go install".

Installation

The following command will build and install the program into your $GOBIN directory:

$ go install git.sr.ht/~shulhan/spdxconv/cmd/spdxconv@latest

To check the value of $GOBIN, run:

$ go env GOBIN

Usage

Converting to SPDX is a trial-and-error task. This program does not guarantee that the conversion will succeed in one cycle. To help with this, we provide three commands: init, scan, and apply.

The init command creates the spdxconv.cfg configuration in the current directory. This configuration file teaches the program how to scan and apply the license and copyright.

The scan command lists the files that need SPDX identifiers converted or inserted, and writes the list to a file named spdxconv.report. Users can then inspect and modify the report to choose which files to process.

The apply command reads spdxconv.report and applies the license and copyright as stated.

Users can repeat the cycle of editing spdxconv.cfg, running scan, and running apply until they are satisfied with the result.

The init command

The first thing to do is to generate the configuration file using:

$ spdxconv init

This creates the spdxconv.cfg file in the current directory with the following content (subject to change in the future):

[default]
license_identifier =
copyright_year =
file_copyright_text =
max_line_match = 10

[match-file-comment]
pattern = "^.*\\.(adoc|asciidoc|c|cc|cpp|cs|dart|go|h|hh|hpp|java|js|jsx)$"
pattern = "^.*\\.(jsonc|kt|kts|php|rs|sass|scss|swift|ts|tsx)$"
pattern = "^(.*/)?(go.mod|go.work)$"
prefix = "//"

[match-file-comment]
pattern = "^.*\\.(aff|aww|bash|csh|d2|dockerfile|env|gitignore|gitmodules|hcl|ipynb)$"
pattern = "^.*\\.(make|pl|pm|py|ps1|rb|sh|tf|toml|yaml|yml|zsh)$"
pattern = "^(.*/)?([Dd]ockerfile|[Mm]akefile|robots.txt)$"
# systemd.unit(5).
pattern = "^.*\\.(automount|device|mount|path|scope|service|slice|socket|swap|target|timer)$"
prefix = "#"

[match-file-comment]
pattern = "^.*\\.(css)$"
prefix = "/*"
suffix = "*/"

[match-file-comment]
pattern = "^.*\\.(fxml|gohtml|htm|html|html5|kml|markdown|md|xml)$"
prefix = "<!--"
suffix = "-->"

[match-file-comment]
pattern = "^.*\\.(lua|sql)$"
prefix = "--"

[match-file-comment]
pattern = "^.*\\.(rst)$"
prefix = ".."

[match-file-comment]
pattern = "^.*\\.(tex)$"
prefix = "%"

# File name that match with this pattern will have the ".license" file
# created.
[match-file-comment]
pattern = "^.*\\.(apk|app|bz2|exe|gz|tar|tgz|zip)$"
pattern = "^.*\\.(csv|doc|docx|json|pdf|ppt|pptx|xls|xlsx)$"
pattern = "^.*\\.(bmp|gif|ico|jpeg|jpg|png|svg|svgz|webp)$"
pattern = "^.*\\.(3gp|avi|flv|mkv|mp3|mp4|mpeg|mpg|mpg4)$"
pattern = "^.*\\.(acc|ogg|mp3)$"
pattern = "^(.*/)?(go.sum|go.work.sum)$"

[match-license]
pattern = "^(//+|#+|/\\*+|<!--+|--+)?\\s*(.*)governed by a BSD-style(.*)$"
license_identifier = BSD-3-Clause
delete_line_before = "^(//+|#+|/\\*+|<!--+|--+)$"
delete_line_after = "^(//+|#+|/\\*+|<!--+|--+)?\\s*license that can(.*)$"
delete_line_after = "^(//+|#+|\\*+/|--+>|--+)$"

[match-copyright]
pattern = "^(//+|#+|/\\*+|<!--+|--+)?\\s*Copyright\\s+(?<year>\\d{4}),?\\s+(?<author>.*)\\s+<(?<contact>.*)>.*$"
delete_line_before = "^(//+|#+|/\\*+|<!--+|--+)$"
delete_line_after = "^(//+|#+|\\*+/|--+>|--+)$"

The configuration uses the INI file format.

You must fill in the [default] section before running other commands.

You can add match-file-comment, match-license, and match-copyright sections as required, or modify the existing ones to match your use case.

For quick reference, here are several rules that you need to be aware of:

The following subsections explain the content of the configuration file and how it affects the program during scan and apply.

The default section

This section defines the default license identifier, year, and copyright text to be inserted into a file if no match-license or match-copyright section matches.

The license_identifier sets the default license using one of SPDX license identifiers from https://spdx.org/licenses/ . For example, GPL-3.0-only.

The copyright_year sets the default year to be used in SPDX-FileCopyrightText. The year can be a single year (for example, "2026"), a range of years (for example, "2000-2026"), or a list of years separated by commas (for example, "2000,2001,2026"), as long as it contains no spaces.
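The three accepted forms can be checked with a single regular expression. The following is a minimal sketch, not the program's actual validation; the names yearPattern and validCopyrightYear are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// yearPattern is a hypothetical validator for the three forms described
// above: a single year, a range of years, or a comma-separated list of
// years, all without spaces.
var yearPattern = regexp.MustCompile(`^\d{4}(-\d{4}|(,\d{4})+)?$`)

// validCopyrightYear reports whether s has one of the accepted forms.
func validCopyrightYear(s string) bool {
	return yearPattern.MatchString(s)
}

func main() {
	samples := []string{"2026", "2000-2026", "2000,2001,2026", "2000, 2026"}
	for _, s := range samples {
		fmt.Printf("%q valid=%v\n", s, validCopyrightYear(s))
	}
}
```

Only the last sample is rejected, because it contains a space.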

The file_copyright_text sets the default author and contact in SPDX-FileCopyrightText. For example, "John Doe <john.doe@example>".

You should fill in license_identifier, copyright_year, and file_copyright_text before running the other commands.

The max_line_match defines the number of lines at the top and at the bottom of the file that are searched for SPDX-* identifiers, match-license patterns, and match-copyright patterns before the program inserts the default values. The default value is 10.
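To illustrate what "top and bottom" means for a file of a given length, here is a minimal sketch. The function searchWindow is hypothetical, and trimming the bottom window so it never overlaps the top one is an assumption, not documented behaviour.

```go
package main

import "fmt"

// searchWindow returns the lines at the top and at the bottom of a file
// that would be searched, each at most n lines long.
// Assumption: for short files the bottom window is trimmed so the same
// line is never searched twice.
func searchWindow(lines []string, n int) (top, bottom []string) {
	if n < 0 {
		n = 0
	}
	if n > len(lines) {
		n = len(lines)
	}
	top = lines[:n]
	start := len(lines) - n
	if start < n {
		start = n
	}
	bottom = lines[start:]
	return top, bottom
}

func main() {
	lines := []string{"l1", "l2", "l3", "l4", "l5"}
	top, bottom := searchWindow(lines, 2)
	fmt.Println("top:", top, "bottom:", bottom)
}
```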

The match-file-comment section

The first thing the program does is detect which comment prefix and suffix to use when inserting SPDX identifiers.

For each pattern in the "match-file-comment" section, the program will match it against the file name to get the comment prefix and suffix.

Users can add their own "match-file-comment" sections or modify the existing ones.

A "match-file-comment" section can have an empty prefix and suffix. In that case, if the file name matches, the program creates a new file with a ".license" suffix containing the SPDX identifiers, instead of inserting them into the file directly.

If the file name does not match any of the "match-file-comment" patterns, the file is flagged as "unknown".
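The lookup described in this section can be sketched as follows. This is an illustration only: the commentRule type and the lookup and target functions are hypothetical, the rules are a small subset of the generated configuration, and returning the first matching rule is an assumption.

```go
package main

import (
	"fmt"
	"regexp"
)

// commentRule mirrors one [match-file-comment] section (hypothetical type).
type commentRule struct {
	patterns []*regexp.Regexp
	prefix   string
	suffix   string
}

func compile(exprs ...string) []*regexp.Regexp {
	var out []*regexp.Regexp
	for _, e := range exprs {
		out = append(out, regexp.MustCompile(e))
	}
	return out
}

// rules holds three patterns copied from the generated configuration.
var rules = []commentRule{
	{patterns: compile(`^.*\.(adoc|asciidoc|c|cc|cpp|cs|dart|go|h|hh|hpp|java|js|jsx)$`), prefix: "//"},
	{patterns: compile(`^.*\.(css)$`), prefix: "/*", suffix: "*/"},
	// Empty prefix and suffix: a ".license" file is created instead.
	{patterns: compile(`^.*\.(bmp|gif|ico|jpeg|jpg|png|svg|svgz|webp)$`)},
}

// lookup returns the comment prefix and suffix for a file name.
// Assumption: the first rule with a matching pattern wins.
func lookup(name string) (prefix, suffix string, found bool) {
	for _, rule := range rules {
		for _, re := range rule.patterns {
			if re.MatchString(name) {
				return rule.prefix, rule.suffix, true
			}
		}
	}
	return "", "", false
}

// target returns the file that receives the SPDX identifiers, or an
// empty string when the file is "unknown".
func target(name string) string {
	prefix, suffix, found := lookup(name)
	switch {
	case !found:
		return ""
	case prefix == "" && suffix == "":
		return name + ".license"
	}
	return name
}

func main() {
	for _, name := range []string{"main.go", "style.css", "logo.png", "notes.xyz"} {
		fmt.Printf("%s -> %q\n", name, target(name))
	}
}
```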

The match-license section

After the program detects the file comment syntax to use, it searches for a line that contains "SPDX-License-Identifier:".

If there is a match at the top or bottom of the file, the license scan stops and the program continues with the copyright.

If there is no match, it searches for a line that matches the "pattern" regular expression. If a line matches, the value in "match-license::license_identifier" replaces the "default::license_identifier" value.

If "delete_line_before" or "delete_line_after" is defined, the program searches for those patterns before and after the matched line and deletes the lines that match. Each can be defined zero or more times.
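The lookup order in this section (existing SPDX line first, then the configured pattern, then the default) can be sketched as below. The function findLicense is hypothetical, the 1-based line numbers are an assumption, and the handling of delete_line_before and delete_line_after is left out. The pattern is the BSD-style one from the generated configuration.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// bsdPattern is the match-license pattern from the generated configuration.
var bsdPattern = regexp.MustCompile(
	`^(//+|#+|/\*+|<!--+|--+)?\s*(.*)governed by a BSD-style(.*)$`)

// findLicense inspects the given lines and returns a status ("exist",
// "match", or "default"), the license identifier to use (empty for
// "exist"), and the 1-based line number of the hit (0 for "default").
func findLicense(lines []string, defaultID string) (status, id string, lineNo int) {
	for i, line := range lines {
		if strings.Contains(line, "SPDX-License-Identifier:") {
			return "exist", "", i + 1
		}
	}
	for i, line := range lines {
		if bsdPattern.MatchString(line) {
			return "match", "BSD-3-Clause", i + 1
		}
	}
	return "default", defaultID, 0
}

func main() {
	lines := []string{
		"// Copyright 2022, John Doe <john.doe@email>. All rights reserved.",
		"// Use of this source code is governed by a BSD-style",
		"// license that can be found in the LICENSE file.",
	}
	status, id, lineNo := findLicense(lines, "GPL-3.0-only")
	fmt.Println(status, id, lineNo)
}
```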

The match-copyright section

The match-copyright section defines the pattern to match old copyright text. The regex must contain named groups to capture the copyright year, author, and contact.

If no copyright year is found in the file, the program derives the year from the date of the first commit in the file's history using the Source Code Management (SCM) tool. With git, it runs "git log --follow file".
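One way to obtain the year of the first commit is sketched below. Only "git log --follow" comes from the description above; the "--format=%ad" and "--date=format:%Y" options, and the function names firstYear and yearFromGit, are assumptions made for this illustration and may differ from what the program does.

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

// firstYear returns the last entry of a log output that lists one year
// per commit, newest first. That entry belongs to the first commit.
func firstYear(logOutput string) string {
	years := strings.Fields(logOutput)
	if len(years) == 0 {
		return ""
	}
	return years[len(years)-1]
}

// yearFromGit asks git for the author year of every commit that touched
// path, following renames, and returns the oldest one.
func yearFromGit(path string) (string, error) {
	cmd := exec.Command("git", "log", "--follow",
		"--format=%ad", "--date=format:%Y", "--", path)
	out, err := cmd.Output()
	if err != nil {
		return "", err
	}
	return firstYear(string(out)), nil
}

func main() {
	year, err := yearFromGit("README")
	if err != nil {
		fmt.Println("git log failed:", err)
		return
	}
	fmt.Println("first commit year:", year)
}
```

An empty result means the file has no commit history, in which case the default copyright year would apply.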

For example, given the following old copyright text,

// Copyright 2022, John Doe <john.doe@email>. All rights reserved.

we can capture the year, author, and contact using the following regex,

"^//+\\s*Copyright\\s+(?<year>\\d{4}),?\\s+(?<author>.*)\\s+<(?<contact>.*)>.*$"
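The capture can be tried out with Go's regexp package. In this sketch the backslashes are written once because the pattern is a raw string, and the groups are spelled (?P<name>...), which every Go release accepts; Go 1.22 and later also accept the (?<name>...) form used in the configuration. The function parseCopyright is hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

var copyrightPattern = regexp.MustCompile(
	`^//+\s*Copyright\s+(?P<year>\d{4}),?\s+(?P<author>.*)\s+<(?P<contact>.*)>.*$`)

// parseCopyright returns the named groups captured from line, or nil
// when the line does not match.
func parseCopyright(line string) map[string]string {
	m := copyrightPattern.FindStringSubmatch(line)
	if m == nil {
		return nil
	}
	fields := make(map[string]string)
	for i, name := range copyrightPattern.SubexpNames() {
		if name != "" {
			fields[name] = m[i]
		}
	}
	return fields
}

func main() {
	f := parseCopyright("// Copyright 2022, John Doe <john.doe@email>. All rights reserved.")
	fmt.Println("year:", f["year"])
	fmt.Println("author:", f["author"])
	fmt.Println("contact:", f["contact"])
}
```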

The match-copyright section can also contain zero or more delete_line_before and delete_line_after patterns.

The scan command

The scan command recursively scans the current directory for files that need SPDX identifiers converted or inserted. The result is stored in a report file named "spdxconv.report". No other files are modified by the scan.

Users can inspect and modify the report to change the behaviour of the apply command. Deleting a line in the report excludes that file from being processed.

The scan command works in the following way,

(0) Skip the file if it is ignored by git or already annotated in the REUSE.toml configuration.

(1) Check the file for SPDX-License-Identifier and SPDX-FileCopyrightText. If both exist, skip the file.

(2) If SPDX-License-Identifier line does not exist, find the old license using the match-license sections.

For each match-license in the configuration,

(2.1) If there is a match, record it as "match" with its line number in the report.

(2.2) If there is no match, use the default license from the configuration and record it as "default" with "0" as the line number in the report.

(3) If SPDX-FileCopyrightText line does not exist, find the old copyright text using the match-copyright sections.

For each match-copyright in the configuration,

(3.1) If there is a match, get the year, author, and contact, and record it as "match" with its line number in the report.

If the year is empty, try to get the year from the first commit of the file using the "git log --follow ..." command. If there is no commit history or the project is not using git, use the default copyright year from the configuration.

(3.2) If there is no match, use the default copyright year and text from the configuration, and record it as "default" in the report.

The spdxconv.report file format

Each line in the report file is formatted using CSV and has several columns separated by comma,

path "," license_id "," idx_license_id "," year "," copyright_id ","
    idx_copyright_id "," comment_prefix "," comment_suffix

where each column has the following values,

path              = { unicode_char }

license_id        = "default" | "exist" | "match"
idx_license_id    = 1 * decimal_digit

year              = single_year { "," single_year }
                  | single_year "-" single_year

single_year       = 4 * decimal_digit

copyright_id      = "default" | "exist" | "match"
idx_copyright_id  = 1 * decimal_digit

The path column defines the path to the file.

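The license_id column defines the license identifier to be used. The value is either "default" (the default license from the configuration is used), "exist" (the file already has an SPDX-License-Identifier line), or "match" (a match-license pattern matched).

The idx_license_id defines the line number in the file when license_id is "exist" or "match". A positive value means the match was found at the top of the file, and a negative value means the match was found at the bottom.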
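The sign convention can be turned into a slice index as sketched below. The function lineIndex is hypothetical, and the exact numbering is an assumption: the sketch treats 1 as the first line, -1 as the last line, and 0 as "no line".

```go
package main

import "fmt"

// lineIndex converts a line number from the report into a 0-based index
// into a slice of total lines.
// Assumptions: 1 is the first line, -1 is the last line, and 0 (used
// with "default") refers to no line at all.
func lineIndex(idx, total int) (int, bool) {
	switch {
	case idx > 0 && idx <= total:
		return idx - 1, true
	case idx < 0 && -idx <= total:
		return total + idx, true
	}
	return 0, false
}

func main() {
	for _, idx := range []int{1, 3, -1, -2, 0} {
		i, ok := lineIndex(idx, 10)
		fmt.Printf("idx=%d -> index=%d ok=%v\n", idx, i, ok)
	}
}
```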

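The year column defines the copyright year for the work. The value is either a single year, a list of years separated by commas, or a range of years.

The copyright_id defines the author and contact. The value is either "default" (the default copyright text from the configuration is used), "exist" (the file already has an SPDX-FileCopyrightText line), or "match" (a match-copyright pattern matched).

The idx_copyright_id defines the line number in the file when copyright_id is "exist" or "match". A positive value means the match was found at the top of the file, and a negative value means the match was found at the bottom.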

The comment_prefix and comment_suffix contain the prefix and suffix used as comment in the file.
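A report line with these eight columns can be read with Go's encoding/csv package. This is a minimal sketch: the reportRow type and the parseRow function are hypothetical, and the sample line is made up to follow the grammar above rather than copied from a real report.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strconv"
	"strings"
)

// reportRow holds the eight columns of one report line (hypothetical type).
type reportRow struct {
	Path           string
	LicenseID      string
	IdxLicenseID   int
	Year           string
	CopyrightID    string
	IdxCopyrightID int
	CommentPrefix  string
	CommentSuffix  string
}

// parseRow parses one CSV line of the report into a reportRow.
func parseRow(line string) (reportRow, error) {
	r := csv.NewReader(strings.NewReader(line))
	r.FieldsPerRecord = 8 // reject lines with a different column count
	rec, err := r.Read()
	if err != nil {
		return reportRow{}, err
	}
	idxLicense, err := strconv.Atoi(rec[2])
	if err != nil {
		return reportRow{}, fmt.Errorf("idx_license_id: %w", err)
	}
	idxCopyright, err := strconv.Atoi(rec[5])
	if err != nil {
		return reportRow{}, fmt.Errorf("idx_copyright_id: %w", err)
	}
	return reportRow{
		Path: rec[0], LicenseID: rec[1], IdxLicenseID: idxLicense,
		Year: rec[3], CopyrightID: rec[4], IdxCopyrightID: idxCopyright,
		CommentPrefix: rec[6], CommentSuffix: rec[7],
	}, nil
}

func main() {
	// Made-up sample line following the column order above.
	row, err := parseRow("cmd/main.go,match,2,2022,match,1,//,")
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("%+v\n", row)
}
```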

The spdxconv.report file groups

Files are collected into four groups: regular, binary, unknown, and done. Each group is separated by a line prefixed with "//spdxconv:" in the report:

//spdxconv:regular
...
//spdxconv:binary
...
//spdxconv:unknown
...
//spdxconv:done
...

Regular group: Files where the program can detect the comment syntax. The program inserts the new SPDX identifiers into the file using the comment syntax.

Binary group: Non-text files, for example images (like jpg, png) or executable files. The program creates a separate "$name.license" file for each of them and inserts the new SPDX identifiers there, as defined in the report.

Unknown group: Files where the program cannot detect the comment syntax. These files will not be processed; they are listed so the user can inspect them, modify the configuration, and rerun the scan command in the next cycle.

Done group: Files that already have SPDX identifiers. Files in the regular and binary groups that have been applied are moved here.
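Splitting a report into its groups only requires watching for the "//spdxconv:" marker lines. The sketch below is illustrative; the function splitGroups is hypothetical, and skipping blank lines is an assumption.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

const groupPrefix = "//spdxconv:"

// splitGroups returns the report lines keyed by group name. Lines that
// appear before the first marker are ignored, as are blank lines
// (both are assumptions of this sketch).
func splitGroups(report string) map[string][]string {
	groups := make(map[string][]string)
	current := ""
	sc := bufio.NewScanner(strings.NewReader(report))
	for sc.Scan() {
		line := sc.Text()
		if strings.HasPrefix(line, groupPrefix) {
			current = strings.TrimPrefix(line, groupPrefix)
			continue
		}
		if current == "" || strings.TrimSpace(line) == "" {
			continue
		}
		groups[current] = append(groups[current], line)
	}
	return groups
}

func main() {
	// The entries are shortened to bare paths for readability.
	report := "//spdxconv:regular\nmain.go\nlib.go\n" +
		"//spdxconv:binary\nlogo.png\n" +
		"//spdxconv:unknown\n//spdxconv:done\n"
	groups := splitGroups(report)
	for _, name := range []string{"regular", "binary", "unknown", "done"} {
		fmt.Println(name, len(groups[name]))
	}
}
```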

The apply command

The apply command reads the spdxconv.report and applies the license and copyright to the files as stated.

Any failed operations will be logged to stdout.

Once a file from the regular or binary group is successfully processed, it is moved to the done group.
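The lines written by apply combine the comment prefix and suffix from the report with the two SPDX identifiers. The following sketch shows one plausible formatting; the function spdxLines is hypothetical, and the order of the two lines and the single space around the comment markers are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// spdxLines formats the license and copyright as comment lines.
// With an empty prefix and suffix the lines are returned bare, as they
// would appear in a ".license" file.
func spdxLines(prefix, suffix, license, year, text string) []string {
	wrap := func(body string) string {
		var parts []string
		if prefix != "" {
			parts = append(parts, prefix)
		}
		parts = append(parts, body)
		if suffix != "" {
			parts = append(parts, suffix)
		}
		return strings.Join(parts, " ")
	}
	return []string{
		wrap("SPDX-License-Identifier: " + license),
		wrap("SPDX-FileCopyrightText: " + year + " " + text),
	}
}

func main() {
	lines := spdxLines("//", "", "GPL-3.0-only", "2026", "John Doe <john.doe@example>")
	fmt.Println(strings.Join(lines, "\n"))
}
```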

License

This software is licensed under GPL-3.0-only. See the file LICENSE for full text.

References

Links