FAQ
Why ScanCode?
We could not find an existing tool (open source or commercial) meeting our needs:
usable from the command line or as a library
running on Linux, Mac and Windows
written in a higher-level language such as Python
easy to extend and evolve
accurately detecting most licenses and copyrights
How is ScanCode different from Debian licensecheck?
At a high level, ScanCode detects more licenses and copyrights than licensecheck does and reports more details about the matches. It is likely slower.
In more detail: ScanCode is a Python app using a data-driven approach (as opposed to the carefully crafted regexes that licensecheck uses):
for license scan, the detection is based on a (large) number of license full texts (~2100) and license notices, mentions and variants (~32,000) rather than on regex patterns. It detects and reports exactly where license text is found in a file. Adding more license texts improves the detection.
for copyright scan, the approach is a natural language parsing grammar; it has a few thousand tests.
licenses and copyrights are detected in texts and binaries
licenses and copyrights are also detected in structured package manifests
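To make the data-driven idea concrete, here is a toy sketch in which the "rules" are plain text entries and detection means finding their word sequences in a file. It is not ScanCode's actual matching code: the RULES table, the rule names, and the tokenize and detect helpers are all hypothetical and exist only for illustration.

```python
import re

# Hypothetical rule data: adding an entry here extends detection
# without writing any new matching code.
RULES = {
    "gpl_notice": "Licensed under the GPL",
    "mit_notice": "Permission is hereby granted, free of charge",
}


def tokenize(text):
    """Return (word, line_number) pairs for every word in the text."""
    tokens = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for word in re.findall(r"[a-z0-9]+", line.lower()):
            tokens.append((word, line_number))
    return tokens


def detect(text, rules=RULES):
    """Report where each rule's word sequence occurs in the text."""
    tokens = tokenize(text)
    words = [word for word, _ in tokens]
    matches = []
    for identifier, rule_text in rules.items():
        rule_words = [word for word, _ in tokenize(rule_text)]
        size = len(rule_words)
        for start in range(len(words) - size + 1):
            if words[start:start + size] == rule_words:
                matches.append({
                    "rule": identifier,
                    "start_line": tokens[start][1],
                    "end_line": tokens[start + size - 1][1],
                })
    return matches


sample = "Foo is a wonder piece of code.\nLicensed under\nthe GPL.\n"
print(detect(sample))
```

Because matching works on words rather than lines, the notice is found even though it is split across two lines of the sample.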
Licensecheck (available here for reference: https://metacpan.org/pod/App::Licensecheck ) is a Perl script using hand-crafted regex patterns to find typical copyright statements and about 50 common licenses. There are about 50 license detection tests.
A quick test (in July 2015, before a major refactoring, but it may still be valid) showed several things that ScanCode detects and licensecheck does not.
How can I integrate ScanCode in my application?
More specifically, does this tool provide an API that we can use to integrate with our system, trigger the license check, and use the result?
In terms of API, there are two stable entry points:
The JSON output, when you use ScanCode as a command line tool from any language or when you call the scancode.cli.scancode function from a Python script.
Otherwise, the scancode.cli.api module provides a simple function if you are only interested in calling a certain service on a given file (such as license detection or copyright detection).
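As a minimal sketch of the first entry point, the snippet below runs the command line tool from Python and reads its JSON output. It assumes that scancode is on the PATH, that `--json -` writes the results to standard output, and that each entry under "files" carries a "license_detections" list with a "license_expression" field, as in the YAML sample later in this FAQ. The helper names are hypothetical; check the layout against the output of your installed version.

```python
import json
import subprocess


def build_scan_command(path, executable="scancode"):
    """Build the command line for a license and copyright scan of path."""
    return [executable, "--license", "--copyright", "--json", "-", path]


def scan_path(path):
    """Run the scan and return the parsed JSON results (needs scancode installed)."""
    completed = subprocess.run(
        build_scan_command(path),
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(completed.stdout)


def summarize_licenses(scan):
    """Map each scanned path to its detected license expressions.

    The "files" and "license_detections" keys are an assumed layout.
    """
    summary = {}
    for entry in scan.get("files", []):
        detections = entry.get("license_detections", [])
        summary[entry.get("path")] = [
            detection.get("license_expression") for detection in detections
        ]
    return summary
```

A caller would then use something like `summarize_licenses(scan_path("tst"))` and act on the resulting dictionary.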
Can I install ScanCode in a Unicode path?
Yes, this is fully supported and tested. See https://github.com/aboutcode-org/scancode-toolkit/issues/867 for a previous bug that was preventing this.
There was a bug in virtualenv https://github.com/pypa/virtualenv/issues/457 that is now fixed and has been extensively tested for ScanCode.
The line numbers for a copyright found in a binary are weird. What do they mean?
When scanning binaries, the line numbers are just a relative indication of where a detection was found: there is no such thing as lines in a binary. The numbers reported are based on the strings extracted from the binaries, typically split into new lines at each NULL character.
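The idea can be illustrated with a simplified sketch that splits raw bytes at each NULL byte and numbers the resulting strings. This is not ScanCode's actual string extraction; the numbered_strings and find_line helpers are hypothetical and only show why such "line" numbers are relative positions rather than real lines.

```python
def numbered_strings(data):
    """Split bytes at NULL bytes and number the non-empty pieces from 1."""
    chunks = [chunk for chunk in data.split(b"\x00") if chunk.strip()]
    return [
        (number, chunk.decode("ascii", errors="replace"))
        for number, chunk in enumerate(chunks, start=1)
    ]


def find_line(data, needle):
    """Return the pseudo line number of the first string containing needle."""
    for number, text in numbered_strings(data):
        if needle in text:
            return number
    return None


# A made-up binary blob with NULL-separated strings.
blob = b"\x7fELF\x00\x00Copyright (c) 2015 Foo\x00libc.so\x00"
print(find_line(blob, "Copyright"))
```

The number returned depends entirely on how many strings precede the match, so it is only useful for locating a detection relative to the others.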
How does --license-text work exactly in ScanCode?
Is the matched text that gets included into the result exactly the lines of text
from the input file that are covered by the start_line and end_line
fields of the result? I.e., if I would post-process the input file and extract
start_line to end_line from it, would I get exactly the matched_text
contents? Or is there some more “magic” involved when populating the
matched_text field?
ScanCode is a bit smarter than just start and end line: matching is based on the words of the actual scanned text, not on its lines, so a whole line may not always be matched.
For instance, with this command:
$ echo "Foo is a wonder piece of code. Licensed under the GPL. " \
"For support contact foo@bar.com " > tst
$ scancode --license --license-text --license-text-diagnostics --yaml - tst
...
license_detections:
  - license_expression: gpl-1.0-plus
    license_expression_spdx: GPL-1.0-or-later
    matches:
      - license_expression: gpl-1.0-plus
        license_expression_spdx: GPL-1.0-or-later
        from_file: tst
        start_line: 1
        end_line: 1
        matcher: 2-aho
        score: '100.0'
        matched_length: 4
        match_coverage: '100.0'
        rule_relevance: 100
        rule_identifier: gpl_85.RULE
        rule_url: https://github.com/nexB/scancode-toolkit/tree/develop/src/licensedcode/data/rules/gpl_85.RULE
        matched_text: Foo is a wonder piece of code. Licensed under the GPL.
          For support contact foo@bar.com
        matched_text_diagnostics: Licensed under the GPL.
...
then:
matched_text is based on start_line and end_line
matched_text_diagnostics is based on the exact matched words
Note that matched_text_diagnostics also includes “tagged” gaps or extra
unmatched words highlighted between the matched words.
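For post-processing, the difference can be checked with a short sketch: extracting start_line to end_line from the input yields whole lines, which can contain more than the words that were actually matched. The helper names are hypothetical, and the sample text reproduces what the echo command above writes to the tst file.

```python
# The content of "tst" as written by the echo command above.
SCANNED_TEXT = (
    "Foo is a wonder piece of code. Licensed under the GPL.  "
    "For support contact foo@bar.com \n"
)


def extract_lines(text, start_line, end_line):
    """Return the 1-based, inclusive line range of the text."""
    lines = text.splitlines()
    return "\n".join(lines[start_line - 1:end_line])


def covers_whole_lines(text, start_line, end_line, matched_words):
    """True when the matched words are the entire content of the line range."""
    span = extract_lines(text, start_line, end_line)
    return span.strip() == matched_words.strip()


line_span = extract_lines(SCANNED_TEXT, 1, 1)
matched_words = "Licensed under the GPL."

print(matched_words in line_span)
print(covers_whole_lines(SCANNED_TEXT, 1, 1, matched_words))
```

Here the matched words are contained in the extracted line but do not cover it, which is why a consumer that needs the exact matched words should read matched_text_diagnostics rather than slice the file by line numbers.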