there is no system but GNU and Linux is one of its kernels...

Crawling for License Files of Your Project Dependencies

We are currently in the process of deploying some closed source software and are facing the problem of accumulating all open source licenses for display in our user interface. As we have a whole bunch of projects written in go and nodejs I decided to write a small utility script that crawls all our projects and collects all licenses of direct dependencies.

I started with go dependencies and tried using a combination of find and yq to collect the dependencies from our glide.yaml files. This turned out to be quite unsuccessful as yq is way to bare-metal and cannot handle my holy requirements on a yaml-query language. So I ended up using yq to convert the yaml files to json and proceed with jq which is way more fancy and can do a whole lot of magic stuff.

To actually get the license content I am using the github API with some curl. This has one disadvantage as not all our dependencies are using a github import path. I am using some sed magic to convert these import paths to their github project paths which was actually quite successful. As github has a rate limiting on their API I have to login with my username and password to make all necessary requests.

I ended up with the following one-liner which I will integrate in a fancy utility script:

find -L "${FOLDER}" -name glide.yaml |
  sed -r \
      -e '/makefiles/d' \
      -e '/autobahnmeisterei/d' |
  xargs -i yq r {} 'import[*].package' -j |
  jq -r '.[]' 2>/dev/null |
  sort |
  uniq |
  sed -r \
      -e 's;golang\.org/x/(\w+);github.com/golang/\1;' \
      -e 's;k8s.io/(\w+);github.com/kubernetes/\1;' \
      -e '/robulab/d' -e '/EmbeddedEnterprises/d' |
  sed -r -e 's;github.com/([-A-Za-z0-9_/]+);\1;' |
  while read -r project
      curl -u "${username}:${password}" "https://api.github.com/repos/${project}/license" |
          jq -r '.content' |
          base64 -d |
          jq --arg project "${project}" -asR '{ project: $project, license: . }'
  done |
  jq -s '.'

After that I needed the same output for all our nodejs projects. This turns out to be way more complicated as node does not hold all dependencies in one field but rather spread into multiple dependency types. I learned about the // and * operator of jq for default values when a query results in null and merging of objects. After collecting all dependencies of a node project I had to find a way to get the content of the dependencies license files. I found no better way than searching for LICENSE files in node_modules. I am currently not using the SPDX license fields of the package.json as I need the license content. However, not all dependencies have a LICENSE file located in their repository and adding a fallback to the SPDX license text may be a valid option. Currently I am just producing a placeholder text for those dependencies.

After a lot of digging into jq syntax I ended up with the following scipt:

find "${FOLDER}" -name package.json |
  sed -r \
      -e '/node_modules/d' \
      -e '/dist/d' \
      -e '/makefiles/d' \
      -e '/Infrastructure/d' |
  while read -r project_path
    project_dir="$(dirname "${project_path}")"
    jq -r '.peerDependencies // {} * .devDependencies // {} * .dependencies // {} | keys[]' "${project_path}" |
    while read -r project
      if [[ -f "${project_dir}/node_modules/${project}/LICENSE" ]]
      elif [[ -f "${project_dir}/node_modules/${project}/LICENSE.md" ]]
      elif [[ -f "${project_dir}/node_modules/${project}/LICENSE.txt" ]]
        echo "{ \"project\": \"${project}\", \"license\": \"UNKNOWN\" }"

      jq --arg project "${project}" -asR '{ project: $project, license: . }' "${license}"
  done |
  jq -s 'unique_by(.project)'

In the end this solutions is currently good enough. I will use the produced json in my user interface to display a long list of open source licenses.