Normalization – Project Haystack

Normalization

OverviewPipelineScanParseResolveTaxonifyDefxNormalize TagsInheritValidationGenerate

Overview

Normalization is the process of compiling libs to a namespace. Remember that every def is modeled as a dict (just a normal collection of name/value pairs). The defs packaged inside libs are called declared defs. Normalization will modify the tags of each def yielding a new dict which we call the effective def. Effective defs are used to compute documentation, reflection, def aware filters, etc.

Note: normalization is only required for software that wishes to build a namespace from source libs. Pre-normalized defs can be downloaded from Project-Haystack. These downloads may also be used as test cases to verify your own normalization software.

Pipeline

The process to compile libs to a namespace is composed of the following ordered steps:

  1. Scan: traverse input libs to find trio files
  2. Parse: parse each trio file to discover def and defx dicts
  3. Resolve: ensure every symbol tag maps to a def
  4. Taxonify: compute taxonomy tree of supertype/subtypes
  5. Defx: add defx tags to each declared def
  6. Normalize Tags: normalize def tags
  7. Inherit: recursively apply supertype tags into subtypes
  8. Validation: perform additional sanity checks
  9. Generate: output Namespace

Each of these steps is discussed in detail in the following sections.

Scan

The input to normalization is a list of lib files. These are the zip files that contain the declared defs. Each zip file is opened and searched for applicable Trio files in the "lib/" directory. There must be exactly one dict defined in "lib/lib.trio" with the lib's meta.

Parse

Every input Trio file found during scan is parsed to discover the declared def dicts identified with the def tag. This phase is also used to discover our def extensions which are identified by the defx tag. Every dict parsed must have either the def or defx tag and the value must be a symbol. During this phase, we detect and report illegal duplicate symbols - a given symbol must be mapped to only one def in all the input libs.

Resolve

During this phase, we walk every tag in every def and defx to resolve symbolic references. For each def, the following must resolve in its lib namespace:

  1. Every tag name must resolve to a tag def
  2. Every symbol used as a tag value must resolve
  3. Every list of symbols in a tag value must individually resolve

We compute each lib namespace by computing only the symbols in the lib's scope. These are the symbols defined by the lib itself and those imported via its depends. Dependencies are not transitive - a given depend does not imply including the referenced lib's dependencies.

Example:

// lib alpha
def: ^lib:alpha
depends: [^lib:ph]
--
def: ^alphaTag
is: ^marker

// lib beta
def: ^lib:beta
depends: [^lib:ph, ^lib:alpha]
--
def: ^betaTag
is: ^marker

// lib gamma
def: ^lib:beta
depends: [^lib:ph, ^lib:beta]

The lib namespace to use for resolution would be as follows:

alpha: all symbols in ph, alphaTag (in its own lib)
beta: all symbols in ph, betaTag (in its own lib), alphaTag (from include)
gamma: all symbols in ph, betaTag (from include)

Note that gamma does not have visibility to alphaTag because alpha is not explicitly included.

Resolution is performed on def and defx dicts separately. This allows defx dicts to reference symbols that aren't in scope of the source def.

Taxonify

All of the following steps require knowing the taxonomy tree to determine each tag's supertypes. This phase should walk through each def and recusively derive its supertypes based on each def's is tag. Specifically:

  1. compute the supertype/subtype tree based on the is tag
  2. infer the is on feature key defs
  3. verify every non-feature key has is tag with exception of following root tags: marker, val, and feature

Example:

// before
def: ^filetype:json
mime: "application/json"

// after
def: ^filetype:json
mime: "application/json"
is: [^filetype]

Defx

After the resolve and taxonify steps complete, we can apply our defx extensions to their respective defs. Each defx is used to add one or more tags to a source def identified by the defx tag. The defx mechanism provides for late binding of def metadata. Just like RDF allows anyone to add a triple to a given subject, anyone can add tags to a def using a defx.

Every defx must reference a def within its scope and must only add new tags. It is illegal for a defx to specify a tag declared by the def itself or by another defx. The exception to this rule is tags annotated as accumulate which should be aggregated into a list.

Here is example to illustrate:

// source def
def: ^date
is: ^scalar

// two defx dicts declared in separate libs
defx: ^date
acmeTerm: "ISODate"
--
defx: ^date
wombatFormatter: "DateFormatter"

// after defx are merged, effective def is
def: ^date
is: ^scalar
acmeTerm: "ISODate"
wombatFormatter: "DateFormatter"

Normalize Tags

This step normalizes specific tags according to the following rules:

  1. Add the inferrred lib tag to each def
  2. Normalize def tags which subtype from list

Every def has an inferred lib tag that is a reference to its declaring lib. It is invalid for any def to declare its own lib tag. The lib meta itself also receives a lib tag referencing itself. Example:

// before (within phIoT lib)
def: ^equip
is: ^entity

// after
def: ^equip
is: ^entity
lib: ^lib:phIoT

As a convenience, tags such as is and tagOn can use a single symbol instead of list of symbols. However, their normalized representation must always be a list. This phase should iterate every def and normalize any tag that subtypes from list. Example:

// before
def: ^geoCity
is: ^str
tagOn: ^geoPlace

// after
def: ^geoCity
is: [^str]
tagOn: [^geoPlace]

Inherit

The inherit phase applies tag inheritance from supertypes:

  1. Start with the def tags from previous steps
  2. Each supertype is processed in order of declaration in the is tag
  3. The supertype must recursively have its own inheritance normalized
  4. Filter out any tags from the supertype marked as notInherited
  5. Inherit all tags from previous step that are not yet declared nor inherited from other supertypes.
  6. If the tag is marked as accumulate then inheritance is aggregated.

In the case of ambiguity via multiple inheritance, a subtype should explicitly declare the tag value.

Here is a fictitious example for an El Camino, which is a hybrid between a car and a pickup truck:

// declarations
def: ^car
numDoors: 4
color: "red"
engine: "V8"
----
def: ^pickup
numDoors: 2
color: "blue"
bedLength: 80in
----
def: ^elCamino
is: [^pickup, ^car]
color: "purple"

The normalized definition with inheritance would be:

def: ^elCamino          // declared
is: [^pickup, ^car]     // declared
color: "purple"         // declared
numDoors: 2             // inherited from pickup first
engine: "V8"            // inherited from car
bedLength: 80in         // inherited from pickup

Validation

Before generation, normalization software should validate tag rules to flag errors not caught in previous steps:

  1. Verify every lib meta def has required tags
  2. Verify tag values match their def's declared types
  3. Verify no def named index, which is reserved for documentation purposes
  4. Verify every term in a conjunct is a marker
  5. Verify no defs declare a computed tag
  6. Verify choice of is subtype of marker
  7. Verify tagOn only used on a tag defs (not conjuncts or feature keys)
  8. Verify relationship tags are only used on defs that subtype from ref

Generate

Once all previous steps have completed successfully, we can generate our namespace. From a logical perspective, a namespace is a map of symbols to effective def dicts. In actual implementations, this probably yields specific data structures. These data structures can then be used to perform additional computations such as reflection.