ShExStatements: Documentation

ShExStatements lets users generate shape expressions (ShEx) from simple CSV statements and files. ShExStatements can also be used from the command line.

Objectives

  • Easily generate shape expressions (ShEx) from CSV files
  • Simple syntax, with 5 columns
    • Node name
    • Property
    • Allowed values
    • Cardinality (optional)
    • Comments (optional)

Setup

Clone the ShExStatements repository.


$ git clone https://github.com/johnsamuelwrites/ShExStatements.git

Go to the ShExStatements directory.


$ cd ShExStatements

Virtual Environment

Install the modules required by ShExStatements. The commands below install them into a virtual environment.

$ python3 -m venv .venv
$ source ./.venv/bin/activate
$ pip3 install .

Consider the example CSV file language.csv in the examples/ folder. The file contains an example description of a language on Wikidata and uses a comma as the delimiter between values.


wd,<http://www.wikidata.org/entity/>,,,
wdt,<http://www.wikidata.org/prop/direct/>,,,
xsd,<http://www.w3.org/2001/XMLSchema#>,,,

@language,wdt:P31,wd:Q34770,,# instance of a language
@language,wdt:P1705,LITERAL,,# native name
@language,wdt:P17,.,+,# spoken in country
@language,wdt:P2989,.,+,# grammatical cases
@language,wdt:P282,.,+,# writing system
@language,wdt:P1098,.,+,# speakers
@language,wdt:P1999,.,*,# UNESCO language status
@language,wdt:P2341,.,+,# indigenous to

There are five columns in the CSV file.

  • Column 1 specifies the node name, starting with @.
  • Column 2 specifies the property.
  • Column 3 specifies the set of allowed values.
  • Column 4 specifies the cardinality (for example + or *).
  • Column 5 holds comments. Comments start with #.

Columns 1, 2 and 3 are mandatory. Column 3 can be a special value such as . (a period, meaning 'any' value). The first three lines of the file above declare the prefixes; on those lines, columns 3, 4 and 5 are empty.
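
The column layout described above can be sketched in Python. This is only an illustration of the format, not the parser that ShExStatements itself uses; the names `Statement`, `parse_line` and `is_prefix` are hypothetical.

```python
import csv
from typing import NamedTuple, Optional


class Statement(NamedTuple):
    """One CSV line split into the five columns (hypothetical helper type)."""
    node: str
    prop: str
    value: str
    cardinality: str
    comment: str


def parse_line(line: str, delim: str = ",") -> Optional[Statement]:
    """Split one CSV line into a Statement; return None for blank lines."""
    if not line.strip():
        return None
    fields = next(csv.reader([line], delimiter=delim))
    # Pad missing optional columns (cardinality, comment) with empty strings.
    fields = (fields + [""] * 5)[:5]
    return Statement(*(f.strip() for f in fields))


def is_prefix(stmt: Statement) -> bool:
    """Prefix lines do not start with '@' and leave columns 3 to 5 empty."""
    return not stmt.node.startswith("@") and not any(stmt[2:])


stmt = parse_line("@language,wdt:P17,.,+,# spoken in country")
print(stmt.node, stmt.prop, stmt.value, stmt.cardinality, stmt.comment)
print(is_prefix(parse_line("wd,<http://www.wikidata.org/entity/>,,,")))
```

Padding to five fields keeps the two optional columns safe to read even when a line stops after column 3.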

Cardinality can be any one of the following values:

  • * : zero or more values
  • + : one or more values
  • m : exactly m values
  • m,n : between m and n values (including m and n).
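
As a rough illustration of these four forms, the sketch below maps a cardinality cell to a (min, max) pair and to a ShExC suffix. The function names are hypothetical, and the code is an assumption about how such a mapping could be written, not the tool's own implementation. A maximum of -1 stands for "unbounded", the convention used in ShExJ.

```python
from typing import Optional, Tuple


def cardinality_bounds(card: str) -> Optional[Tuple[int, int]]:
    """Return (min, max) for a cardinality cell, or None when the cell is empty."""
    card = card.strip()
    if not card:
        return None
    if card == "*":
        return (0, -1)   # zero or more
    if card == "+":
        return (1, -1)   # one or more
    if "," in card:
        low, high = (int(part) for part in card.split(","))
        return (low, high)
    exact = int(card)    # a single number m means exactly m values
    return (exact, exact)


def shexc_suffix(card: str) -> str:
    """Render the cardinality as a ShExC suffix such as '+', '{2}' or '{1,3}'."""
    card = card.strip()
    if card in ("", "*", "+"):
        return card
    return "{" + card.replace(" ", "") + "}"


for cell in ["*", "+", "2", "1,3"]:
    print(cell, cardinality_bounds(cell), shexc_suffix(cell))
```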

Run the following command for the above file.


$ ./shexstatements.sh examples/language.csv

The shape expression generated by ShExStatements will look like this:


PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
start = @<language>
<language> {
  wdt:P31 [ wd:Q34770  ] ;# instance of a language
  wdt:P1705 LITERAL ;# native name
  wdt:P17 .+ ;# spoken in country
  wdt:P2989 .+ ;# grammatical cases
  wdt:P282 .+ ;# writing system
  wdt:P1098 .+ ;# speakers
  wdt:P1999 .* ;# UNESCO language status
  wdt:P2341 .+ ;# indigenous to
}
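
To make the correspondence between a CSV line and an output line concrete, here is a minimal sketch that renders one triple constraint. It covers only the three value forms that appear in the example (a period, LITERAL, and a value set); the function `render_constraint` is hypothetical, and the spacing it produces may differ slightly from the tool's real output.

```python
def render_constraint(prop: str, value: str, card: str = "", comment: str = "") -> str:
    """Render one triple constraint line in ShExC for the cases seen in the example."""
    if value in (".", "LITERAL"):
        # '.' (any value) and the LITERAL node kind are emitted as-is.
        value_expr = value
    else:
        # Assumption: anything else is treated as a value set of one or more terms.
        value_expr = "[ " + " ".join(value.split()) + " ]"
    return f"  {prop} {value_expr}{card} ;{comment}"


print(render_constraint("wdt:P31", "wd:Q34770", "", "# instance of a language"))
print(render_constraint("wdt:P1705", "LITERAL", "", "# native name"))
print(render_constraint("wdt:P17", ".", "+", "# spoken in country"))
```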

A CSV file can use other delimiters, such as a semicolon (;). For example, the following command works with a file that uses a semicolon as the delimiter.


$ ./shexstatements.sh examples/languagedelimsemicolon.csv --delim ";"
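
The effect of choosing a delimiter can be shown with Python's standard csv module. The rows below are an assumption: they reuse two statements from language.csv with ; in place of the comma and are not copied from languagedelimsemicolon.csv. The helper `read_rows` is hypothetical.

```python
import csv
import io

# Assumed sample data: two statements from language.csv, separated by ';'.
DATA = (
    "@language;wdt:P31;wd:Q34770;;# instance of a language\n"
    "@language;wdt:P17;.;+;# spoken in country\n"
)


def read_rows(text, delim=","):
    """Split CSV text into rows using the given delimiter, skipping blank lines."""
    return [row for row in csv.reader(io.StringIO(text), delimiter=delim) if row]


for row in read_rows(DATA, delim=";"):
    print(row)
```

Reading the same text with the default comma would return each line as a single field, which is why the delimiter has to be stated explicitly.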

Sometimes users may want to include a header in the CSV file. In that case, they can use -s or --skipheader to tell the generator to skip the header (the first line of the CSV file).


$ ./shexstatements.sh --skipheader examples/languageheader.csv
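
Skipping a header amounts to discarding the first line before the statements are read, as in the sketch below. The header names are an assumption based on the column descriptions and are not the actual contents of languageheader.csv; `read_statements` is a hypothetical helper.

```python
import csv
import io

# Hypothetical file contents: the header names are assumed, not copied
# from languageheader.csv.
TEXT = (
    "Node name,Property,Allowed values,Cardinality,Comments\n"
    "@language,wdt:P31,wd:Q34770,,# instance of a language\n"
)


def read_statements(text, skip_header=False):
    """Return the CSV rows, optionally dropping the first line."""
    reader = csv.reader(io.StringIO(text))
    if skip_header:
        next(reader, None)  # discard the header; tolerate an empty file
    return [row for row in reader if row]


rows = read_statements(TEXT, skip_header=True)
print(rows[0][0])
```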

ShExJ

Use -j or --shexj to generate ShEx JSON syntax (ShExJ) instead of the default ShEx compact syntax (ShExC).


$ ./shexstatements.sh --shexj examples/language.csv

The output will be similar to:


{
  "type": "Schema",
  "start": "language",
  "shapes": [
    {
      "type": "Shape",
      "id": "language",
      "expression": {
        "type": "EachOf",
        "expressions": [
          {
            "type": "TripleConstraint",
            "predicate": "http://www.wikidata.org/prop/direct/P31",
            "valueExpr": {
              "type": "NodeConstraint",
              "values": [
                "http://www.wikidata.org/entity/Q34770"
              ]
            }
          },
          {
            "type": "TripleConstraint",
            "predicate": "http://www.wikidata.org/prop/direct/P1705",
            "valueExpr": {
              "type": "NodeConstraint",
              "nodeKind": "literal"
            }
          },
          {
            "type": "TripleConstraint",
            "predicate": "http://www.wikidata.org/prop/direct/P17",
            "min": 1,
            "max": -1
          },
          {
            "type": "TripleConstraint",
            "predicate": "http://www.wikidata.org/prop/direct/P2989",
            "min": 1,
            "max": -1
          },
          {
            "type": "TripleConstraint",
            "predicate": "http://www.wikidata.org/prop/direct/P282",
            "min": 1,
            "max": -1
          },
          {
            "type": "TripleConstraint",
            "predicate": "http://www.wikidata.org/prop/direct/P1098",
            "min": 1,
            "max": -1
          },
          {
            "type": "TripleConstraint",
            "predicate": "http://www.wikidata.org/prop/direct/P1999",
            "min": 0,
            "max": -1
          },
          {
            "type": "TripleConstraint",
            "predicate": "http://www.wikidata.org/prop/direct/P2341",
            "min": 1,
            "max": -1
          }
        ]
      }
    }
  ]
}
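
The relation between a CSV statement and its TripleConstraint object can be sketched as follows. The helpers `expand` and `triple_constraint` are hypothetical and handle only the * and + cardinalities shown in the output above; this is not the tool's own code.

```python
import json

# Prefix taken from the example CSV file.
PREFIXES = {"wdt": "http://www.wikidata.org/prop/direct/"}


def expand(curie, prefixes):
    """Expand a prefixed name such as 'wdt:P17' into a full IRI."""
    prefix, _, local = curie.partition(":")
    return prefixes[prefix] + local


def triple_constraint(prop, card, prefixes):
    """Build a ShExJ-style TripleConstraint dict for a property and cardinality."""
    constraint = {"type": "TripleConstraint", "predicate": expand(prop, prefixes)}
    bounds = {"*": (0, -1), "+": (1, -1)}.get(card)  # -1 means unbounded
    if bounds:
        constraint["min"], constraint["max"] = bounds
    return constraint


print(json.dumps(triple_constraint("wdt:P1999", "*", PREFIXES), indent=2))
```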

It is also possible to use application profiles with the following header:

Entity_name,Property,Property_label,Mand,Repeat,Value,Value_type,Annotation

Shape expressions can then be generated from such a file with the following command:

$ ./shexstatements.sh -ap --skipheader examples/languageap.csv
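
Because the application profile format starts with a header line, a quick check that a file's first line matches the expected columns can catch a wrong input format early. The sketch below is only a suggestion; `has_ap_header` is a hypothetical helper and is not part of ShExStatements.

```python
import csv

# Column names copied from the application profile header shown above.
AP_COLUMNS = [
    "Entity_name", "Property", "Property_label", "Mand",
    "Repeat", "Value", "Value_type", "Annotation",
]


def has_ap_header(first_line, delim=","):
    """Return True when the first line of a file matches the application profile header."""
    fields = next(csv.reader([first_line], delimiter=delim))
    return [field.strip() for field in fields] == AP_COLUMNS


print(has_ap_header("Entity_name,Property,Property_label,Mand,Repeat,Value,Value_type,Annotation"))
print(has_ap_header("@language,wdt:P31,wd:Q34770,,# instance of a language"))
```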

Test cases and Code Coverage

All the test cases can be run in the following manner.

$ python3 -m tests.tests

A code coverage report can also be generated by running the unit tests with the coverage tool.

$ coverage run --source=shexstatements -m unittest tests.tests
$ coverage report -m

Examples

There are example CSV files in the examples folder.