[Crm-sig] ISSUE: representing compound name strings

Conal Tuohy conal.tuohy at gmail.com
Sun Nov 25 07:06:59 EET 2018

I think it's potentially helpful to encode compound data such as personal
names using XML literals in an RDF graph, for display purposes, but not for
SPARQL querying. For efficient querying, I don't see any good alternative
to providing separate literals for the individual components of the name,
such as with "foreName", "surname", etc. properties in separate RDF triples.
I suggest that RDF encoding guidelines could recommend adopting both
practices (i.e. redundant representation both as parts and also as a whole).
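
As a rough illustration of that redundant representation, the following
SPARQL Update sketch stores one name both ways. The ex: vocabulary, the
person identifier and the surname "Example" are all hypothetical, chosen
only for the example:

```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ex:  <http://example.org/ns#>   # hypothetical vocabulary

INSERT DATA {
  ex:person1
      # the whole name, kept as an XML literal for display
      ex:persName "<persName xmlns='http://www.tei-c.org/ns/1.0'><forename>Richard</forename> <surname>Example</surname></persName>"^^rdf:XMLLiteral ;
      # the same name again, as separately queryable parts
      ex:foreName "Richard" ;
      ex:surname  "Example" .
}
```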

On Fri, 23 Nov 2018 at 03:53, Martin Doerr <martin at ics.forth.gr> wrote:

> Dear Richard, Robert,
> It is simply wrong that encoding structured data into an rdfs:Literal
> makes it invisible to SPARQL. It is exactly what xsd:dateTime does. The
> year, month, etc., is available to querying individually in SPARQL, not by
> magic but by a standard extension mechanism.

The date functions in SPARQL that allow an xsd:dateTime literal to be
parsed into months, days, etc., are not really an extension to SPARQL; they
are part of the SPARQL language standard.
Because xsd:dateTime is a standard data type in SPARQL, a SPARQL processor
can achieve efficiencies by normalizing such literals (to a standard time
zone) and using the normalized form in comparisons.
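
For instance, the standard YEAR() and MONTH() functions and the "<"
operator work directly on xsd:dateTime values. A minimal sketch, assuming
a hypothetical ex:occurredAt property:

```sparql
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX ex:  <http://example.org/ns#>   # hypothetical vocabulary

SELECT ?event ?when
WHERE {
  ?event ex:occurredAt ?when .
  # YEAR() and MONTH() are built-in SPARQL 1.1 functions on xsd:dateTime
  FILTER(YEAR(?when) = 2018 && MONTH(?when) = 11)
  # "<" compares the instants, so time zone offsets are taken into account
  FILTER(?when < "2018-11-25T07:06:59+02:00"^^xsd:dateTime)
}
```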

The SPARQL specification does allow for SPARQL implementations to have
"extension" functions, though, and to extend the operation of built-in
SPARQL operators such as "<" or "=", so hypothetically a SPARQL store might
offer XPath-evaluation functions to query inside XML literals, analogous
to the way the REGEX and REPLACE functions operate on string literals.
This kind of hybrid RDF graph/XML tree model could be supported effectively
by a SPARQL store which maintained indices of the tree structure of the XML
literal objects it contained. I believe Virtuoso actually has such a
feature, and there may well be other SPARQL engines with a similar feature,
but I personally think it would be unhelpful for the CRM to suggest an
approach that depends on such a non-standardised extension.

> It is a question to IT experts to tell us how to upload into the SPARQL
> code the respective string functions for other compounds.

The standard SPARQL string functions (including regular expressions) can be
used to parse "compound" string literals, though not to parse XML literals
in general, since XML is not a regular language. Of course the CIDOC CRM
could suggest "regular" XML encodings for particular types of compound
literals; for example a "persName" data type could be defined and
constrained with a regular expression to require that it begins with
"<persName xmlns='http://www.tei-c.org/ns/1.0'>" and ends with
"</persName>", optionally containing child elements beginning with
"<forename>" and ending with "</forename>", and even allowing these
elements to have attributes (such as 'type') drawn from a particular value
space. They could be queried using SPARQL string functions e.g. like so:

SELECT ?person
WHERE {
   ?person tei:persName ?persName.
   # STR() yields the lexical form, since CONTAINS only accepts string literals
   FILTER(CONTAINS(STR(?persName), '<forename>Richard</forename>'))
}
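
If the "<forename>" elements are allowed to carry a 'type' attribute, as
described above, a fixed-substring test no longer matches every case and a
regular expression is needed instead. A minimal sketch, assuming
single-quoted attribute values made of lowercase letters and a placeholder
binding for the tei: prefix (neither is specified in this thread):

```sparql
PREFIX tei: <http://example.org/tei#>   # placeholder binding, assumed for illustration

SELECT ?person
WHERE {
   ?person tei:persName ?persName.
   # STR() exposes the lexical form of the typed literal to REGEX;
   # the optional group tolerates a type attribute on the element
   FILTER(REGEX(STR(?persName), "<forename( type='[a-z]+')?>Richard</forename>"))
}
```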

However, relying on SPARQL FILTER and string-parsing would be grossly
inefficient in terms of query performance, compared to querying individual
properties, e.g.

SELECT ?person
WHERE {
   ?person tei:foreName 'Richard'.
}

If the "compound" XML literals are not intended for fine-grained querying,
they can still be valuable for display purposes, but I don't see much value
in constraining them beyond the general "XML literal" datatype. An
information system that understands XML literals can examine the XML and
process it appropriately based on its namespace.
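
Combining the two representations, a query can select on a component
property and return the whole XML literal for display. A sketch, reusing
the illustrative tei: property names from the queries above with a
placeholder prefix binding:

```sparql
PREFIX tei: <http://example.org/tei#>   # placeholder binding, assumed for illustration

SELECT ?person ?persName
WHERE {
   # match on the separately stored component ...
   ?person tei:foreName 'Richard' ;
           # ... and fetch the compound XML literal for display only
           tei:persName ?persName .
}
```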

Conal Tuohy