I think it can be helpful to encode compound data such as personal names as XML literals in an RDF graph, for display purposes, but not for SPARQL querying. For efficient querying, I don't see any good alternative to providing separate literals for the individual components of the name, for example with "foreName", "surname", etc. properties in separate RDF triples. I suggest that RDF encoding guidelines could recommend adopting both practices (i.e. a redundant representation, both as parts and as a whole).

On Fri, 23 Nov 2018 at 03:53, Martin Doerr <martin@ics.forth.gr> wrote:
Dear Richard, Robert,

It is simply wrong that encoding structured data into an rdfs:Literal makes it invisible to SPARQL. It is exactly what xsd:dateTime does. The year, month, etc., is available to querying individually in SPARQL, not by magic but by a standard extension mechanism.

The date functions in SPARQL that allow an xsd:dateTime literal to be broken down into year, month, day, etc. are not really an extension to SPARQL; they are part of the SPARQL language standard:
<https://www.w3.org/TR/2013/REC-sparql11-query-20130321/#func-date-time>. Because xsd:dateTime is a standard data type in SPARQL, a SPARQL processor can achieve efficiencies by normalizing the values (to a standard time zone) and using the normalized form in comparisons.
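To illustrate the normalization point, here is a minimal Python sketch (standard library only, and not a description of how any particular SPARQL processor is implemented). It converts two dateTime lexical forms to UTC so that they can be compared, and then reads off the individual components. The function name and the sample values are made up for the example.

```python
from datetime import datetime, timezone


def normalize_datetime(lexical: str) -> datetime:
    """Parse a dateTime lexical form with an explicit UTC offset and
    convert it to UTC (the "normalized form")."""
    return datetime.fromisoformat(lexical).astimezone(timezone.utc)


# Two different lexical forms for the same instant (hypothetical sample data).
first = normalize_datetime("2018-11-23T03:53:00+02:00")
second = normalize_datetime("2018-11-23T12:53:00+11:00")

# After normalization the two forms are identical, so an equality or
# ordering test is a simple comparison of the normalized values.
same_instant = first.isoformat() == second.isoformat()

# The components are then available individually, much as with the
# SPARQL YEAR(), MONTH() and DAY() functions.
components = (first.year, first.month, first.day)
```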

The SPARQL specification does allow SPARQL implementations to provide "extension" functions, though, and to extend the operation of built-in SPARQL operators such as "<" or "=", so hypothetically a SPARQL store might offer XPath-evaluation functions to query inside XML literals, analogous to the way the REGEX and REPLACE functions operate on string literals. This kind of hybrid RDF graph/XML tree model could be supported effectively by a SPARQL store which maintained indices of the tree structure of the XML literals it contained. I believe Virtuoso actually has such a feature, and there may well be other SPARQL engines with something similar, but I personally think it would be unhelpful for the CRM to suggest an approach that depends on such a non-standardised extension.
 
It is a question to IT experts to tell us how to upload into the SPARQL code the respective string functions for other compounds.

The standard SPARQL string functions (including regular expressions) can be used to parse "compound" string literals, though not to parse XML literals in general, since XML is not a regular language. Of course the CIDOC CRM could suggest "regular" XML encodings for particular types of compound literals; for example a "persName" data type could be defined and constrained with a regular expression to require that it begins with "<persName xmlns='http://www.tei-c.org/ns/1.0'>" and ends with "</persName>", optionally containing child elements beginning with "<forename>" and ending with "</forename>". These elements could even carry attributes (such as 'type') drawn from a particular value space. Such literals could be queried using SPARQL string functions, for example:

SELECT ?person
WHERE {
   ?person tei:persName ?persName.
   # STR() yields the lexical form, since CONTAINS expects string arguments
   FILTER(CONTAINS(STR(?persName), '<forename>Richard</forename>'))
}
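
As a rough illustration of the kind of "regular" persName encoding described above, the following Python sketch checks a lexical form against a regular expression. The pattern, the permitted 'type' values and the sample name are all assumptions invented for the example, not anything defined by the TEI or the CRM.

```python
import re

# Hypothetical constraint: a flat TEI persName whose children are limited to
# forename (with an optional 'type' attribute from a small, made-up value
# space) and surname, with no deeper nesting.
PERS_NAME = re.compile(
    r"<persName xmlns='http://www\.tei-c\.org/ns/1\.0'>"
    r"(?:"
    r"<forename(?: type='(?:first|middle)')?>[^<]*</forename>"
    r"|<surname>[^<]*</surname>"
    r"|[^<]"  # text between the child elements
    r")*"
    r"</persName>"
)


def is_valid_pers_name(lexical: str) -> bool:
    """Return True if the whole lexical form matches the constrained pattern."""
    return PERS_NAME.fullmatch(lexical) is not None


flat = is_valid_pers_name(
    "<persName xmlns='http://www.tei-c.org/ns/1.0'>"
    "<forename type='first'>Richard</forename> <surname>Example</surname>"
    "</persName>"
)
nested = is_valid_pers_name(
    "<persName xmlns='http://www.tei-c.org/ns/1.0'>"
    "<forename><hi>Richard</hi></forename>"
    "</persName>"
)
```

Here the flat form is accepted and the nested one is rejected, which is the sense in which such an encoding is "regular": arbitrary nesting has to be ruled out for a regular expression to cope with it.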

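However, relying on SPARQL FILTER and string parsing would be grossly inefficient in terms of query performance compared to querying individual properties, since the filter typically has to scan the text of every candidate literal, whereas a triple pattern can be answered from the store's indices, e.g.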

SELECT ?person
WHERE {
   ?person tei:foreName 'Richard'.
}

If the "compound" XML literals are not intended for fine-grained querying, they can still be valuable for display purposes, but I don't see much value in constraining them beyond the general "XML literal" datatype. An information system that understands XML literals can examine the XML and process it appropriately based on its namespace.
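
For instance, a display layer might handle XML literals along these lines. This is only a sketch using Python's standard library; the function name, the whitespace-collapsing rule and the sample name are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def display_text(xml_literal: str) -> str:
    """Render an XML literal as plain text for display.

    TEI persName elements get name-specific handling (whitespace between
    the parts is collapsed); anything else falls back to its text content.
    """
    root = ET.fromstring(xml_literal)
    text = "".join(root.itertext())
    # ElementTree reports tags as "{namespace}localname".
    if root.tag == f"{{{TEI_NS}}}persName":
        return " ".join(text.split())
    return text


literal = (
    "<persName xmlns='http://www.tei-c.org/ns/1.0'>"
    "<forename>Richard</forename>\n  <surname>Example</surname>"
    "</persName>"
)
rendered = display_text(literal)
```

The only thing the code needs to know in advance is the namespace and element name; no special-purpose datatype on the literal is required.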


--
@conal_tuohy
+61-466-324297