[crm-sig] CIMI-Harmony test

martin martin at ics.forth.gr
Tue Oct 2 14:02:54 EEST 2001


Sorry for cross-posting!

Dear CIMI friends,

In agreement with all sides, CIDOC and ICS-FORTH have assisted the CIMI-Harmony test with
mappings of the provided data to the CIDOC CRM model. We wish to express our particular gratitude for
the support we received from Harmony.

The ABC/Harmony CIMI Collaboration Project (http://www.cimi.org/public_docs/Harmony_long_desc.html)
stated, among other things:

  "1.3 Goals & Objectives 

      *   make progress on understanding how
          to get effective interoperability
          between metadata vocabularies.
      *   provide an additional dimension to the
          testing of the CIDOC CRM

   1.4 Expected Outcomes

      ......
      *   demonstrable XML database
      *   identification of deficiencies in the
          CIDOC CRM"


The basic work has been done by ICS-FORTH on a voluntary basis with its own limited resources, so the
results presented here can and will be further improved. A technical report about the mapping method
will be published in a few days.

So far we have mapped the data samples from the National Museum of Denmark (NMD), the Museum of Natural
History London (Clayton Herbarium), and Australian Museums On-Line (AMOL). We did not have the resources
to address the RLG example, but this will be done in the near future.

The semantics of all of these samples were completely covered by the CIDOC CRM. There were, however,
wide differences in the complexity and the degree of automation that could be achieved. We comment
below, in sequence, on the effort, tools, semantics and automation:

We have used a commercial tool for all transformations. The target files are XML instances of the simplest
DTD that correctly represents the CIDOC CRM semantics and allows the creation of instances structurally
equivalent to correct RDF instances of a full RDFS version of the CIDOC CRM. The target files can be read
naturally using an XSL file that makes the properties visible, and are available on-line at
http://cidoc.ics.forth.gr/data_transformations.html.

The transformation was done by Iraklis Karvasonis, a graduate student in computer science with no museum
background, assisted by me, Martin Doerr, on the data field semantics. For each example, about 2 full days
were needed to identify the sample schema-to-CRM mappings, and about a week to implement and test the
mappings. The implementation consists of a straightforward step that wraps the whole sample in an XML
instance, which takes longer for deeply nested tables, followed by the semantic mapping from XML to XML.
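
For readers who prefer to see the two steps spelled out, here is a minimal sketch in Python. It is only
an illustration: the actual work was done with a commercial tool and without programming. The row data,
the field names and the target element names ("crm_instances", "entity", "has_title" etc.) are invented
for this sketch and are not those of the DTD used in the test.

```python
import xml.etree.ElementTree as ET

# Hypothetical source rows; field names and values are invented for illustration.
ROWS = [
    {"object_id": "X-001", "title": "Example object", "maker": "Example maker"},
]

# Hypothetical mapping from source fields to target element names.
# These names are placeholders, not the DTD used in the test.
FIELD_TO_PROPERTY = {
    "object_id": "is_identified_by",
    "title": "has_title",
    "maker": "was_produced_by",
}


def wrap_rows(rows):
    """Step 1: wrap the tabular sample in a generic XML instance."""
    root = ET.Element("sample")
    for row in rows:
        record = ET.SubElement(root, "record")
        for field, value in row.items():
            ET.SubElement(record, field).text = value
    return root


def map_to_target(source_root):
    """Step 2: semantic mapping from the wrapped XML to the target XML."""
    target_root = ET.Element("crm_instances")
    for record in source_root.findall("record"):
        entity = ET.SubElement(target_root, "entity")
        for field in record:
            prop = FIELD_TO_PROPERTY.get(field.tag)
            if prop is None:
                continue  # unmapped fields are simply skipped in this sketch
            ET.SubElement(entity, prop).text = field.text
    return target_root


wrapped = wrap_rows(ROWS)
target = map_to_target(wrapped)
print(ET.tostring(target, encoding="unicode"))
```

The point of keeping the two steps apart is that the first one is purely structural and can be reused
for any table, while all schema-specific knowledge sits in the mapping table of the second step.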

A month was spent studying the tool, and a longer time experimenting with different representations and
identifying semantic errors, which could have been avoided with a domain expert on-site. Given that effort,
the NMD and Natural History London samples can be transformed without manual intervention. Not a single
line of programming was needed.

About 95% of the 2 days spent on identifying the mappings went into understanding the source schema
semantics from the interpretation of names, data examples and rare comments. This has nothing to do with
the CRM itself, except for the fact that it is precise in its meanings, as required for effective
information integration.

In detail:

The NMD data are analytical in the detail necessary to allow for completely automatic transformation. Two
default assumptions not obvious from the data could be clarified with the creator and expanded in the
CRM. As the NMD database uses dynamic types for events, a full mapping of the NMD event types to
CRM classes could have improved the mapping, but was not pursued because of resource constraints.
For an Internet presentation the data could be compressed even more if the internal NMD identifiers were
omitted. These are not required by the CRM, but were kept to show the level of detail the CRM is capable of
capturing. Individual lengthy identifiers such as "NMD System ID: 750" and "2297 Actor" were chosen
to show where global identifiers could be used in order to facilitate information integration in
very large repositories. This could have been done in a more consistent manner.

The resulting CRM data are more compact than a relational form and exhibit more explanatory
schema semantics.

The Clayton Herbarium sample is as analytical as the NMD one, even though it is encoded in one "flat"
table. This means that parallel fields must be interpreted as dependent data paths. This is more complex
but not particularly difficult. Even though the table is not in any "normal form", e.g. it assigns the same
fields once again for a second event, it can be mapped without any difficulty. The logic behind it, reasoning
about classification, is fairly complex but fully foreseen in the CRM. We did not have any support in its
interpretation, so some potential errors are due not to the CRM but to us, and must be corrected in the
future with the experts. One link NOT present in the CRM, and also NOT present in the Clayton schema,
but implicit in the data, may be useful to have in the CRM: that the specimen is the PROTOTYPE for the
creation of a species or genus.
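
To illustrate what interpreting parallel fields as dependent data paths can look like, here is a small
Python sketch. It is not the Clayton schema: the record, the field names ("det_name_1", "det_by_1" etc.)
and the numeric-suffix convention are assumptions made up for the example.

```python
# Hypothetical flat record: the same group of fields is repeated with a
# numeric suffix for a second event. All names and values are invented.
FLAT_RECORD = {
    "specimen_id": "S-001",
    "det_name_1": "Species A",
    "det_by_1": "Person X",
    "det_name_2": "Species B",
    "det_by_2": "Person Y",
}

# Field prefixes that belong to one event (assumed for this sketch).
EVENT_FIELDS = ("det_name", "det_by")


def to_dependent_paths(record):
    """Group parallel, suffixed fields into one nested event per suffix."""
    events = {}
    for key, value in record.items():
        prefix, _, suffix = key.rpartition("_")
        if prefix in EVENT_FIELDS and suffix.isdigit():
            events.setdefault(int(suffix), {})[prefix] = value
    return {
        "specimen_id": record["specimen_id"],
        "events": [events[n] for n in sorted(events)],
    }


nested = to_dependent_paths(FLAT_RECORD)
print(nested)
```

Each event then carries its own name and actor, so the second event no longer looks like an unrelated
set of columns of the same row.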

The AMOL data present a difficulty of a different kind: fields with weak semantics such as "description",
"statement" and "made note". These seem to function mainly as formatting means, in the
tradition of museum catalogs, but cannot be used to interpret semantics. We could still have done
a good job if separators had been used in a disciplined way. As the data are now,
automatic interpretation needs background knowledge: place name, person name,
organisation name, materials and object type authorities, heuristics and possibly natural language
interpretation. With these means, a fairly complete job could still be done automatically. We did not
have the resources, and have created two examples: the first is a mapping of all uninterpretable texts to
a CRM "has note" property, the second a complete interpretation by hand. The latter shows that the
meaning is completely captured, except perhaps for the "subject" field, which seems to be a heterogeneous
notion from the library world not contained in the CRM. ("Heterogeneous" means that its
interpretation with respect to the object changes depending on the object category.)
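
The difference a disciplined use of separators would make can be sketched as follows in Python. This is
a hypothetical illustration, not what was done for the AMOL sample: the "label: value; label: value"
convention, the labels and the property names other than "has note" are assumptions.

```python
def interpret_free_text(text, known_labels):
    """Return (property, value) pairs for one free-text field.

    Segments of the form "label: value", separated by ";", are interpreted
    when the label is known; everything else falls back to "has note".
    """
    pairs = []
    for segment in text.split(";"):
        segment = segment.strip()
        if not segment:
            continue
        label, sep, value = segment.partition(":")
        label = label.strip().lower()
        if sep and label in known_labels:
            pairs.append((known_labels[label], value.strip()))
        else:
            pairs.append(("has note", segment))
    return pairs


# Hypothetical labels and property names, invented for this sketch.
KNOWN_LABELS = {"material": "consists of", "maker": "was produced by"}

result = interpret_free_text(
    "Material: wood; Maker: Example maker; well preserved", KNOWN_LABELS
)
print(result)
```

Text without recognisable separators ends up entirely under "has note", which corresponds to the first
of the two examples described above.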

In summary, this test demonstrates that the CIDOC CRM adequately and effectively captures
the domain of museum data, notwithstanding minor improvements, which will be taken into account in the
CRM standardisation process. "Adequately" means that the CIDOC CRM provides expressive power comparable
to or higher than that of the source schemata. "Effectively" means that the size of the produced raw data is
comparable to the source and that there is no loss of meaning in the transformation.

The complexity of mapping is typically due to the intrinsic complexity of interpreting cultural data
sources, and is by no means introduced by the CRM.

With the AMOL data, it could be shown that the CIDOC CRM can be useful for designing and introducing
a moderate structuring to facilitate semantic interpretation, one that is easily comprehensible to end-user
documentalists.

The Clayton data show that this structuring need in no way be as complex and deep as the CRM, and that
the end user need not fully understand the CRM. All data samples show that the CRM instances are
comprehensible, even though the presented form was NOT designed for presentation, but to
render an understanding of the machine-interpretable RAW DATA themselves.

CRM instances are data ready for automatic integration, provided that persons etc. can be identified
sufficiently well globally. This is again a general problem of the integration process and not of the CRM.

The test shows that a non-domain expert with ordinary skills in handling IT tools can execute the
transformation with brief, affordable advice from a domain expert who is also knowledgeable about the CRM.

This advice is needed once per database, and not per data item, if the data are sufficiently structured.
This intellectual investment cannot be avoided in any intelligent data integration that tries to preserve
and respect the intellectual qualities of our cultural heritage information. As this investment is made
once per schema, its cost is small compared to the cost of designing and implementing the source data
structure itself. This is precisely the reason to have an International Standard: one such mapping should
be sufficient to solve global semantic interoperability.

Deficiencies of the CIDOC CRM could not be identified. We collaborate closely with Harmony on the
harmonization with the ABC model, which seems to have strengths in areas so far not addressed by
the CRM, such as performing arts, copyright issues, the evolution of electronic documents and others.
(See: http://cidoc.ics.forth.gr/crmgroup_activities.html,
   "Working Group on Ontology Harmonization:"
  and http://cidoc.ics.forth.gr/docs/rome_full_rep_v2.doc). 

We kindly invite everybody to provide us with any kind of feedback that could be useful for
our work and for achieving the best standard for all of us.

Best regards,


Martin Doerr

Chair, 
CIDOC CRM Special Interest Group.
http://cidoc.ics.forth.gr/index.html

-- 

--------------------------------------------------------------
 Dr. Martin Doerr              |  Vox:+30(81)391625          |
 Senior Researcher             |  Fax:+30(81)391609          |
 Project Leader SIS            |  Email: martin at ics.forth.gr |
                                                             |
               Centre for Cultural Informatics               |
               Information Systems Laboratory                |
                Institute of Computer Science                |
   Foundation for Research and Technology - Hellas (FORTH)   |
                                                             |
 Vassilika Vouton,P.O.Box1385,GR71110 Heraklion,Crete,Greece |
                                                             |
         Web-site: http://www.ics.forth.gr/proj/isst         |
--------------------------------------------------------------


