[Crm-sig] Pages reproduced as spreads

Jim Salmons jim.salmons at factminers.org
Sat Mar 11 00:28:17 EET 2017

To All Fellow SIG Members,


The current discussion about print-page numbers and digital item IDs is an ideal opportunity for me to update the group on my #DATeCH2017 submissions and my ongoing work toward a #GTS (Ground Truth Storage) format for magazines, newspapers, and related serial publications.


As FactMiners, our #CitizenScience research project, I am working with the good folks at PRImA, the Pattern Recognition and Image Analysis research lab at the U. of Salford. Through this collaboration I am developing FactMiners' MAGAZINE #GTS format, which is based on a #cidocCRM/FRBRoo/PRESSoo 'ontological stack.' Our goal is to provide integrated models of complex document structure and content depiction in support of eResearch and machine learning access to digital collections of historic documents. The FactMiners' MAGAZINE format is being developed as a superset of PRImA's PAGE #GTS format.


As some of you may recall from my self-introductory comments upon joining the SIG, I believe the complementary branches of "things" and "activity/time" in the CRM lend themselves to being used as an executable metamodel for software design and development, and not just as a descriptive ontology. This was among the central ideas of my first submission to this year's DATeCH conference. As we "ground truthed" datasets of Softalk magazine -- curating the table of contents, advertiser index, mastheads, etc. of this early microcomputer magazine -- we developed a number of structure- and content-revealing datasets that were based on self-referential content within the magazine.


For example, we set out to build a visual repository of Softalk advertisements based on a fine-grained pattern language expressed in PRESSoo Issuing Rules and Issuing Rule Changes. Handling the 7,158 sightings of advertisements in the 48 issues of the magazine required that we be able to transform print page number references into the "leaf" digital image IDs used by the Internet Archive, where the Softalk magazine digital collection is maintained. In fact, the print page number (ppg) to 'leaf' ID mapping became an issue with every dataset we wanted to explore.
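As a minimal sketch of what such a transformation involves (this is not the ppg2leaf_ferret code or the MAGAZINE #GTS file format), the Python below resolves advertisement sightings recorded by print page number into leaf IDs through a lookup table. The map values, the advertiser names, and the function name resolve_sightings are all hypothetical. Page numbers are held as strings on the assumption that some print page labels may not be plain integers.

```python
from typing import Dict, List, Tuple

# Hypothetical ppg2leaf map for a single issue: print page number -> leaf ID.
# Values are invented for illustration; they are not from the Softalk collection.
# Leaves 0-3 (covers, front matter) and 7-8 (an unnumbered insert) carry no
# print page number, so the offset between page and leaf is not constant.
PPG2LEAF: Dict[str, int] = {"1": 4, "2": 5, "3": 6, "4": 9, "5": 10}


def resolve_sightings(
    ppg2leaf: Dict[str, int],
    sightings: List[Tuple[str, str]],
) -> Tuple[List[Tuple[str, str, int]], List[Tuple[str, str]]]:
    """Split (advertiser, ppg) sightings into resolved and unresolved lists.

    Resolved entries gain the leaf ID; unresolved entries are those whose
    print page number has no entry in the map and so need manual curation.
    """
    resolved = []
    unresolved = []
    for advertiser, ppg in sightings:
        leaf = ppg2leaf.get(ppg)
        if leaf is None:
            unresolved.append((advertiser, ppg))
        else:
            resolved.append((advertiser, ppg, leaf))
    return resolved, unresolved


if __name__ == "__main__":
    # Advertiser names are made up for the example.
    done, todo = resolve_sightings(
        PPG2LEAF, [("Acme Software", "4"), ("Example Corp", "12")]
    )
    print(done)
    print(todo)
```

Keeping the unresolved sightings as a separate list, rather than guessing a leaf from a fixed offset, reflects the point that the page-to-leaf relationship can be discontinuous.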


This led to our second #DATeCH2017 submission, "Print-Page Number to 'Leaf' ID Mapping in Support of eResearch and Machine-Learning at the Internet Archive." In this paper we identify the foundational nature of the "ppg2leaf" tuple based on an exploration of relevant #cidocCRM functional subsystems. We document many of the situations where digitization can result in discontinuous mapping of print page numbers to digitized images. Our Softalk collection was digitized by the Internet Archive's regional scanning center and had an impressive but insufficient "ppg2leaf" map coverage of over 70%. This tuple-matching was a side effect of the vigilance of our scanners in asserting page numbers during the center's workflow ingestion process. To determine if our experience was typical, we examined 265 computer magazine collections at the Archive and found only 29 individual print page number to leaf ID tuples in over 1.4M pages contained in these collections! So, as described in our second paper, we have developed the Python-based "ppg2leaf_ferret" app as a metadata discovery and curation tool in support of eResearch and machine learning at the Internet Archive.
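To make the ideas of map coverage and discontinuous mapping concrete, here is a small hedged sketch in Python. It is not how the ppg2leaf_ferret works. It assumes each leaf is represented as a (leaf ID, asserted print page number or None) pair and, for simplicity, that asserted page numbers are integers. The function names coverage and offset_breaks and the sample data are hypothetical.

```python
from typing import List, Optional, Tuple

# Each entry is (leaf ID, asserted print page number or None if unasserted).
Leaf = Tuple[int, Optional[int]]


def coverage(leaves: List[Leaf]) -> float:
    """Fraction of leaves that have an asserted print page number."""
    if not leaves:
        return 0.0
    mapped = sum(1 for _, ppg in leaves if ppg is not None)
    return mapped / len(leaves)


def offset_breaks(leaves: List[Leaf]) -> List[int]:
    """Leaf IDs where (leaf - ppg) differs from the previous mapped leaf.

    A change in offset marks a discontinuity, for example an unnumbered
    insert scanned between two numbered pages.
    """
    breaks = []
    prev_offset = None
    for leaf, ppg in leaves:
        if ppg is None:
            continue  # unmapped leaves cannot contribute an offset
        offset = leaf - ppg
        if prev_offset is not None and offset != prev_offset:
            breaks.append(leaf)
        prev_offset = offset
    return breaks


if __name__ == "__main__":
    # Invented sample: two cover leaves, then pages 1-2, an insert, pages 3-4.
    sample = [(0, None), (1, None), (2, 1), (3, 2), (4, None), (5, 3), (6, 4)]
    print(coverage(sample))
    print(offset_breaks(sample))
```

In this invented sample, four of seven leaves are mapped and the offset changes at leaf 5, which is where a curator would look for the cause of the discontinuity.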


Using the ppg2leaf_ferret we curated the print page number to digital image ItemID dataset and created the first implementation of FactMiners MAGAZINE format #GTS metadata files. The Softalk collection is, as reported in this announcement (https://goo.gl/XxMcqe), the first digital magazine collection at the Internet Archive to provide a set of magazine-specific Ground Truth Storage metadata files including the all-important ppg2leaf_map. At the above announcement link, you will find links to ResearchGate.net pre-prints of our #DATeCH submissions, embedded video project updates showing progress in the development of the "ferret" app, and links to the initial release of our publication and issue level MAGAZINE #GTS files in the Softalk collection at the Archive.


For SIG members who may not have a free ResearchGate user account, here are links on my OneDrive cloud storage to our #DATeCH2017 submissions:




and here:




This is an already overly long update, so I will wrap up my current contribution to this print page number and itemID conversation. But as will become clear to any SIG member who explores the links provided above, I will soon be asking for some "best practice" modeling recommendations about how to support the linkage between an instance-specific MAGAZINE #GTS metamodel and its cited reference models (in our initial case, the #cidocCRM/FRBRoo/PRESSoo 'stack').


As Always Happy-Healthy Vibes to All,

-: Jim :-


    Jim Salmons 

    Twitter: @Jim_Salmons


    www.FactMiners.org (Our #CitizenScience project)

    www.SoftalkApple.com (Our #DigitalHistory project)

    www.medium.com/@Jim_Salmons/ (my #CognitiveComputing/#DigitalHumanities articles)


[[Note: snipping the long thread of interesting conversation which will be reflected in the list archive]]
