Wednesday, September 14th




08:45-09:30 Session: Registration
09:30-09:45 Session 1: Welcome and Introduction
09:45-10:45 Session 2: Keynote I


Research Infrastructures, or How Document Engineering, Cultural Heritage, and Digital Humanities Can Go Together

ABSTRACT. Research Infrastructures (RIs) are one of the key concepts in Horizon 2020, the European Commission’s Research programme. 2.7 billion EUR are available for projects under the RI programme. The talk will describe some of the main characteristics of RIs and introduce the H2020 Recognition and Enrichment of Archival Documents (READ) project, which is dedicated to setting up a highly specialized service platform and making available some of the state-of-the-art technology in pattern recognition and document engineering, namely Handwritten Text Recognition, Automatic Writer Identification, and Keyword Spotting. Archives and libraries, as well as humanities scholars and the general public, will be enabled to use the service platform, which will improve access to cultural heritage, advance research in the humanities and encourage a broad audience to investigate their personal family history.

10:45-11:15 Coffee Break

11:15-12:30 Session 3: Layouts and Publishing


A general framework for globally optimized pagination

ABSTRACT. Pagination problems deal with questions around transforming a source text stream into a formatted document by dividing it up into individual columns and pages, including adding auxiliary elements that have some relationship to the source stream data (such as figures or footnotes) but may allow a certain amount of variation in placement.

Traditionally the pagination problem has been approached by separating it into one of micro-typography (e.g., breaking text into paragraphs, also known as h&j) and one of macro-typography (e.g., taking a galley of already formatted paragraphs and breaking them into columns and pages) without much interaction between the two.

While early solutions for both problem spaces used simple greedy algorithms, Knuth and Plass introduced in the ’80s a global-fit algorithm for line breaking that optimizes the breaks across the whole paragraph [1]. This algorithm was implemented in TeX82 and has since kept its crown as the best available solution for this space.

However, for macro-typography there has been no (successful) attempt to provide globally optimized page layout: all systems to date (including TeX) use greedy algorithms for pagination. Various problems in this area have been researched (e.g., [2-5]) and the literature documents some prototype development. But none of these prototypes have been made widely available to the research community or ever made it into a generally usable and publicly available system.

This paper presents a framework for a global-fit algorithm for page breaking based on the ideas of Knuth/Plass. It is implemented in such a way that it is directly usable without additional executables with any modern TeX installation. It therefore can serve as a test bed for future experiments and extensions in this space. At the same time a cleaned-up version of the current prototype has the potential to become a production tool for the huge number of TeX users world-wide. The paper also discusses two already implemented extensions that increase the flexibility of the pagination process: the ability to automatically consider existing flexibility in paragraph length (by considering paragraph variations with different numbers of lines [6]) and the concept of running the columns on a double spread a line long or short. The paper concludes with a discussion of the overall approach, its inherent limitations and directions for future research.

[1] D. E. Knuth and M. F. Plass. Breaking paragraphs into lines. Software – Practice and Experience, 11(11):1119-1184, 1981.

[2] A. Brüggemann-Klein, R. Klein, S. Wohlfeil. On the pagination of complex documents. In Computer Science in Perspective: Essays Dedicated to Thomas Ottmann, R. Klein, H.-W. Six, L. Wegner (eds). Springer: Heidelberg, 2003; 49-68.

[3] C. Jacobs, W. Li, and D. Salesin. Adaptive document layout via manifold content. In Proceedings of Workshop on Web Document Analysis, Edinburgh, 2003.

[4] A. Holkner. Global multiple objective line breaking. Master’s Thesis, School of Computer Science and Information Technology, RMIT University, Melbourne, Australia, 2006.

[5] P. Ciancarini et al. High-quality pagination for publishing. Software – Practice and Experience, 2012; 42:733-751.

[6] T. Hassan and A. Hunter. Knuth-Plass revisited: Flexible line-breaking for automatic document layout. In Proceedings of the 2015 ACM Symposium on Document Engineering, Lausanne 2015.
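
To illustrate the difference between greedy and global-fit breaking mentioned in the abstract above, here is a minimal Python sketch. It is not the paper's framework and not the full Knuth/Plass algorithm: it breaks words into lines (not material into pages), ignores stretch, shrink, penalties and hyphenation, and uses an assumed cost of squared unused space per line with a free last line. The function names `greedy_lines` and `break_lines` are hypothetical.

```python
def greedy_lines(words, width):
    """First-fit breaking: put each word on the current line if it still fits."""
    lines, current = [], ""
    for word in words:
        candidate = word if not current else current + " " + word
        if len(candidate) <= width or not current:
            current = candidate
        else:
            lines.append(current)
            current = word
    if current:
        lines.append(current)
    return lines


def break_lines(words, width):
    """Global-fit breaking: minimise the total squared unused space.

    Assumed cost model (a simplification): each line except the last
    costs (width - line_length) ** 2; the last line is free.
    """
    if any(len(word) > width for word in words):
        raise ValueError("a word is longer than the line width")
    n = len(words)
    best = [float("inf")] * (n + 1)  # best[i]: minimal cost of setting words[i:]
    nxt = [n] * (n + 1)              # nxt[i]: index of the first word of the next line
    best[n] = 0.0
    for i in range(n - 1, -1, -1):
        length = -1  # offsets the space counted before the first word
        for j in range(i, n):
            length += 1 + len(words[j])
            if length > width:
                break
            cost = 0.0 if j == n - 1 else float((width - length) ** 2)
            if cost + best[j + 1] < best[i]:
                best[i] = cost + best[j + 1]
                nxt[i] = j + 1
    lines, i = [], 0
    while i < n:
        lines.append(" ".join(words[i:nxt[i]]))
        i = nxt[i]
    return lines
```

For the words `aaa bb cc ddddd` and a width of 6, the greedy version fills the first line completely and leaves `cc` alone on the second, while the global version accepts a slightly short first line to get a better second one. The same trade-off, applied to columns and pages instead of lines, is what a global-fit pagination algorithm would have to weigh.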


Aesthetic Measures for Document Layouts: Operationalization and Analysis in the Context of Marketing Brochures

ABSTRACT. Designing layouts that are perceived as pleasant by the viewer is no easy task: it requires a wide variety of skills, including a sense for aesthetics. When a large number of documents, each with different content, has to be created, one of the bottlenecks is creating appealing layouts for these documents manually. Automating aesthetic layout creation is therefore increasingly important. A prerequisite for this is a set of automatable algorithms for measuring aesthetics. While the literature proposes basic theoretical fundamentals and mathematical formulas for aesthetic measures, potential solutions for operationalizing them have not yet been reported. We present the challenges and lessons learned from operationalizing 36 aesthetic measures, derived from the literature, for the context of marketing brochures. The aesthetics of 744 brochure pages of ten major retailers were measured. We found very strong and highly significant correlations between at least 11 of the aesthetic measures, indicating that they represent 5 latent aesthetic concepts. Still, most of the measures were found to be independent in our sample and cover a wide range of different aesthetic concepts. Nevertheless, our results suggest that retailers optimize especially for some of the measures. In terms of the aesthetic measures, they seem to design all brochure pages in the same way, irrespective of product category or page type. We propose to consider the quality values derived from the analysis of the measured brochures as a target vector for the automated layout creation of aesthetic marketing brochures.


METIS: A Multi-faceted Hybrid Book Learning Platform

ABSTRACT. Today, students are offered a wide variety of alternatives to printed material for the consumption of educational content. Previous research suggests that, while digital content has its advantages, printed content still offers benefits that cannot be matched by digital media. This paper introduces the Meaningful Education and Training Information System (METIS), a multi-faceted hybrid book learning platform. The goal of the system is to provide an easy digital-to-print-to-digital content creation and reading service. METIS incorporates technology for layout, personalization, co-creation and assessment. These facilitate and, in many cases, significantly simplify common teacher/student tasks. Our system has been demonstrated at several international education events, partner engagements, and pilots with local universities and high schools. We present the system and discuss how it enables hybrid learning.

12:45-14:15 Lunch Break (incl. BoF)

BoF begins at 13:30 in the dining area

14:15-15:30 Session 5: XML & Data Modelling


Digital Preservation Based on Contextualized Dependencies

ABSTRACT. Most existing efforts in digital preservation have focused on extending the life of documents beyond their period of creation, without taking into account the intentions and assumptions made. However, in a continuously evolving setting, knowledge about the context of documents is nearly mandatory for their continued understanding, use, care, and sustainable governance. In this work we propose a method that considers the preservation of a number of interdependent digital entities, including documents, in conformance with context-related information. A change that influences one of these objects can be propagated to the rest of the objects via analysis of their represented dependencies. We propose to represent dependencies not only as simple links but as complex, semantically rich constructs that encompass context-related information. We illustrate with a case study how this method can aid in fine-grained, context-aware change propagation and impact analysis.
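
As a rough illustration of context-aware change propagation over dependencies, the following Python sketch walks a small dependency list and collects the entities affected by a change. It is only a sketch under strong assumptions: the abstract does not specify the dependency model, so the triple layout `(source, target, contexts)`, the function name `impacted`, and the example file names are all hypothetical, and a set of context labels stands in for the semantically rich constructs the authors describe.

```python
from collections import deque


def impacted(dependencies, changed, context):
    """Return every entity a change to `changed` can reach in `context`.

    `dependencies` is a list of (source, target, contexts) triples, read as
    "target depends on source whenever the current context is in contexts".
    This layout is an assumption made for illustration only.
    """
    affected = set()
    queue = deque([changed])
    while queue:
        node = queue.popleft()
        for source, target, contexts in dependencies:
            if source != node or context not in contexts:
                continue  # unrelated edge, or dependency inactive in this context
            if target != changed and target not in affected:
                affected.add(target)
                queue.append(target)
    return affected


# Hypothetical example data.
DEPENDENCIES = [
    ("schema.xsd", "report.xml", {"archive", "edit"}),
    ("report.xml", "report.pdf", {"edit"}),
    ("font.otf", "report.pdf", {"archive"}),
]
```

With this data, a change to `schema.xsd` reaches both `report.xml` and `report.pdf` in the `edit` context but only `report.xml` in the `archive` context, which is the kind of fine-grained distinction a plain link between objects cannot express.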


Schema-aware extended Annotation Graphs

ABSTRACT. Multistructured (M-S) documents were introduced as an answer to the need for ever more expressive data models for scholarly annotation, as experienced in the frame of Digital Humanities. Many proposals go beyond XML, which is the gold standard for annotation, and allow the expression of multilevel, concurrent annotation. However, most of them lack support for algorithmic tasks like querying and validation, despite those being central in most of their application contexts. In this paper, we focus on two aspects of annotation: data model expressiveness and validation. We introduce extended Annotation Graphs (eAG), a highly expressive graph-based data model, fit for the enrichment of multimedia resources. Regarding validation of M-S documents, we identify algorithmic complexity as a limiting factor. We advocate that this limitation may be bypassed provided validation can be checked by construction, that is, by constraining the shape of data during its very manufacture. As far as we know, no existing validation mechanism for graph-structured data meets this goal. We define here such a mechanism, based on the simulation relation, to some extent following a track initiated in Dataguides. We prove that, thanks to this mechanism, the validity of M-S data regarding a given schema can be guaranteed without any algorithmic check.


NCM 3.1: A Conceptual Model for Hyperknowledge Document Engineering

ABSTRACT. Currently, there is a semantic gap in the context of multimedia documents and content meaning. Most multimedia documents available today are agnostic to data semantics, and their specification languages offer little to ease authoring or to give players mechanisms for retrieving and presenting meaningful content to improve user experience. In this paper, we present the main entities of version 3.1 of the Nested Context Model (NCM), which concentrates efforts on integrating support for enriched knowledge description into the model. This extension enables the specification of relationships between knowledge descriptions in the traditional hypermedia way, composing what we call hyperknowledge in this paper. The previous version, NCM 3.0, is a conceptual model for hypermedia document engineering. NCL (Nested Context Language), which is part of international standards and ITU recommendations, is an XML application language that was engineered according to NCM 3.0 definitions. The NCM extensions discussed in this paper not only contribute to advances in the NCL specifications but, mainly, serve as a conceptual model for hyperknowledge document engineering.

15:30-16:00 Coffee Break

16:00-17:00 Session 6: ProDoc: Doctoral Consortium


A Language Theoretical Framework For The Integration Of Arts And Humanities Research Data
SPEAKER: Tobias Gradl

ABSTRACT. The research data landscape of the arts and humanities is composed of a large and diverse set of collections containing digital or digitized resources of particular academic contexts. The presented dissertation project is based on the primary hypothesis that there are implicit and explicit rules, which define data in their generative context, the domain knowledge. Such rules can be defined in terms of Domain Specific Languages, which allow the explication of knowledge about data, while being (1) expressive enough for many use cases and (2) based on a solid formal foundation. The resulting framework can be applied to real information needs in the arts and humanities and especially separate logical from technical aspects of data integration.


Towards supporting multimodal and multiuser interactions in multimedia languages
SPEAKER: Alan Guedes

ABSTRACT. Multimedia languages, e.g. HTML, SMIL, and NCL (Nested Context Language), are declarative programming languages focused on specifying multimedia applications using media and time abstractions. Traditionally, those languages focus on synchronizing a multimedia presentation and on supporting limited user interactions. Their support for media abstractions aims at graphical user interfaces (GUIs) by offering elements such as text (e.g. HTML’s <p>, SMIL’s <text>), graphics (e.g. <img>), and videos (e.g. HTML’s <video>). Additionally, their support for user interactions also aims at GUIs, as they offer abstractions only for mouse (e.g. HTML’s onClick, NCL’s onSelection) and keyboard (e.g. HTML and SMIL’s keyPress, and NCL’s onKeySelection) input. However, current advances in recognition technologies, e.g. speech, touch and gesture recognition, have given rise to a new class of multimodal user interfaces (MUIs) and to the possibility of developing multiuser-aware multimedia systems. Throughout our research, we argue that multimedia language models should take advantage of those new possibilities, and we propose to extend existing multimedia languages with multimodal and multiuser abstractions.

17:00-18:15 Session 7: Text Analysis I: Similarity



Using a Dictionary and n-gram Alignment to Improve Fine-grained Cross-Language Plagiarism Detection

ABSTRACT. The Web offers fast and easy access to a wide range of documents in various languages, and translation and editing tools provide the means to create derivative documents fairly easily. This leads to the need to develop effective tools for detecting cross-language plagiarism. Given a suspicious document, cross-language plagiarism detection comprises two main subtasks: retrieving documents that are candidate sources for that document and analyzing those candidates one by one to determine their similarity to the suspicious document. In this paper we focus on the second subtask and introduce a novel approach for assessing cross-language similarity between texts in order to detect cases of plagiarism. Our proposed approach has two main steps: a vector-based retrieval framework that focuses on high recall, followed by a more precise similarity analysis based on dynamic text alignment. Experiments show that our method outperforms the best-performing methods in PAN-PC-2012 and PAN-PC-2014 in terms of plagdet score. We also show that aligning n-gram units, instead of aligning complete sentences, improves the accuracy of detecting plagiarism.


Relaxing Orthogonality Assumption in Conceptual Text Document Similarity

ABSTRACT. By reflecting the degree of closeness or separation of documents, text similarity measures play a key role in text mining. Traditional measures, for instance cosine similarity in the bag-of-words model, assume documents are represented in an orthogonal space formed by words. Words are assumed to be independent of each other, and lexical overlap among documents indicates their similarity. This assumption is also made in the bag-of-concepts representation of documents. This paper proposes new semantic similarity measures that do not rely on the orthogonality assumption. By employing Wikipedia as an external resource, we introduce five similarity measures using concept-concept relatedness. Experimental results on real text datasets reveal that eliminating the orthogonality assumption improves the quality of text clustering algorithms.
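
To make the orthogonality assumption concrete, the Python sketch below compares plain cosine similarity with a generalised form in which a relatedness matrix weights every pair of dimensions. This is a generic "soft cosine" formulation chosen for illustration; the abstract does not define its five measures, so nothing here should be read as one of them. The function name `soft_cosine`, the three example concepts, and the relatedness value 0.9 are assumptions (the paper derives relatedness from Wikipedia; the numbers here are invented).

```python
import math


def soft_cosine(x, y, relatedness):
    """Cosine-like similarity with cross-dimension weights.

    relatedness[i][j] is the assumed relatedness of concepts i and j.
    With the identity matrix this reduces to ordinary cosine similarity.
    """
    def bilinear(a, b):
        return sum(a[i] * relatedness[i][j] * b[j]
                   for i in range(len(a)) for j in range(len(b)))

    denominator = math.sqrt(bilinear(x, x)) * math.sqrt(bilinear(y, y))
    return bilinear(x, y) / denominator if denominator else 0.0


# Dimensions: "car", "automobile", "banana" (hypothetical concepts).
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
RELATED = [[1.0, 0.9, 0.0], [0.9, 1.0, 0.0], [0.0, 0.0, 1.0]]  # invented values

doc_a = [1.0, 0.0, 0.0]  # mentions only "car"
doc_b = [0.0, 1.0, 0.0]  # mentions only "automobile"
```

Under the orthogonal (identity) setting the two documents share no dimension and score 0, even though they are about the same thing; with the relatedness matrix the score becomes 0.9. That gap is the effect the abstract attributes to dropping the orthogonality assumption.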


Enhancing the Searchability of Page-Image PDF Documents Using an Aligned Hidden Layer from a Truth Text

ABSTRACT. The search accuracy achieved in a PDF image-plus-hidden-text (PDF-IT) document depends upon the accuracy of the optical character recognition (OCR) process that produced the searchable hidden text layer. In many cases recognising words in a blurred area of a PDF page image may exceed the capabilities of an OCR engine. This paper describes a project to replace an inadequate hidden textual layer of a PDF-IT file with a more accurate hidden layer produced from a ‘truth text’. The alignment of the truth text with the image is guided by using OCR-provided page-image co-ordinates, for those glyphs that are correctly recognised, as a set of fixed location points between which other truth-text words can be inserted, and aligned with blurred glyphs in the image. Results are presented to show the much enhanced searchability of this new file when compared to that of the original file, which had an OCR-produced hidden layer with no truth-text enhancement.
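
The idea of using correctly recognised words as fixed points and placing the remaining truth-text words between them can be sketched in Python as follows. This is a deliberately reduced illustration, not the project's method: it works on one text line with a single x co-ordinate per word instead of full page-image boxes, it matches whole words with the standard library's `difflib.SequenceMatcher`, and it assumes even spacing between anchors. The function name `align_truth` and its parameters are hypothetical.

```python
from difflib import SequenceMatcher


def align_truth(ocr_words, ocr_x, truth_words, line_start=0.0, line_end=1.0):
    """Assign an x-position to every truth word on one text line.

    Truth words that exactly match an OCR word inherit its position and act
    as anchors; the others are spread evenly between neighbouring anchors
    (an assumed placement rule). line_start / line_end are used only when
    a gap touches the beginning or end of the line.
    """
    positions = [None] * len(truth_words)
    matcher = SequenceMatcher(None, ocr_words, truth_words, autojunk=False)
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            positions[block.b + k] = ocr_x[block.a + k]

    i = 0
    while i < len(positions):
        if positions[i] is not None:
            i += 1
            continue
        j = i
        while j < len(positions) and positions[j] is None:
            j += 1  # j is now the first anchored index after the gap
        left = positions[i - 1] if i > 0 else line_start
        right = positions[j] if j < len(positions) else line_end
        slots = j - i + 1
        for k in range(i, j):
            positions[k] = left + (right - left) * (k - i + 1) / slots
        i = j
    return positions
```

For example, if the OCR engine read `the qu1ck br0wn fox` at x-positions 0, 10, 20 and 30 and the truth text is `the quick brown fox`, then `the` and `fox` serve as anchors and `quick` and `brown` are placed at 10 and 20 between them, so a search for either word would now hit the right region of the page image.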