The P2P-IS project of L3S investigates information retrieval and data mining in large-scale distributed networks, such as P2P networks, desktop sharing networks, and digital libraries. The simulated networks typically comprise millions of nodes and hold several gigabytes of information. We studied and proposed algorithms for distributed indexing, clustering, classification, and near-duplicate detection, covering both text and multimedia data. Due to the scale of these simulations, the RRZN cluster has been very important for our work.
Publications:
Word sense discrimination is an important first step towards the automatic detection of language evolution within large, historic document collections. By comparing the discovered word senses over time, important information can be revealed and used to improve the understanding and accessibility of a digital archive. Existing algorithms for word sense discrimination have been developed with today's language in mind and have thus been evaluated on carefully selected, modern datasets. The quality of the word senses found in the discrimination step has a large impact on the detection of language evolution. Therefore, as a first step, we verify that word sense discrimination can successfully be applied to digitized historic documents and that the results correctly correspond to word senses. Because the accessibility of digitized historic collections is also influenced by the quality of the optical character recognition (OCR), as a second step we investigate the effects of OCR errors on word sense discrimination results. All evaluations are performed on The Times Archive, a collection of newspaper articles from 1785 to 1985. This work is part of an ongoing project.
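As an illustration only (not the algorithm developed in this project), the following sketch shows the general idea behind word sense discrimination: occurrences of a target word are represented by their surrounding context words and clustered, with each cluster approximating one induced sense. The use of scikit-learn, the toy contexts, and the choice of k-means are assumptions made for this example.

# Illustrative sketch of context-clustering word sense discrimination.
# Not the project's algorithm: occurrences of a target word are represented
# by their context words and clustered; each cluster approximates one sense.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans

def discriminate_senses(snippets, n_senses=2):
    """snippets: context windows around one target word, e.g. from OCRed articles."""
    vectors = CountVectorizer(stop_words="english").fit_transform(snippets)
    labels = KMeans(n_clusters=n_senses, n_init=10, random_state=0).fit_predict(vectors)
    return labels  # cluster id per occurrence, i.e. the induced word sense

# Toy usage: two senses of "bank" (financial institution vs. river bank).
contexts = [
    "deposited money at the bank in london",
    "the bank raised its interest rate",
    "walked along the bank of the thames",
    "fishing from the river bank at dawn",
]
print(discriminate_senses(contexts, n_senses=2))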
We analyzed a large snapshot of the Semantic Web, comprising 1 billion RDF statements and 182 million distinct URIs, and found that URIs typically follow a Prefix-Infix(-Suffix) schema. The Prefix is indicative of the source, the Infix is typically a (global) identifier, whereas the optional Suffix usually corresponds to the format of the resource (e.g., .rdf). Consequently, the Infix constitutes the most distinguishing part of a URI and can be used to match entities. Indeed, our thorough experimental study showed that considering entities that share the same Infix as duplicates achieves an F-Measure of over 77%.
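A minimal sketch of how the Infix can be used for matching, under a simplified decomposition: the last path segment or fragment of the URI is taken as the Infix and a known format suffix is stripped. The actual decomposition rules and suffix list used in the study are more elaborate; the ones below are illustrative assumptions.

# Simplified illustration of Prefix-Infix(-Suffix) style matching.
# The infix is approximated here by the last path segment or fragment of the
# URI, with an assumed format suffix (e.g. ".rdf") stripped off.
from collections import defaultdict

FORMAT_SUFFIXES = (".rdf", ".xml", ".html", ".ttl")  # example suffixes, not the study's list

def infix(uri: str) -> str:
    uri = uri.rstrip("/")
    part = uri.rsplit("#", 1)[-1].rsplit("/", 1)[-1]  # fragment or last path segment
    for suffix in FORMAT_SUFFIXES:
        if part.endswith(suffix):
            part = part[: -len(suffix)]
    return part.lower()

def match_by_infix(uris):
    """Group URIs sharing the same infix; each group is a duplicate candidate set."""
    groups = defaultdict(list)
    for uri in uris:
        groups[infix(uri)].append(uri)
    return [g for g in groups.values() if len(g) > 1]

print(match_by_infix([
    "http://dbpedia.org/resource/Berlin",
    "http://example.org/data/Berlin.rdf",
    "http://example.org/data/Hamburg",
]))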
Publications:
Entity Resolution is the process of identifying, among a set of entities, those that refer to the same real-world object. At its core, it is a quadratic task; to make it scalable, blocking techniques are typically employed. They group entities into blocks and compare only the entities contained in the same block. However, existing blocking techniques are not applicable to the user-generated data of the Web 2.0, since it exhibits high levels of noise and unprecedented heterogeneity (i.e., a very large number of different attribute names). The reason is that they typically rely on a priori selected attribute names to cluster entities into blocks, which is not feasible in these information spaces due to their heterogeneity. In this work, we propose a novel blocking technique that is attribute-agnostic (i.e., it ignores attribute names) and relies exclusively on attribute values to achieve high effectiveness. It is based on the principle that each pair of duplicates typically has (at least) one value in common. Blocks are thus built on the tokens of attribute values, and each entity is placed in multiple blocks. This results in a high number of comparisons and, consequently, in low efficiency. To ameliorate this problem, we further propose a set of efficiency techniques that discard unnecessary comparisons: block purging, block scheduling, duplicate propagation, and block pruning. The overall method is scalable, requiring around 100 comparisons per entity on average.
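The core blocking step can be sketched as follows. This is a simplified illustration of the attribute-agnostic principle described above: attribute names are ignored and a block is created for every token appearing in the attribute values, so each entity is placed in several blocks. The entity-profile representation and helper names are assumptions; block purging, block scheduling, duplicate propagation, and block pruning are not reproduced here.

# Minimal sketch of attribute-agnostic (token) blocking.
# Entity profiles are dicts of attribute name -> value; attribute names are
# ignored, and one block is created per token of the attribute values.
from collections import defaultdict

def token_blocking(entities):
    """entities: {entity_id: {attr: value}}; returns token -> set of entity ids."""
    blocks = defaultdict(set)
    for eid, profile in entities.items():
        for value in profile.values():
            for token in str(value).lower().split():
                blocks[token].add(eid)
    # keep only blocks that actually generate comparisons
    return {tok: ids for tok, ids in blocks.items() if len(ids) > 1}

entities = {
    "e1": {"name": "John Smith", "city": "Hannover"},
    "e2": {"fullName": "Smith, John", "location": "Hannover"},
    "e3": {"title": "Data cleaning"},
}
for token, ids in token_blocking(entities).items():
    print(token, sorted(ids))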
Publications: