L3S Research Center

L3S Research Center website

Projects:

  • P2P-IS
  • On the Applicability of Word Sense Discrimination on 201 Years of Modern English
  • The Missing Links: Discovering Hidden Same-as Links among a Billion of Triples
  • Efficient Entity Resolution for Large Heterogeneous Information Spaces

P2P-IS

Odysseas Papapetrou

The P2P-IS project of L3S investigates information retrieval and data mining in large-scale distributed networks, such as P2P networks, desktop sharing networks, and digital libraries. The simulated networks typically comprise millions of nodes and hold several gigabytes of information. We studied and proposed algorithms for distributed indexing, clustering, classification, and near-duplicate detection over text and multimedia data. Because of the scale of the simulations, the RRZN cluster has been very important for our work.

Publications:

  • Odysseas Papapetrou, Wolf Siberski, Wolfgang Nejdl:
    PCIR: Combining DHTs and peer clusters for efficient full-text P2P indexing. Computer Networks 54(12): 2019-2040 (2010)
  • Odysseas Papapetrou, Wolf Siberski, Wolfgang Nejdl:
    Cardinality estimation and dynamic length adaptation for Bloom filters. Distributed and Parallel Databases 28(2-3): 119-156 (2010)
  • Odysseas Papapetrou, Wolf Siberski, Norbert Fuhr:
    Text Clustering for Peer-to-Peer Networks with Probabilistic Guarantees. ECIR 2010: 293-305
  • Odysseas Papapetrou, Sukriti Ramesh, Stefan Siersdorfer, Wolfgang Nejdl:
    Optimizing Near Duplicate Detection for P2P Networks. Peer-to-Peer Computing 2010: 1-10

Verifying the Applicability of Word Sense Discrimination on a Collection of Modern English

Gideon Zenz, Thomas Risse, Nina Tahmasebi and Kai Niklas

Word sense discrimination is the first important step towards automatic detection of language evolution within large, historic document collections. Comparing the word senses found over time can reveal important information, which can be used to improve the understanding and accessibility of a digital archive. Algorithms for word sense discrimination have been developed with today's language in mind and have thus been evaluated on well-selected, modern datasets. The quality of the word senses found in the discrimination step has a large impact on the detection of language evolution. Therefore, as a first step, we verify that word sense discrimination can be applied successfully to digitized historic documents and that the results correctly correspond to word senses. Because the accessibility of digitized historic collections is also influenced by the quality of optical character recognition (OCR), as a second step we investigate the effects of OCR errors on word sense discrimination results. All evaluations are performed on The Times Archive, a collection of newspaper articles from 1785-1985. This work is done as part of an ongoing project.

The Missing Links: Discovering Hidden Same-as Links among a Billion of Triples

George Papadakis

We analyzed a large snapshot of the Semantic Web, comprising 1 billion RDF statements and 182 million distinct URIs, and found that URIs typically follow a Prefix-Infix(-Suffix) schema. The Prefix is indicative of the source, the Infix is typically a (global) identifier, and the optional Suffix usually corresponds to the format of the resource (e.g., .rdf). Consequently, the Infix constitutes the most distinguishing part of a URI and can be used to match entities. Indeed, our thorough experimental study showed that treating entities that share the same Infix as duplicates yields an F-Measure of over 77%.
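
The Prefix-Infix(-Suffix) idea can be illustrated with a minimal Python sketch. It is not the decomposition used in the study, whose exact rules are not given here. It assumes a simple heuristic: a trailing format extension from a small, made-up list is the Suffix, the text after the last "/" or "#" is the Infix, and everything before it is the Prefix. The function names, the extension list, and the example URIs are all hypothetical.

```python
from collections import defaultdict

# Assumed, illustrative list of format suffixes; the text above only names ".rdf".
FORMAT_SUFFIXES = (".rdf", ".html", ".xml")


def split_uri(uri):
    """Split a URI into (prefix, infix, suffix) using a simple heuristic."""
    suffix = ""
    body = uri
    for ext in FORMAT_SUFFIXES:
        if body.lower().endswith(ext):
            suffix = body[-len(ext):]
            body = body[:-len(ext)]
            break
    # Heuristic assumption: the infix is whatever follows the last '/' or '#'.
    cut = max(body.rfind("/"), body.rfind("#"))
    return body[:cut + 1], body[cut + 1:], suffix


def group_by_infix(uris):
    """Group URIs by infix and keep only groups with at least two distinct URIs."""
    groups = defaultdict(set)
    for uri in uris:
        _, infix, _ = split_uri(uri)
        if infix:  # skip URIs that end in '/' or '#'
            groups[infix].add(uri)
    return {infix: members for infix, members in groups.items() if len(members) > 1}


if __name__ == "__main__":
    # Hypothetical URIs, for illustration only.
    sample = [
        "http://example.org/people/alice42.rdf",
        "http://other.example.com/id/alice42",
        "http://example.org/people/bob7",
    ]
    for infix, members in group_by_infix(sample).items():
        print(infix, sorted(members))
```

Each resulting group is a set of candidate same-as links. A real pipeline would need more careful URI parsing than this last-segment rule.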

Publications:

  • Georgios Papadakis, Gianluca Demartini, Philipp Kärger, Peter Fankhauser:
    The Missing Links: Discovering Hidden Same-as Links among a Billion of Triples
    In the 12th International Conference on Information Integration and Web-based Applications & Services (iiWAS), November 2010, Paris, France.

Efficient Entity Resolution for Large Heterogeneous Information Spaces

George Papadakis

Entity Resolution is the process of identifying, among a set of entities, those that refer to the same real-world object. At its core it is a quadratic task, so blocking techniques are typically employed to make it scalable: they group entities into blocks and compare only the entities within each block. However, existing blocking techniques are not applicable to the user-generated data of the Web 2.0, which exhibits high levels of noise and unprecedented heterogeneity (i.e., a high number of different attributes). The reason is that they typically rely on attribute names selected a priori to cluster entities into blocks, which is not feasible in such heterogeneous information spaces. In this work, we propose a novel blocking technique that is attribute-agnostic (i.e., it ignores attribute names) and relies exclusively on attribute values to achieve high effectiveness. It is based on the principle that each pair of duplicates typically has at least one value in common. Blocks are thus built on the tokens of attribute values, and each entity is placed in multiple blocks. This results in a high number of comparisons and, consequently, in low efficiency. To ameliorate this problem, we further propose a set of efficiency techniques that discard unnecessary comparisons: block purging, block scheduling, duplicate propagation, and block pruning. The overall method is scalable, requiring on average 100 comparisons per entity.
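
The attribute-agnostic blocking step can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the published implementation: the tokenizer, the function names, and the sample entities are hypothetical. The `purge_blocks` helper shows only one plausible reading of block purging (dropping blocks above a size cap), which the text above does not define.

```python
import re
from collections import defaultdict
from itertools import combinations


def tokenize(value):
    """Lower-case a value and split it into word tokens (assumed tokenizer)."""
    return set(re.findall(r"\w+", str(value).lower()))


def token_blocking(entities):
    """Build one block per token of any attribute value.

    entities: dict mapping entity id -> dict of attribute name -> value.
    Attribute names are ignored; only the values are used.
    """
    blocks = defaultdict(set)
    for entity_id, attributes in entities.items():
        for value in attributes.values():
            for token in tokenize(value):
                blocks[token].add(entity_id)
    # A block with a single entity yields no comparison, so drop it.
    return {token: ids for token, ids in blocks.items() if len(ids) > 1}


def purge_blocks(blocks, max_size):
    """Assumed form of block purging: discard blocks larger than max_size."""
    return {token: ids for token, ids in blocks.items() if len(ids) <= max_size}


def candidate_pairs(blocks):
    """Collect the distinct entity pairs that share at least one block."""
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


if __name__ == "__main__":
    # Hypothetical entities with no attribute names in common.
    sample = {
        "e1": {"name": "John Smith", "city": "Hannover"},
        "e2": {"fullName": "Smith, John", "location": "Berlin"},
        "e3": {"title": "Jane Doe", "town": "Hannover"},
    }
    print(sorted(candidate_pairs(token_blocking(sample))))
```

Because each entity lands in one block per token, the same pair can appear in several blocks. Collecting pairs in a set avoids repeating a comparison, which hints at why the additional efficiency techniques matter.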

Publications:

  • Georgios Papadakis, Ekaterini Ioannou, Claudia Niederée, Peter Fankhauser, Efficient Entity Resolution for Large Heterogeneous Information Spaces, Proceedings of the 4th ACM International Conference on Web Search and Data Mining (WSDM), February 2011, Hong Kong, China.

 

Leibniz Universität IT Services - URL: www.rrzn.uni-hannover.de/cluster_l3s.html
 
Dr. Paul Cochrane, Last modified: 02.03.2012
Copyright Gottfried Wilhelm Leibniz Universität Hannover